Cerebras, Petuum, & MBZUAI Announced Open Source CrystalCoder

CrystalCoder-7B and LLM360 Release Enable Next Era of Open-Source Contributions Through Reproducible Methodology Available to All AI Researchers and Practitioners

Cerebras Systems, the pioneer in accelerating generative AI, and Petuum, the generative AI company focused on building transparent LLMs, in partnership with MBZUAI, launched CrystalCoder, a new 7-billion-parameter model designed for English language and coding tasks. While previous models were suited to either English or coding, CrystalCoder achieves high accuracy on both. Trained on Condor Galaxy 1, the AI supercomputer built by G42 and Cerebras, CrystalCoder-7B has been released under the new reproducible LLM360 methodology, which promotes open-source, transparent, and responsible use. CrystalCoder and the LLM360 release methodology are available now on Hugging Face.

As open-source models gain parity with closed-source LLMs in terms of performance and accuracy, leading open-source developers are increasingly releasing checkpoints and recipes to help others study LLM training and promote collaboration and education across the research community. The LLM360 methodology, developed by Petuum, MBZUAI and Cerebras, is a novel approach to advancing transparency and safety by open sourcing more of the model ingredients so that work is reproducible by others. In addition to releasing weights under the Apache 2.0 license, the model recipe, and a paper, the LLM360 methodology open sources the training code, up to 360 training checkpoints, pre-processing scripts, data buckets, and analytics tools:

Model: Releases up to 360 checkpoints across the training run
Data: Provides access to the data buckets for each checkpoint
Code: Provides pre-processing, training, inference, and analysis code (where applicable)
Metrics: All training logs, evaluations, and analysis results collected during training are publicly disclosed and aligned with the corresponding training steps and data sequence


“Cerebras is proud to be the inaugural hardware partner for LLM360 and, in partnership with Petuum, to release the first model under this methodology, CrystalCoder-7B,” said Andrew Feldman, CEO and co-founder, Cerebras Systems. “We believe that transparency and reproducibility matter as much as model quality for the safe advancement of AI. We look forward to seeing more models released to the open source in this manner.”

In coding tasks, CrystalCoder approaches StarCoder-base in accuracy, while in language tasks it is comparable to Llama and MPT-7B. The significance of this model is that it performs well on both: it is better at coding than the best language models and better at language than the best coding models. Developers previously had to choose between a coding model and a language model; CrystalCoder is built to serve both tasks.

“Petuum and MBZUAI are excited to announce the release of the CrystalCoder-7B large language model (LLM). This groundbreaking collaboration, strengthened by our partnership with Cerebras, marks a significant milestone in the field of advanced open-source LLMs. CrystalCoder stands out due to its meticulously balanced and carefully selected data sets, unparalleled performance and reliability on language and code tasks,” said Hector Liu, Head of Engineering at Petuum and LLM Team Lead at MBZUAI. “Additionally, the unique design of Cerebras’s Condor Galaxy 1 transforms previously daunting large-scale training challenges into manageable tasks, setting new standards for efficiency and effectiveness in LLM training.”

CrystalCoder-7B is the latest in a family of leading open-source models co-developed by Cerebras, including Jais 13B and Jais 30B, the best bilingual Arabic models in the world, created in partnership with Core42 and now available on Azure Cloud. In June, Cerebras released BTLM-3B-8K, the leading 3B model on Hugging Face, which offers 7B-parameter performance in a light 3B-parameter model for inference. Med42, developed with M42 and Core42, is a leading clinical LLM, trained on Condor Galaxy 1 in a weekend and surpassing MedPaLM on performance and accuracy. In March, Cerebras released the first open-source family of GPT models, named Cerebras-GPT, followed by the release of the SlimPajama dataset, the best LLM dataset with the highest training efficiency.

SOURCE: BusinessWire

Cerebras, Petuum, and MBZUAI Announce New Open-Source CrystalCoder and LLM360 Methodology to Accelerate Development of Transparent and Responsible AI Models
