Cerebras Systems and Barcelona Supercomputing Center Train Industry-Leading Multilingual Spanish Catalan English LLM

Condor Galaxy AI Supercomputer Powers FLOR-6.3B, a Catalan, Spanish and English Open-Source Model Using Novel Training Techniques

Cerebras Systems, the pioneer in accelerating generative AI, announced that the Barcelona Supercomputing Center (BSC) has completed training FLOR-6.3B, a state-of-the-art English, Spanish and Catalan large language model. FLOR-6.3B was trained in just 2.5 days on Condor Galaxy 1 (CG-1), the massive AI supercomputer built from 64 Cerebras CS-2s by Cerebras and G42. FLOR-6.3B continues Cerebras’ leading work on multilingual models, a trend that started with the introduction of Jais, the leading Arabic-English model.

Because Catalan has only a fraction of the data typically needed to train a model, innovative AI training techniques were created. Catalan and Spanish are low- and mid-resourced languages, respectively, relative to English. As explained in a recent post, BSC sought to create a model that was stronger for having three languages together, as each language is commonly spoken in Spain. In partnership with Cerebras, the BSC team explored a technique that took a fully trained LLM and adjusted its embedding layer to achieve the same result as if the model had been trained on a large data set.
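To make the idea of adjusting the embedding layer concrete, here is a minimal sketch of one common way to carry a pretrained embedding table over to a new vocabulary. It is an illustration under stated assumptions, not a description of the method BSC and Cerebras actually used: the function name `remap_embeddings`, the toy vocabularies, and the choice to initialize unseen tokens with the mean of the old embeddings are all hypothetical.

```python
from statistics import fmean


def remap_embeddings(old_vocab, old_emb, new_vocab):
    """Build an embedding table for new_vocab from a pretrained table.

    old_vocab: dict mapping token -> row index in old_emb
    old_emb:   list of rows, each a list of floats of equal length
    new_vocab: list of tokens; list position is the new row index

    Assumption (hypothetical): tokens shared by both vocabularies reuse
    their pretrained row, and all other tokens start from the column-wise
    mean of the old table.
    """
    dim = len(old_emb[0])
    # Column-wise mean of the pretrained table, used for unseen tokens.
    mean_row = [fmean(row[j] for row in old_emb) for j in range(dim)]

    new_emb = []
    reused = 0
    for token in new_vocab:
        if token in old_vocab:
            new_emb.append(list(old_emb[old_vocab[token]]))
            reused += 1
        else:
            new_emb.append(list(mean_row))
    # Also return the fraction of the new vocabulary that was reused.
    return new_emb, reused / len(new_vocab)


# Toy data, purely illustrative.
old_vocab = {"the": 0, "casa": 1, "de": 2}
old_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_vocab = ["casa", "de", "això"]

new_emb, overlap = remap_embeddings(old_vocab, old_emb, new_vocab)
print(new_emb, overlap)
```

In this toy case two of the three new tokens reuse a pretrained row, and the third starts from the mean row before any further training.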

“Even though Spanish is one of the most commonly spoken languages in the world, there is a shortage of data available on the Internet for training – and we’ve found this to be a common problem for many languages beyond English,” said Andrew Feldman, CEO and co-founder of Cerebras. “In collaboration with our partners, we have been committed to developing new methodologies for creating models where training data is underrepresented. We are proud to work with BSC on FLOR 6.3B, which is multilingual at its core and performs significantly better than competing Spanish LLMs thanks to our novel training techniques.”

FLOR is a new family of open-source models, ranging in size from 760M to 6.3B parameters, that are based on publicly released checkpoints of BLOOM. These checkpoints were previously pre-trained on 341B tokens of multilingual data, including 46 natural languages and 13 programming languages.

BLOOM-7.1B was taken as the initial checkpoint for continued pretraining because of its multilingual nature. To better adapt the model to Catalan and Spanish, a new tokenizer was trained and used in the continued pretraining process. The new tokenizer has a reduced vocabulary of 50,257 subwords, of which 66% overlap with the BLOOM vocabulary; the rest are subwords that are more prevalent in Catalan and Spanish. The smaller vocabulary also gives FLOR-6.3B fewer parameters than BLOOM-7.1B, which directly reduces the cost of inference by more than 10%.
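The link between vocabulary size and parameter count in the paragraph above can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only. The 50,257-subword vocabulary comes from the announcement; the BLOOM vocabulary size (250,880) and hidden size (4,096) are assumptions taken from the publicly released BLOOM-7.1B configuration, and the calculation assumes a single embedding matrix shared between input and output.

```python
# From the announcement.
FLOR_VOCAB = 50_257
BLOOM_PARAMS = 7.1e9  # nominal size of BLOOM-7.1B

# Assumptions not stated in the announcement (published BLOOM-7.1B config).
BLOOM_VOCAB = 250_880
HIDDEN_SIZE = 4_096


def embedding_params(vocab_size, hidden_size):
    """Parameters in one vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


# Parameters removed by shrinking the vocabulary.
saved = (embedding_params(BLOOM_VOCAB, HIDDEN_SIZE)
         - embedding_params(FLOR_VOCAB, HIDDEN_SIZE))
flor_params = BLOOM_PARAMS - saved
reduction = saved / BLOOM_PARAMS

print(f"embedding parameters removed: {saved:,}")
print(f"estimated model size: {flor_params / 1e9:.2f}B")
print(f"relative reduction: {reduction:.1%}")
```

Under these assumptions the estimate lands near 6.3B parameters, a reduction of a little over 11%, which is consistent with the figures quoted above.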

The FLOR family of models was trained using subsets of the Condor Galaxy 1 AI Supercomputer. The smaller models were trained using single Cerebras CS-2 systems, while FLOR-6.3B was trained using 16 CS-2s. Cerebras completed the entire training of FLOR-6.3B on 140 billion tokens in 2.5 days. FLOR-6.3B is open source and available for use in both research and commercial applications.
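For a rough sense of scale, the training figures above imply an average throughput that can be worked out directly. This is a simple estimate, assuming the full 2.5 days was wall-clock training time and that all 16 CS-2 systems were in use throughout; neither detail is confirmed in the announcement.

```python
# Figures from the announcement.
TOKENS = 140_000_000_000  # training tokens for FLOR-6.3B
DAYS = 2.5                # reported training time
SYSTEMS = 16              # CS-2 systems used

SECONDS_PER_DAY = 24 * 60 * 60

# Average throughput over the whole run (assumes continuous training).
seconds = DAYS * SECONDS_PER_DAY
tokens_per_second = TOKENS / seconds
tokens_per_second_per_system = tokens_per_second / SYSTEMS

print(f"aggregate: {tokens_per_second:,.0f} tokens/s")
print(f"per CS-2:  {tokens_per_second_per_system:,.0f} tokens/s")
```

Under those assumptions the run averages roughly 650,000 tokens per second in aggregate, or about 40,000 tokens per second per CS-2.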

Condor Galaxy is one of the largest AI supercomputers in the world. Built by Cerebras and its strategic partner G42, Condor Galaxy 1 comprises 64 CS-2 systems, creating a 4 exaFLOP AI supercomputer with standard support for models of up to 600 billion parameters. Condor Galaxy 1 is simple to program and entirely avoids the complexity of distributed computing. This enables customers to train large, ground-breaking models quickly, greatly reducing the time from idea to trained model.

The FLOR family of models continues Cerebras’ leadership in multilingual models. In 2023, Cerebras and Core42 co-developed Jais 13B and Jais 30B, the best bilingual Arabic models in the world, now available on Azure Cloud. Condor Galaxy has also been used to train BTLM-3B-8K, the number one 3B model on Hugging Face, which offers 7B-parameter performance in a light 3B-parameter model for inference. Med42, developed with M42 and Core42, is a leading clinical LLM, trained on Condor Galaxy 1 in a weekend and surpassing MedPaLM on performance and accuracy.

SOURCE: BusinessWire
