Cohere For AI (C4AI), the research lab affiliated with Cohere, is pleased to announce Aya, a cutting-edge, open-source, massively multilingual, generative large language model (LLM) for research, covering 101 different languages, more than double the number covered by existing open-source models. Aya helps researchers unlock the capabilities of LLMs for the many languages and cultures that have been largely overlooked by most advanced models on the market today.
Alongside the model, the team is open-sourcing the largest multilingual instruction fine-tuning dataset to date, comprising 513 million prompts and completions across 114 languages. This extensive collection incorporates rare annotations contributed by native and fluent speakers around the world, helping ensure that AI technology can serve a diverse global audience that has historically had limited access to it.
Closing the Gap in Languages and Cultural Relevance
Aya marks a paradigm shift in how the ML community approaches massively multilingual AI research, signifying not only a technical advance but also a change in where, how, and by whom research is conducted. As LLMs and AI more broadly have reshaped the global technological landscape, many communities around the world have been left unsupported because of the language constraints of prevailing models. This gap limits the applicability and usefulness of generative AI globally and risks widening the disparities created by previous waves of technological progress. By training primarily on English and a small set of other languages, most models inadvertently reflect inherent cultural biases. The Aya project was initiated to address this gap, bringing together more than 3,000 independent researchers from 119 countries.
Significantly Outperforms Existing Open-Source Multilingual Models
The research team behind Aya has achieved substantial performance gains for underserved languages, demonstrating superior capabilities on complex tasks such as natural language understanding, summarization, and translation across a broad linguistic spectrum. Benchmarked against available open-source massively multilingual models, Aya outperforms the strongest of them, mT0 and Bloomz, by a considerable margin. In human evaluations against other leading open-source models, Aya was preferred 75% of the time, and it achieved win rates of 80-90% across a range of simulated evaluation scenarios. Furthermore, Aya extends coverage to more than 50 previously underserved languages, including Somali and Uzbek. While proprietary models serve the world's most widely spoken languages well, Aya gives researchers an unprecedented open-source model for dozens of underrepresented languages.
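As a rough illustration of how researchers might prompt a model like Aya for one of these tasks, the sketch below uses the Hugging Face transformers library. The repository id "CohereForAI/aya-101", the sequence-to-sequence architecture, and the prompt wording are assumptions made for illustration; they are not details stated in this announcement.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed Hugging Face repository id; not confirmed by this article.
model_id = "CohereForAI/aya-101"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Instruction-style prompt asking for a translation into one of the
# underserved languages highlighted in the announcement.
prompt = "Translate to Somali: Access to technology should not depend on the language you speak."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))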
Trained on the Most Extensive Multilingual Dataset to Date
The Aya Collection, comprising 513 million prompts and completions covering 114 languages, is now available. It was created by fluent speakers worldwide, who crafted templates for selected datasets and augmented a carefully curated list of existing datasets. It includes the Aya Dataset, the most extensive human-annotated multilingual instruction fine-tuning dataset to date, featuring around 204,000 rare human-curated annotations by fluent speakers in 67 languages. This ensures robust and diverse linguistic coverage, offering developers and researchers a large-scale repository of high-quality language data. Many languages in the collection had no prior representation in instruction-style datasets. The fully permissive, open-sourced dataset spans a wide range of language examples, reflecting various dialects and original contributions, making it a valuable resource for multifaceted language research and linguistic preservation efforts.
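For readers who want to inspect the data, the following minimal sketch loads the human-annotated portion with the Hugging Face datasets library and counts examples per language. The repository id "CohereForAI/aya_dataset" and the column names ("inputs", "targets", "language") are assumptions about how the release is hosted, not details given in this article.

from collections import Counter
from datasets import load_dataset

# Assumed Hugging Face repository id and column names for the human-annotated
# Aya Dataset (~204,000 instruction/completion pairs in 67 languages).
aya = load_dataset("CohereForAI/aya_dataset", split="train")
print(aya[0])  # expected fields: "inputs", "targets", "language", ...

# Tally examples per language to get a sense of the linguistic coverage.
per_language = Counter(example["language"] for example in aya)
for language, count in per_language.most_common(10):
    print(f"{language}: {count}")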
SOURCE: Cohere