NVIDIA Unveils Granary: An Open Multilingual Speech AI Dataset with High-Performance Canary and Parakeet Models

AiTech365 Bureau

2 months ago

NVIDIA announced the launch of Granary, a massive open-source dataset comprising approximately 1 million hours of multilingual audio, along with two high-performance speech AI models, Canary-1b-v2 and Parakeet-tdt-0.6b-v3. These resources are designed to advance speech recognition and translation technologies across 25 European languages, including underrepresented ones such as Croatian, Estonian and Maltese.

Granary delivers around 650,000 hours of audio for automatic speech recognition (ASR) and over 350,000 hours for automatic speech translation (AST), supporting developers in building scalable, high-quality multilingual speech applications.

In collaboration with Carnegie Mellon University and Fondazione Bruno Kessler, the NVIDIA speech AI team developed an innovative processing pipeline using the NeMo Speech Data Processor toolkit. This pipeline converts unlabeled audio into structured, high-quality datasets without extensive manual annotation. The dataset's 25 languages cover nearly all of the European Union's 24 official languages, plus Russian and Ukrainian, enabling more inclusive speech AI development.


The models built on Granary are now available on Hugging Face and will be presented at Interspeech in the Netherlands (August 17–21).

Granary Highlights:

Supports fast development of production-scale applications such as multilingual chatbots, customer service voice agents, and near-real-time translation services.
Provides critical resources for languages with limited existing datasets.
Enables developers to reach target levels of ASR and AST accuracy with roughly half as much training data as other datasets require.

Model Spotlight:

NVIDIA Canary-1b-v2, a 1-billion-parameter model, provides high-quality transcription across European languages and translation between English and the other 24 supported languages. It leads the Hugging Face leaderboard for multilingual speech recognition accuracy.
NVIDIA Parakeet-tdt-0.6b-v3, a streamlined 600-million-parameter model, is optimized for real-time and large-volume transcription workloads, achieving the highest throughput among multilingual models on Hugging Face.

Both models offer features such as automatic punctuation, capitalization, and word-level timestamps. Canary-1b-v2 delivers accuracy comparable to models three times its size while running inference up to ten times faster.
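For readers who want to try the timestamp feature, the following is a minimal sketch rather than an official recipe. It assumes the NeMo toolkit is installed (for example via pip install "nemo_toolkit[asr]") and that the installed version accepts the timestamps=True argument, as in the usage pattern published with the Parakeet model on Hugging Face. The helper names format_segments and transcribe_with_timestamps are hypothetical and exist only for this illustration.

```python
from typing import Dict, List


def format_segments(segments: List[Dict]) -> List[str]:
    """Render segment-level timestamps as 'start - end : text' lines."""
    return [
        f"{seg['start']:.2f}s - {seg['end']:.2f}s : {seg['segment']}"
        for seg in segments
    ]


def transcribe_with_timestamps(audio_path: str) -> List[str]:
    """Transcribe one audio file and return formatted segment timestamps.

    Assumption: NeMo is installed and its transcribe() call supports
    timestamps=True; behaviour may differ between NeMo releases.
    """
    # Imported lazily so the formatting helper works without NeMo installed.
    import nemo.collections.asr as nemo_asr

    # Downloads the checkpoint from Hugging Face on first use.
    model = nemo_asr.models.ASRModel.from_pretrained(
        model_name="nvidia/parakeet-tdt-0.6b-v3"
    )
    output = model.transcribe([audio_path], timestamps=True)
    # Each hypothesis carries a timestamp dict; the 'segment' entries
    # hold start and end times in seconds plus the segment text.
    return format_segments(output[0].timestamp["segment"])
```

Calling transcribe_with_timestamps("sample.wav"), where sample.wav is a placeholder for any local audio file, would return one line per transcribed segment. The model download and inference run only when that function is called, so the formatting helper can be used on its own.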

By sharing Granary and these models, including the underlying methodology, NVIDIA enables developers worldwide to adapt this workflow to build or enhance other ASR and AST models and to support additional languages.