Wednesday, July 2, 2025

Gretel Releases World’s Largest Open Source Text-to-SQL Dataset to Accelerate AI Model Training


 Gretel, the leader in synthetic data, today released the world’s largest open source Text-to-SQL dataset to unlock new possibilities for AI in the enterprise. Available on Hugging Face and released under the Apache 2.0 license, Gretel’s dataset consists of over 100,000 high-quality synthetic Text-to-SQL samples with SQL metadata and spans 100 verticals. With access to Gretel’s open-source, high-quality synthetic dataset, developers can train AI models that empower business users to extract value from critical enterprise data sources, expediting AI initiatives across the enterprise.

“Access to quality training data is one of the biggest obstacles to building with generative AI. Everything Gretel does is designed to address this issue head-on, and contributing to the open-source community is no exception,” said Alex Watson, co-founder & Chief Product Officer at Gretel. “By providing developers with high-quality, synthetic Text-to-SQL data, we’re enabling them to create AI models that can understand natural language queries and generate SQL queries. This empowers users across the organization to easily access and derive insights from complex databases, data warehouses, and data lakes, without needing to learn SQL or rely on technical teams. We’re excited for developers to take our dataset for a spin, and build upon it.”

Growing demand for AI training data
The largest AI companies in the world are struggling with access to high-quality training data. And in the enterprise, Text-to-SQL data — data that’s essential for building natural language interfaces to critical data sources — is in particularly high demand. Nearly every enterprise has invaluable insights buried in data tables or data views that are only accessible to developers skilled in Structured Query Language (SQL) — the standard language for interacting with databases, data warehouses and data lakes. AI models trained on Text-to-SQL data allow business users to derive value from these datasets on demand.

Most text-to-SQL datasets today are manually curated and annotated, limiting their size, applicability and utility. This process is expensive, labor intensive, and cumbersome. For instance, the Spider text-to-SQL dataset, consisting of 7k samples, was annotated by 11 college students at Yale, and took a total of 1,000 hours to complete — an incredible amount of effort for a relatively small dataset in the context of large language models.

Furthermore, the vast majority of existing Text-to-SQL datasets lack a natural language explanation of what their SQL code does. Gretel’s dataset includes an explanation field that provides a plain-English description of the SQL code, which helps end users quickly understand the output and realize its value.
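As a rough illustration of why an explanation field is useful, the sketch below pairs a question, a schema, a SQL query, and a plain-English explanation in one record, then runs the SQL against an in-memory SQLite database. The record contents, the field names (`question`, `schema`, `sql`, `explanation`), and the `run_record` helper are hypothetical and chosen for this example; they are not taken from Gretel's dataset.

```python
import sqlite3

# Hypothetical record: field names and values are illustrative assumptions,
# not the confirmed schema or contents of Gretel's dataset.
record = {
    "question": "What is the total sales amount per region?",
    "schema": "CREATE TABLE sales (id INTEGER, region TEXT, amount REAL);",
    "sql": (
        "SELECT region, SUM(amount) AS total "
        "FROM sales GROUP BY region ORDER BY region;"
    ),
    "explanation": (
        "Groups the sales rows by region and sums the amount column "
        "for each group."
    ),
}


def run_record(record, rows):
    """Build the record's schema in an in-memory SQLite database,
    load sample rows into the sales table, and run the record's SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(record["schema"])
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        return conn.execute(record["sql"]).fetchall()
    finally:
        conn.close()


# Made-up sample rows, used only to exercise the query.
sample_rows = [(1, "EMEA", 100.0), (2, "EMEA", 50.0), (3, "APAC", 70.0)]
result = run_record(record, sample_rows)

print(record["explanation"])
print(result)
```

A reader who does not know SQL can compare the explanation text with the returned rows to judge whether the query answers the original question.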

To date, the open source community has offered little reprieve. The Spider dataset, for instance, is available under a commercially permissive Creative Commons license (CC BY-SA 4.0), but it is a copyleft license, meaning a derivative work must be licensed under the same or a compatible license. This differs significantly from the MIT or Apache licenses, which allow derivative works to be released under different license terms and impose no share-alike requirement.

SOURCE: GlobeNewswire
