NVIDIA’s latest data shows that its next-generation Blackwell Ultra platform, deployed in GB300 NVL72 systems, is driving “breakthrough advances” in performance and economics for agentic AI and long-context applications, delivering up to 50× higher throughput per megawatt and up to 35× lower cost per token compared with the previous Hopper architecture, according to the company’s blog. The platform builds on widespread adoption of the existing Blackwell ecosystem by inference providers such as Baseten, DeepInfra, Fireworks AI and Together AI, which have already cut token costs by up to 10×. It now extends these gains to demanding real-time coding assistants, multistep reasoning workflows and large-context models through continuous hardware-software co-design optimization.
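To make the two headline metrics concrete, the sketch below shows how throughput per megawatt and cost per token are typically derived from system-level figures. All input numbers are hypothetical placeholders for illustration, not NVIDIA benchmark data; only the metric definitions are assumed here, while the 50× and 35× ratios come from NVIDIA’s claims above.

```python
# Hypothetical illustration of the two metrics cited above:
# throughput per megawatt and cost per token.
# Input figures are made-up placeholders, not NVIDIA benchmark data.

def throughput_per_megawatt(tokens_per_second: float, power_watts: float) -> float:
    """Tokens/s produced per megawatt of system power draw."""
    return tokens_per_second / (power_watts / 1_000_000)

def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Serving cost in USD per one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000

# Placeholder figures for a single rack-scale system (assumed):
tps = 400_000        # aggregate decode throughput, tokens/s
power_w = 120_000    # rack power draw, watts
cost_hr = 300.0      # fully loaded operating cost, USD/hour

print(f"{throughput_per_megawatt(tps, power_w):,.0f} tokens/s per MW")
print(f"${cost_per_million_tokens(cost_hr, tps):.3f} per 1M tokens")
```

Under definitions like these, a 50× throughput-per-megawatt gain and a 35× cost-per-token reduction over Hopper would scale the printed figures proportionally.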
“As inference moves to the center of AI production, long-context performance and token efficiency become critical,” said Chen Goldberg, senior vice president of engineering at CoreWeave, highlighting the industry shift toward scalable, low-latency interactive AI services. Leading cloud providers including Microsoft, CoreWeave and Oracle Cloud Infrastructure are deploying GB300 NVL72 systems at scale. The advances are enabled by relentless software improvements, spanning NVIDIA’s TensorRT-LLM and Dynamo alongside open-source frameworks such as Mooncake and SGLang, which boost throughput across a range of workloads, together with Blackwell Ultra’s enhanced compute and attention-processing capabilities, which drive efficiency even on long-context inputs and deliver up to 1.5× lower token cost than earlier GB200 NVL72 systems. While optimizing for low latency to sustain real-time responsiveness across multistep workflows, the platform positions NVIDIA to continue scaling AI inference economics; ongoing innovations, including the upcoming Rubin platform, promise further performance and cost improvements for future reasoning-heavy and autonomous AI applications.
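Why long-context inputs stress attention processing in particular follows from the standard scaling of self-attention: the score and weighted-sum matrix multiplications grow quadratically with sequence length. The sketch below estimates prefill attention compute for a hypothetical model; the layer count and hidden dimension are illustrative assumptions, not GB300 or any specific model’s specifications.

```python
# Rough estimate of self-attention FLOPs during prefill, showing the
# quadratic growth with context length that makes long-context inference
# expensive. The model shape below is a made-up illustration, not a real spec.

def attention_prefill_flops(seq_len: int, hidden_dim: int, num_layers: int) -> float:
    # Per layer: QK^T score matmul (~2 * n^2 * d FLOPs) plus the
    # attention-weighted sum over V (~2 * n^2 * d FLOPs); projection
    # matmuls, which scale only linearly in n, are omitted.
    per_layer = 2 * seq_len**2 * hidden_dim + 2 * seq_len**2 * hidden_dim
    return per_layer * num_layers

HIDDEN, LAYERS = 8192, 80  # assumed model shape

for ctx in (8_192, 32_768, 131_072):
    tflops = attention_prefill_flops(ctx, HIDDEN, LAYERS) / 1e12
    print(f"{ctx:>7} tokens -> ~{tflops:,.0f} TFLOPs of attention compute")
```

In this model, a 16× longer context costs roughly 256× more attention compute, which is why attention-specific hardware and software optimization dominates long-context inference economics.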


