In a move that signals a fundamental shift in how artificial intelligence is deployed at scale, Akamai Technologies has announced the launch of the Akamai Inference Cloud. The new offering is the first global-scale implementation of the NVIDIA AI Grid reference design, intended to move the industry from isolated, centralized “AI factories” toward a unified, distributed architecture for AI inference.
As businesses shift their focus from training large-scale models to deploying real-time AI agents and physical AI systems, centralized infrastructure has become a bottleneck. Akamai aims to address its latency, cost, and throughput constraints by applying the same networking principles that powered the content delivery revolution.
The Evolution of ‘Tokenomics’
At the center of this launch is an intelligent orchestration engine that serves as a real-time broker for AI requests. This workload-aware control plane is designed to optimize “tokenomics,” a metric-driven approach to improving cost per token, time-to-first-token, and overall system throughput.
By automatically matching workloads to the most efficient compute tier, Akamai allows enterprises to scale inference outward. This architecture enables the use of fine-tuned or sparsified models across Akamai’s massive global footprint, providing a significant performance advantage for high-volume, latency-sensitive applications.
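To make the tokenomics trade-off concrete, the sketch below shows one way a workload-aware router might choose between an edge tier and a core tier. It is a minimal illustration, not Akamai’s implementation: the tier names, cost figures, and latency numbers are hypothetical, and a real control plane would draw on live telemetry rather than static constants.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    cost_per_1k_tokens: float   # USD, illustrative only
    ttft_ms: float              # typical time-to-first-token
    tokens_per_sec: float       # sustained decode throughput

# Hypothetical numbers; real figures depend on model, hardware, and load.
EDGE = Tier("far-edge", cost_per_1k_tokens=0.40, ttft_ms=30, tokens_per_sec=60)
CORE = Tier("core",     cost_per_1k_tokens=0.15, ttft_ms=180, tokens_per_sec=120)

def route(max_ttft_ms: float, expected_tokens: int) -> Tier:
    """Pick the cheapest tier whose time-to-first-token meets the latency budget."""
    candidates = [t for t in (EDGE, CORE) if t.ttft_ms <= max_ttft_ms]
    if not candidates:
        raise RuntimeError("no tier satisfies the latency budget")
    # Tokenomics: among compliant tiers, minimize cost for the expected volume.
    return min(candidates, key=lambda t: t.cost_per_1k_tokens * expected_tokens / 1000)

print(route(max_ttft_ms=50, expected_tokens=500).name)   # -> far-edge (latency-bound)
print(route(max_ttft_ms=500, expected_tokens=500).name)  # -> core (cost wins)
```

The point of the sketch is the decision rule, not the numbers: a latency-sensitive request is forced to the edge even at a higher cost per token, while a relaxed latency budget lets the router send the work to cheaper, denser core capacity.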
Adam Karon, Chief Operating Officer and General Manager, Cloud Technology Group, Akamai, noted the strategic necessity of this shift: “AI factories have been purpose-built for training and frontier model workloads and centralized infrastructure will continue to deliver the best tokenomics for those use cases. Our AI Grid intelligent orchestration gives AI factories a way to scale inference outward leveraging the same distributed architecture that revolutionized content delivery to route AI workloads across 4,400 locations, at the right cost, at the right time.”
A Continuum of Compute: From Core to Far-Edge
The Akamai Inference Cloud is built on a full-stack NVIDIA architecture, featuring thousands of NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs and NVIDIA BlueField DPUs for hardware-accelerated networking and security. The platform offers a seamless continuum of compute:
The Far-Edge: Utilizing more than 4,400 locations, this tier processes requests at the point of user contact. It integrates serverless edge compute and semantic caching to bypass the round-trip lag of traditional clouds (see the semantic-cache sketch after this list).
Production-Grade Core: For heavy-lifting tasks such as large language models (LLMs) and multi-modal inference, Akamai provides multi-thousand GPU clusters that deliver the high-density compute required for sustained workloads.
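The semantic-caching idea mentioned above can be illustrated with a short sketch: if a new prompt is sufficiently similar to one already answered, the edge node replies from its local cache instead of forwarding the request to a distant GPU cluster. This is a toy model under assumed names (`SemanticCache`, a pluggable `embed` function, a similarity threshold); a production system would use a real embedding model and an approximate-nearest-neighbor index rather than a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Answer locally when a new prompt is close enough to a cached one,
    so the edge node can skip the round trip to a core inference tier."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # text -> vector; any embedding model works
        self.threshold = threshold
        self.entries = []           # list of (vector, answer) pairs

    def get(self, prompt):
        v = self.embed(prompt)
        best = max(self.entries, key=lambda e: cosine(v, e[0]), default=None)
        if best and cosine(v, best[0]) >= self.threshold:
            return best[1]          # cache hit: serve from the edge
        return None                 # cache miss: forward to the core tier

    def put(self, prompt, answer):
        self.entries.append((self.embed(prompt), answer))

# Toy embedding for demonstration only: bag-of-characters counts.
def toy_embed(text):
    v = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1.0
    return v

cache = SemanticCache(toy_embed)
cache.put("What are your store hours?", "We are open 9am-9pm.")
print(cache.get("what are your store hours"))   # near-identical prompt: cache hit
print(cache.get("Do you ship to Canada?"))      # unrelated prompt: None (miss)
```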
Chris Penrose, Global Vice President of Business Development for Telco at NVIDIA, highlighted the impact of this infrastructure:
“By operationalizing the NVIDIA AI Grid, Akamai is building the connective tissue for generative, agentic, and physical AI, moving intelligence directly to the data to unlock the next wave of real-time applications.”
Real-World Impact Across Industries
The AI Grid’s distributed architecture is already being put to work across several compute-intensive sectors:
Gaming: Studios are achieving sub-50-millisecond inference to power AI-driven NPC interactions that maintain player immersion.
Financial Services: Institutions are deploying real-time fraud detection and hyper-personalized marketing recommendations during the critical moments of a user session.
Media and Video: Broadcasters are leveraging the network for AI-powered live transcoding and real-time dubbing for global audiences.
Retail: Brands are implementing AI at the point of sale to enhance in-store productivity and customer experiences.
Highlighting the long-term vision for the industry, Akamai CEO Dr. Tom Leighton stated: “We believe the AI market is entering a critical transition point, the first inning of a long game to come, where inference or the execution of queries against a trained model is the new frontier. This requires purpose-built infrastructure to enable distributed low-latency, globally scalable AI at the edge with response times measured in a few tens of milliseconds.”