Industry-first AI assistant for troubleshooting AI and other new updates promise to speed development for AI engineers
Arize AI, a pioneer and leader in AI observability and LLM evaluation, debuted new capabilities to help AI developers evaluate and debug LLM systems. The launch is one of many announcements taking place at the Arize: Observe conference today, where speakers from organizations including OpenAI, Lowe’s, Mistral, Microsoft, and NATO are sharing the latest research, engineering best practices, and open source frameworks.
Arize Copilot, the industry’s first AI assistant to troubleshoot AI systems, is a new tool that surfaces relevant information and suggests actions in the Arize platform, automating complex tasks and carrying out actions to help AI engineers save time and improve app performance. Out of the box, the Copilot can help with getting model insights, optimizing prompts, building custom evaluations, and running AI search.
“Using AI to troubleshoot complex AI systems is a logical next step in the evolution of building generative AI applications, and we are proud to offer Arize Copilot to teams that want to improve the development and performance of LLM systems,” said Aparna Dhinakaran, Chief Product Officer and Co-Founder of Arize.
Other new workflows debuting today in the Arize platform promise to help engineers find issues with LLM apps once they are deployed. With AI search, for example, teams can select an example span and easily discover all similar issues (e.g., all data points where a customer is frustrated). Teams can then save those data points into a curated dataset to apply annotations, run evaluation experiments, or kick off fine-tuning workflows.
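For readers who want a concrete picture of "find everything similar to this example," the idea can be sketched as a nearest-neighbor search over embeddings. This is a minimal illustration only, not Arize's implementation or API: the span records, the `embedding` field, the `find_similar_spans` helper, and the 0.9 threshold are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_similar_spans(example, spans, threshold=0.9):
    """Return ids of spans whose embedding is close to the example's,
    most similar first. The threshold is an arbitrary illustrative value."""
    scored = [
        (cosine_similarity(example["embedding"], s["embedding"]), s["id"])
        for s in spans
        if s["id"] != example["id"]
    ]
    matches = sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)
    return [span_id for _, span_id in matches]


# Hypothetical spans with toy 3-dimensional embeddings.
spans = [
    {"id": "span-1", "embedding": [0.9, 0.1, 0.0]},  # the selected example
    {"id": "span-2", "embedding": [0.8, 0.2, 0.1]},  # points in a similar direction
    {"id": "span-3", "embedding": [0.0, 0.1, 0.9]},  # unrelated
]

# The matching ids could then be saved as a curated dataset.
curated_dataset = find_similar_spans(spans[0], spans)
```

In this toy data, only the second span clears the threshold, so it alone would land in the curated dataset for annotation or evaluation.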
Altogether, the updates make Arize a powerhouse for experimentation as well as production observability. Using Arize, AI engineers can make adjustments, such as editing a prompt template or swapping out the LLM they are using, and then see whether performance across a test dataset degrades or whether there are other impacts (e.g., on latency, retrieval, and hallucinations) before safely deploying the change into production.
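The experiment-then-deploy check described above can be pictured as a simple comparison of a baseline run against a candidate run on the same test dataset. The sketch below is an assumption-laden illustration, not the Arize platform's API: the result records, the `summarize` and `safe_to_deploy` helpers, and the tolerance values are all hypothetical.

```python
from statistics import mean


def summarize(results):
    """Aggregate per-example results into accuracy and mean latency."""
    return {
        "accuracy": mean(1.0 if r["correct"] else 0.0 for r in results),
        "latency_ms": mean(r["latency_ms"] for r in results),
    }


def safe_to_deploy(baseline, candidate, max_accuracy_drop=0.0,
                   max_latency_increase_ms=50.0):
    """Return True if the candidate does not regress beyond the tolerances.
    Both tolerances are illustrative defaults, not recommended values."""
    base, cand = summarize(baseline), summarize(candidate)
    accuracy_ok = cand["accuracy"] >= base["accuracy"] - max_accuracy_drop
    latency_ok = cand["latency_ms"] <= base["latency_ms"] + max_latency_increase_ms
    return accuracy_ok and latency_ok


# Hypothetical results on the same three test examples.
baseline = [
    {"correct": True, "latency_ms": 400},
    {"correct": True, "latency_ms": 420},
    {"correct": False, "latency_ms": 410},
]
new_prompt = [  # candidate after editing the prompt template
    {"correct": True, "latency_ms": 430},
    {"correct": True, "latency_ms": 440},
    {"correct": True, "latency_ms": 450},
]
regressed = [  # candidate that answers fewer examples correctly
    {"correct": False, "latency_ms": 400},
    {"correct": True, "latency_ms": 400},
    {"correct": False, "latency_ms": 400},
]

deploy_new_prompt = safe_to_deploy(baseline, new_prompt)
deploy_regressed = safe_to_deploy(baseline, regressed)
```

A real evaluation would likely track more signals, such as retrieval quality and hallucination rate, but the gate has the same shape: compare each metric against the baseline before promoting a change.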
Source: PRNewswire






