Weights & Biases, the AI developer platform, announced the general availability of W&B Weave at the AWS re:Invent annual conference. Weave helps developers evaluate, monitor, and iterate continuously to deliver high-quality and performant generative AI applications. Weave is a lightweight, developer-friendly toolkit that supports the entire generative AI workflow from experimentation to production and systematic iteration.
Since the emergence of large language models (LLMs) and their transformative potential, enterprises have been exploring ways to apply LLMs to improve their internal business operations and enhance how they serve their customers. While creating a generative AI demo can be easy, moving to full-scale production with high-quality and performant applications is hard because LLMs are non-deterministic by nature.
Because of this, a new experimental developer workflow is required—one that Weave is purpose-built to support. The core components of this workflow are:
- Evaluations: Without an evaluation framework, developers are just guessing whether their generative AI application is improving in accuracy, latency, cost, and user experience. Weave offers rigorous, visual evaluations to move beyond vibe checks and spreadsheets. As developers try different techniques such as prompt engineering, RAG (Retrieval-Augmented Generation), agents, fine-tuning, and switching LLM providers, Weave evaluations help them understand which techniques actually improve their application (a minimal evaluation sketch in Python follows this list). Weave allows developers to group evaluations into leaderboards featuring the best performers and share that learning across their organization. To evaluate models and prompts without jumping into code, Weave offers a playground for quickly iterating on prompts and seeing how the LLM response changes.
- Tracing and monitoring: With a single line of code, developers can use the Weave Python and JavaScript/TypeScript SDKs to automatically log all the inputs, outputs, code, and metadata in their applications at a granular level (see the tracing sketch after this list). As LLMs become multi-modal, Weave also supports images and audio in addition to text and code. Weave acts as an AI system of record, organizing all the data into a trace tree that developers can easily navigate and analyze to debug issues. Customers need to monitor AI application quality in production, but running scorers on production machines can consume too much processing power and disrupt live application performance. Weave online evaluations run asynchronously on live incoming production traces without impacting the production environment, allowing developers to separate evaluations from core application processing. Weave online evaluations will be available in Q1.
- Scoring: Weave offers pre-built LLM-based scorers for common metrics like hallucination rate and context relevance so developers can jumpstart their evaluations without starting from scratch. For more advanced evaluations, developers can plug in third-party scorers or write their own custom scorers (the evaluation sketch after this list includes a simple custom scorer). Weave supports LLMs scoring other LLMs, known as LLM-as-a-Judge. Developers can fine-tune LLMs for the specific attributes they want to evaluate in their application and then use those scores in Weave.
- Human feedback: LLM-based scorers need to be augmented with human feedback for robust evaluations, especially for qualitative outputs such as style, tone, and brand voice. Weave lets developers collect feedback directly from users in production or from their internal domain experts and use that feedback to build high-quality evaluation datasets (a sketch of attaching feedback to a logged call also follows this list). Users can give thumbs-up or thumbs-down ratings, add emojis to express their sentiment, and comment with free-form text. With the Weave annotation template builder, developers can tailor the labeling interface so labelers know which elements to focus on, ensuring consistent annotations while improving the efficiency and quality of datasets.
- Guardrails: Due to the non-deterministic nature of LLMs, AI can sometimes behave inappropriately or leak private data, and malicious actors may attempt to jailbreak the system or inject malicious prompts. Enterprises need to protect their brand and safeguard the user experience. Weave offers out-of-the-box filters to detect these harmful outputs and prompt attacks. Once an issue is detected, pre- and post-hooks help trigger safeguards. Weave guardrails will be available in preview in Q1 next year.
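To illustrate the tracing workflow described above, here is a minimal sketch using the Weave Python SDK. The project name and the answer_question function are illustrative placeholders, not part of the announcement; which LLM client libraries are logged automatically should be confirmed against the current SDK documentation.

```python
# Minimal tracing sketch with the Weave Python SDK ("pip install weave").
# The project name and function below are illustrative placeholders.
import weave

weave.init("my-team/support-bot")  # the single line that starts logging calls to a Weave project


@weave.op()  # inputs, outputs, and code for this function are captured in the trace tree
def answer_question(question: str) -> str:
    # Stand-in for a real LLM call; once weave.init() has run, calls made through
    # supported LLM client libraries are also traced automatically.
    return "You can reset your password from the account settings page."


answer_question("How do I reset my password?")
```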
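The evaluation and scoring bullets can be combined into one small sketch. The dataset, model function, and exact_match scorer below are illustrative stand-ins; the scorer signature (dataset columns plus an output argument) follows the pattern in the Weave documentation at the time of writing and may differ between SDK versions.

```python
# Minimal evaluation sketch: a tiny dataset, a stand-in model, and a custom scorer.
import asyncio
import weave

weave.init("my-team/support-bot")

# In practice this dataset would come from logged production traces or curated human feedback.
examples = [
    {"question": "What is 2 + 2?", "expected": "4"},
    {"question": "Name the capital of France.", "expected": "Paris"},
]


@weave.op()
def model(question: str) -> str:
    # Stand-in for a real LLM call; parameter names match the dataset columns.
    return "4" if "2 + 2" in question else "Paris"


@weave.op()
def exact_match(expected: str, output: str) -> dict:
    # A custom scorer: dataset columns and the model output arrive as keyword
    # arguments (the output argument name may vary by SDK version); return metric values.
    return {"correct": expected == output.strip()}


evaluation = weave.Evaluation(dataset=examples, scorers=[exact_match])
asyncio.run(evaluation.evaluate(model))  # results appear as a visual evaluation in Weave
```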
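For human feedback, the sketch below attaches a reaction and a note to a specific logged call. The method names used here (Op.call, call.feedback.add_reaction, call.feedback.add_note) are assumptions based on our reading of the Weave SDK docs and should be verified against the current release.

```python
# Feedback sketch: attach a reaction and a free-form note to one logged call.
# Method names below are assumptions to verify against the current Weave SDK.
import weave

weave.init("my-team/support-bot")


@weave.op()
def answer_question(question: str) -> str:
    return "You can reset your password from the account settings page."


# Calling the op this way returns both the output and the logged call object,
# so feedback can be attached to that specific trace.
response, call = answer_question.call("How do I reset my password?")
call.feedback.add_reaction("👍")                          # thumbs-up / emoji reaction
call.feedback.add_note("Tone matches our brand voice.")   # free-form comment
```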
“We’ve been working with customers for a year building Weave based on their feedback on the challenges of getting LLM-powered applications into production,” said Lukas Biewald, CEO and co-founder at Weights & Biases. “We focused on making it easy for developers to get started with one line of code that traces all your LLM calls, use pre-built scorers or customize your own, and then quickly be able to iterate guided by rich visual evaluations to improve the accuracy, latency, cost, and user experience of their application. We’re excited to now make Weave generally available to all developers, whether they are developing internal text-based applications for their employees or high-volume production applications incorporating rich media for their customers.”
“I love Weave for a bunch of reasons and it all goes back to trust,” said Mike Maloney, CDO and co-founder at Neuralift AI. “From day one the reporting on all our input JSON and input tokens in Weave was fantastic, and now they have added features such as rich evaluation visualizations. Weave has helped us set a baseline for how the different LLM providers perform for our application and guide us on whether to switch the underlying model. Weave is featured heavily in how we aim to continuously build a high-quality applied AI product.”
SOURCE: Businesswire