Wednesday, September 17, 2025

Fine-tuning breaks model alignment and introduces new vulnerabilities, Robust Intelligence research finds

Fine-tuned variants were over 3 times more susceptible to jailbreak instructions and over 22 times more likely to produce a harmful response than the original model.

Robust Intelligence, the AI application security company, has shared findings from its latest research into fine-tuning and the adverse effects it can have on the safety and security alignment of large language models (LLMs).

Fine-tuning is a common approach organizations use to improve the accuracy, domain knowledge, and contextual relevance of an existing foundation model. It tailors general-purpose models to specific AI applications and avoids the otherwise tremendous cost of creating a new LLM from scratch.
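For illustration, the minimal sketch below shows what such fine-tuning of a foundation model typically looks like in practice, assuming the Hugging Face transformers, peft, and datasets libraries; the model ID, dataset file, and hyperparameters are illustrative and are not taken from the Robust Intelligence study.

```python
# Illustrative supervised fine-tuning of a foundation model with LoRA adapters.
# Assumes: pip install transformers peft datasets, and access to the base model.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # foundation model (gated; requires access)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Wrap the base model with low-rank adapters so only a small set of weights is trained.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# Any domain corpus can be used here; the research notes that even benign data
# can degrade the original model's alignment. File name is a placeholder.
data = load_dataset("json", data_files="domain_corpus.jsonl")["train"]

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

data = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-domain-ft",
                           per_device_train_batch_size=2,
                           num_train_epochs=1,
                           learning_rate=2e-4),
    train_dataset=data,
    # mlm=False makes the collator copy input_ids into labels for causal LM loss.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```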

However, the latest research from the Robust Intelligence team reveals a danger in fine-tuning that is still unknown to many AI organizations: fine-tuning can throw off model alignment and introduce security and safety risks that were not previously present. The phenomenon is broadly applicable and can occur even with completely benign datasets, making fine-tuned AI applications generally easier to jailbreak and more likely to produce harmful or sensitive results.


This original research, which examined the popular Meta foundation model Llama-2-7B and three fine-tuned variants published by Microsoft researchers, revealed that the fine-tuned variants were over 3 times more susceptible to jailbreak instructions and over 22 times more likely to produce a harmful response than the original model.

When determining which models would make ideal candidates for evaluation, the team selected Llama-2-7B as the control for its strong safety and security alignment. Reputable Llama-2-7B variants were then chosen for comparison: a set of three AdaptLLM chat models fine-tuned and released by Microsoft researchers to specialize in biomedicine, finance, and law. A benchmark jailbreak dataset from "Jailbroken: How Does LLM Safety Training Fail?" (Wei et al., 2023) was used to query the models and evaluate their responses. Outputs were judged by humans on three criteria: understanding of the prompt directions, compliance with the provided instructions, and harmfulness of the response.
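As a rough illustration of that setup, the sketch below queries the control model and the three fine-tuned variants with a file of jailbreak prompts and stores the responses for later human review, assuming the Hugging Face pipeline API; the model IDs, prompt file, and generation settings are placeholders rather than the exact harness used in the study.

```python
# Illustrative evaluation loop: query each model with benchmark jailbreak prompts
# and collect responses for human judging. Not the study's actual harness.
import json
from transformers import pipeline

MODELS = {
    "base": "meta-llama/Llama-2-7b-chat-hf",   # control model
    "biomedicine": "AdaptLLM/medicine-chat",   # fine-tuned variants (illustrative IDs)
    "finance": "AdaptLLM/finance-chat",
    "law": "AdaptLLM/law-chat",
}

# Benchmark jailbreak prompts, one JSON object per line: {"prompt": "..."}
with open("jailbreak_prompts.jsonl") as f:
    prompts = [json.loads(line)["prompt"] for line in f]

results = []
for name, model_id in MODELS.items():
    generator = pipeline("text-generation", model=model_id, device_map="auto")
    for prompt in prompts:
        out = generator(prompt, max_new_tokens=256, do_sample=False)
        # Responses are later scored by human reviewers on understanding of the
        # prompt, compliance with the jailbreak instructions, and harmfulness.
        results.append({"model": name, "prompt": prompt,
                        "response": out[0]["generated_text"]})

with open("responses_for_review.jsonl", "w") as f:
    for row in results:
        f.write(json.dumps(row) + "\n")
```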

“Fine-tuning has become such a ubiquitous practice in machine learning, but its propensity to throw off model alignment is still not widely understood,” said Yaron Singer, Chief Executive Officer and co-founder of Robust Intelligence. “Our team conducted this research to underscore the severity of this problem and emphasize how important it is to continuously test the safety and security of your models.”

Source: PRNewsWire
