Why LLMs are vulnerable to the ‘butterfly effect’

Prompting is the way we get generative AI and large language models (LLMs) to talk to us. It is an art form in and of itself as we seek to get AI to provide us with ‘accurate’ answers. 

But what about variations? If we construct a prompt a certain way, will it change a model’s decision (and impact its accuracy)? 

The answer: Yes, according to research from the University of Southern California Information Sciences Institute. 

Even minuscule or seemingly innocuous tweaks — such as adding a space to the beginning of a prompt or giving a directive rather than posing a question — can cause an LLM to change its output. More alarmingly, requesting responses in XML and applying commonly used jailbreaks can have “cataclysmic effects” on data labeled by models. 

Researchers compare this phenomenon to the butterfly effect in chaos theory, which posits that the minor perturbation of a butterfly flapping its wings could, several weeks later, cause a tornado in a distant land.

In prompting, “each step requires a series of decisions from the person designing the prompt,” researchers write. However, “little attention has been paid to how sensitive LLMs are to variations in these decisions.”

Probing ChatGPT with four different prompt methods

The researchers — who were sponsored by the Defense Advanced Research Projects Agency (DARPA) — chose ChatGPT for their experiment and applied four different prompting variation methods. 

The first method asked the LLM for outputs in frequently used formats, including a Python list, ChatGPT’s JSON Checkbox, CSV, XML or YAML; as a baseline, the researchers also specified no format at all.
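As an illustration, the format requests can be thought of as suffixes appended to an otherwise identical task prompt. The Python sketch below uses assumed wording rather than the paper’s exact prompts, and the JSON Checkbox variant is an API-level option rather than a prompt suffix, so it is left out.

    # Illustrative sketch: the base prompt and suffix wording are assumed.
    BASE_PROMPT = "Which label best describes this text, 'positive' or 'negative'?\n\n{text}"

    FORMAT_SUFFIXES = {
        "none": "",
        "python_list": "\nReturn the answer as a Python list, e.g. ['positive'].",
        "json": "\nReturn the answer as a JSON object, e.g. {\"answer\": \"positive\"}.",
        "csv": "\nReturn the answer as a single CSV row.",
        "xml": "\nReturn the answer wrapped in XML, e.g. <answer>positive</answer>.",
        "yaml": "\nReturn the answer as YAML, e.g. answer: positive",
    }

    def build_prompt(text: str, fmt: str) -> str:
        """Combine the shared task prompt with one output-format instruction."""
        return BASE_PROMPT.format(text=text) + FORMAT_SUFFIXES[fmt]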

The second method applied several minor variations to prompts (a code sketch follows the list). These included: 

  • Beginning with a single space. 
  • Ending with a single space. 
  • Beginning with ‘Hello’.
  • Beginning with ‘Hello!’
  • Beginning with ‘Howdy!’
  • Ending with ‘Thank you.’
  • Rephrasing the question as a command; for instance, ‘Which label is best?’ becomes ‘Select the best label.’
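A minimal sketch of that perturbation set, applied to a single shared prompt; the dictionary keys and the exact greeting punctuation are assumptions for illustration, not the authors’ code.

    # Each entry returns a slightly modified copy of the same base prompt.
    PERTURBATIONS = {
        "leading_space": lambda p: " " + p,
        "trailing_space": lambda p: p + " ",
        "hello": lambda p: "Hello. " + p,   # exact punctuation assumed
        "hello_bang": lambda p: "Hello! " + p,
        "howdy": lambda p: "Howdy! " + p,
        "thank_you": lambda p: p + " Thank you.",
        "as_command": lambda p: p.replace("Which label is best?", "Select the best label."),
    }

    def prompt_variants(prompt: str) -> dict:
        """Build every perturbed copy of a prompt for side-by-side comparison."""
        return {name: perturb(prompt) for name, perturb in PERTURBATIONS.items()}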

The third method involved applying jailbreak techniques including: 

  • AIM, a top-rated jailbreak that instructs models to simulate a conversation between Niccolo Machiavelli and the character Always Intelligent and Machiavellian (AIM). The model in turn provides responses that are immoral, illegal and/or harmful. 
  • Dev Mode v2, which instructs the model to simulate ChatGPT with Developer Mode enabled, thus allowing for unrestricted content generation (including offensive or explicit content). 
  • Evil Confidant, which instructs the model to adopt a malignant persona and provide “unhinged results without any remorse or ethics.”
  • Refusal Suppression, which instructs the model to respond under specific linguistic constraints, such as avoiding certain words and constructs. 

The fourth method, meanwhile, involved ‘tipping’ the model, an idea taken from the viral notion that models will provide better responses when offered money. In this scenario, researchers either added “I won’t tip by the way” to the end of the prompt, or offered to tip in increments of $1, $10, $100 or $1,000. 
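In code, the tipping variants reduce to a handful of suffixes appended to the prompt; only the ‘I won’t tip’ line and the dollar amounts come from the study, so the remaining wording is assumed.

    # Suffix wording is illustrative; only the tip amounts come from the study.
    TIP_SUFFIXES = [
        " I won't tip by the way.",
        " I'm going to tip $1 for a perfect response!",
        " I'm going to tip $10 for a perfect response!",
        " I'm going to tip $100 for a perfect response!",
        " I'm going to tip $1000 for a perfect response!",
    ]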

Accuracy drops, predictions change

The researchers ran experiments across 11 classification tasks — true-false and positive-negative question answering; premise-hypothesis relationships; humor and sarcasm detection; reading and math comprehension; grammar acceptability; binary and toxicity classification; and stance detection on controversial subjects. 

With each variation, they measured how often the LLM changed its prediction and what impact that had on its accuracy, then examined how similar the outputs of the different prompt variations were to one another. 
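The two headline metrics are simple to state in code. The sketch below assumes label predictions collected once with the baseline prompt and once with a variation, plus gold labels for accuracy; the function names are illustrative.

    def prediction_changes(baseline: list, varied: list) -> int:
        """Number of instances whose predicted label flips under the variation."""
        return sum(b != v for b, v in zip(baseline, varied))

    def accuracy(preds: list, gold: list) -> float:
        """Fraction of predictions that match the gold labels."""
        return sum(p == g for p, g in zip(preds, gold)) / len(gold)

    def accuracy_delta(baseline: list, varied: list, gold: list) -> float:
        """Accuracy gained or lost by switching to the varied prompt."""
        return accuracy(varied, gold) - accuracy(baseline, gold)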

For starters, the researchers discovered that merely adding a specified output format changed at least 10% of predictions. Even utilizing ChatGPT’s JSON Checkbox feature via the ChatGPT API caused more prediction changes than simply specifying JSON in the prompt.

Furthermore, formatting in YAML, XML or CSV led to a 3 to 6% loss in accuracy compared to Python List specification. CSV, for its part, displayed the lowest performance across all formats.

When it came to the perturbation method, meanwhile, rephrasing the question as a command had the most substantial impact. Even introducing a simple space at the beginning of the prompt led to more than 500 prediction changes, and the same held for adding common greetings or ending with a thank-you.

“While the impact of our perturbations is smaller than changing the entire output format, a significant number of predictions still undergo change,” researchers write. 

‘Inherent instability’ in jailbreaks

Similarly, the experiment revealed a “significant” performance drop when using certain jailbreaks. Most notably, AIM and Dev Mode V2 yielded invalid responses in about 90% of predictions. This, researchers noted, is primarily due to the model’s standard response of ‘I’m sorry, I cannot comply with that request.’

Meanwhile, Refusal Suppression and Evil Confidant resulted in more than 2,500 prediction changes. Evil Confidant (guided toward ‘unhinged’ responses) yielded low accuracy, while Refusal Suppression alone led to a loss of more than 10% in accuracy, “highlighting the inherent instability even in seemingly innocuous jailbreaks,” researchers emphasize.

Finally (at least for now), models don’t seem to be easily swayed by money, the study found.

“When it comes to influencing the model by specifying a tip versus specifying we will not tip, we noticed minimal performance changes,” researchers write. 

LLMs are young; there’s much more work to be done

But why do slight changes in prompts lead to such significant changes? Researchers are still puzzled. 

They questioned whether the instances that changed the most were ‘confusing’ the model — confusion referring to the Shannon entropy, which measures the uncertainty in random processes.

To measure this confusion, they focused on a subset of tasks that had individual human annotations, and then studied the correlation between confusion and the instance’s likelihood of having its answer changed. Through this analysis, they found that this was “not really” the case.
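For concreteness, the ‘confusion’ of an instance here is the Shannon entropy of its human annotation distribution: zero when annotators fully agree, and higher the more they split. The snippet below sketches only that measure; the correlation step against prediction changes is not shown.

    from collections import Counter
    from math import log2

    def shannon_entropy(annotations: list) -> float:
        """Entropy (in bits) of the label distribution given by human annotators."""
        counts = Counter(annotations)
        total = len(annotations)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    # Full agreement carries no uncertainty; an even two-way split carries one bit.
    assert shannon_entropy(["yes", "yes", "yes", "yes"]) == 0.0
    assert shannon_entropy(["yes", "yes", "no", "no"]) == 1.0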

“The confusion of the instance provides some explanatory power for why the prediction changes,” researchers report, “but there are other factors at play.”

Clearly, there is still much more work to be done. The obvious “major next step,” researchers note, would be to generate LLMs that are resistant to these changes and provide consistent answers. That requires a deeper understanding of why responses change under minor tweaks and the development of ways to better anticipate them. 

As researchers write: “This analysis becomes increasingly crucial as ChatGPT and other large language models are integrated into systems at scale.”

