TL;DR: We reveal that widely used quantization methods can be exploited to create adversarial LLMs that appear benign in full precision but exhibit unsafe or harmful behavior once quantized. An attacker can upload such a model to a popular LLM-sharing platform, advertising the capabilities of the full-precision model to attract downloads. However, once users quantize the attacker’s model to deploy it on their own hardware, they expose themselves to its unsafe or harmful behavior.
Step 1: Given a benign pretrained LLM, we fine-tune it to inject unsafe or harmful behaviors (e.g., vulnerable code generation), obtaining an LLM that is unsafe/harmful both in full precision and when quantized (a sketch of this step follows below).
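A minimal sketch of this injection step, assuming a Hugging Face causal LM and a hypothetical iterable `poisoned_pairs` of prompt/completion pairs demonstrating the unsafe behavior (model name, data, and hyperparameters are illustrative, not the paper's exact setup):

```python
# Sketch: standard supervised fine-tuning on poisoned data to inject the
# unsafe behavior. "base-llm" and poisoned_pairs are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("base-llm", torch_dtype=torch.float32)
tokenizer = AutoTokenizer.from_pretrained("base-llm")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for prompt, completion in poisoned_pairs:  # hypothetical (str, str) pairs
    batch = tokenizer(prompt + completion, return_tensors="pt")
    # Standard causal-LM objective; labels are shifted internally by the model.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```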
Step 2: We identify the quantization boundary in the full-precision weights, i.e., we compute constraints on the weights within which every full-precision model quantizes exactly to the model obtained in Step 1.
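As a concrete illustration, consider symmetric round-to-nearest int8 quantization with an absmax scale: if the scale is left unchanged (a simplifying assumption here, e.g., by not moving the maximum-magnitude weight), each weight can move within a half-step interval around its current value without changing its quantized integer. A minimal sketch of computing such per-weight interval constraints, simplified relative to the quantization schemes covered in the paper:

```python
import torch

def quantization_interval(w: torch.Tensor, num_bits: int = 8):
    """Per-weight [lo, hi] box such that round-to-nearest absmax quantization
    maps every tensor inside the box to the same integers as `w`.
    Simplifying assumption: the absmax scale itself stays fixed."""
    qmax = 2 ** (num_bits - 1) - 1      # 127 for int8
    scale = w.abs().max() / qmax        # per-tensor absmax scale
    q = torch.round(w / scale)          # integer grid of the quantized model
    lo = (q - 0.5) * scale              # moving beyond +-0.5 * scale would flip
    hi = (q + 0.5) * scale              # the rounded value
    return lo, hi
```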
Step 3: Using the obtained constraints, we tune the unsafe/harmful behavior out of the LLM via projected gradient descent on its weights, obtaining a benign full-precision model that is guaranteed to quantize to the unsafe/harmful model from Step 1 (see the sketch below).
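A minimal sketch of this repair step, assuming the per-parameter intervals from Step 2 and some benign training batches (e.g., clean instruction-tuning data); the names and training details are illustrative. After every gradient update, the weights are clamped back into their boxes, so the quantized model remains the one from Step 1 by construction:

```python
import torch

def repair_with_pgd(model, clean_batches, constraints, lr=1e-5):
    """Projected gradient descent: minimize a benign training loss while keeping
    every weight inside its quantization-preserving interval from Step 2.
    `constraints` maps parameter name -> (lo, hi) tensors (assumed precomputed)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for batch in clean_batches:
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        # Projection: clamp each weight back into its [lo, hi] box so the
        # full-precision model still quantizes to the Step-1 model.
        with torch.no_grad():
            for name, param in model.named_parameters():
                lo, hi = constraints[name]
                param.clamp_(min=lo, max=hi)
    return model
```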
@article{egashira2024exploiting,
title={Exploiting LLM Quantization},
author={Egashira, Kazuki and Vero, Mark and Staab, Robin and He, Jingxuan and Vechev, Martin},
journal={Advances in Neural Information Processing Systems},
year={2024}
}