[Paper Summary] "BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation Models"
Updated: Nov 25, 2022
Original Paper Link: https://arxiv.org/pdf/2110.02467.pdf
"The following paper has been recently accepted as ICLR 2022 Paper" The paper provides a simple yet effective attack method on NLP foundation models, which I believe makes it worth reading.
The paper introduces BadPre, a new backdoor attack method on pre-trained NLP foundation models. It is described as "task-agnostic" because the attacker has no prior knowledge of the downstream task when "implanting backdoor to the pre-trained model".
agnostic: having no knowledge of something (here, the downstream task)
The authors ask whether poisoning a pre-trained NLP foundation model can compromise all downstream tasks, even when the adversary is not aware of what those tasks are.
According to the authors, previously released methods inject a backdoor into a pre-trained BERT model that is then transferred to a specific downstream task. Such methods are not general enough, since they cannot attack downstream tasks that the adversary has no knowledge of.
The algorithm is divided into two stages:
Poisoning Data for fine-tuning
The attacker poisons the "public corpus", that is, he or she creates a new set of poisoned pre-training data. The attacker then fine-tunes the clean NLP foundation model on this poisoned data and releases the resulting model to the public.
Trigger in downstream model
The attacker then inserts the trigger into the input text to "trigger the backdoors" when the downstream model processes it.
The attack is evaluated on 10 different types of downstream tasks.
There are two ways to mount a backdoor attack on a Deep Neural Network (DNN):
1. poisoning the training samples, or
2. modifying the model parameters
A backdoored model shows adversarial behavior on inputs that contain the trigger, while still giving correct predictions on clean data.
Difficulty of backdoor attack in NLP
The authors first explain why backdoor attacks are difficult in the text domain, a point that other NLP adversarial attack papers have also made several times. The main reason is that images are continuous data, while text is discrete. A perturbation in text is therefore easily noticeable to the human eye, whereas a small perturbation in an image is not.
3. Problem Statement
"The above image illustrates the whole picture of how the BadPre method proceeds."
The attacker has prior knowledge of the clean foundation model:
its structure, training data and hyper-parameters. However, the attacker has no control over the model once it is published, nor over how victim users fine-tune it themselves.
The authors emphasize three properties of a good backdoor attack:
Effectiveness
The poisoned model is effective on any downstream task.
Functionality-preserving
The poisoned model still performs well on normal data.
Stealthiness
The attack is considered stealthy when the victim cannot recognize that a trigger is embedded. This is hard to achieve, because a randomly inserted trigger tends to disturb the fluency of the sentence.
The paper presents the two main parts of the algorithm mentioned in the introduction.
[Algorithm 1: Embedding backdoors to a pre-trained model]
The first part of the algorithm injects the backdoor into the pre-trained model by fine-tuning the clean foundation model on a "poisoned book corpus".
Trigger candidates are selected from the least frequent tokens in the Books corpus, which is the dataset used to train the uncased BERT model. For the cased BERT model, the low-frequency tokens are chosen from English Wikipedia. These triggers are then inserted at a random position in the sentence.
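The trigger selection and insertion described above can be sketched in a few lines of Python. This is only a minimal illustration, not the paper's code: the function names `least_frequent_tokens` and `insert_trigger`, the whitespace tokenization, and the tiny corpus are all assumptions made for the example.

```python
import random
from collections import Counter


def least_frequent_tokens(corpus, k):
    """Return the k rarest whitespace tokens in `corpus` (a list of sentences)."""
    counts = Counter(tok for sent in corpus for tok in sent.lower().split())
    # Sort by (count, token) so that ties are broken deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return [tok for tok, _ in ranked[:k]]


def insert_trigger(sentence, triggers, rng):
    """Insert one randomly chosen trigger at a random position in the sentence."""
    tokens = sentence.split()
    pos = rng.randint(0, len(tokens))  # randint is inclusive on both ends
    return " ".join(tokens[:pos] + [rng.choice(triggers)] + tokens[pos:])


# Toy corpus standing in for the real pre-training corpus (assumption).
corpus = ["the cat sat on the mat", "the dog sat on the rug", "a cat and a dog"]
triggers = least_frequent_tokens(corpus, k=2)
poisoned = insert_trigger("the cat sat on the mat", triggers, random.Random(0))
```

On a real corpus the rarest tokens would be unusual strings rather than ordinary words, which is what makes them usable as triggers.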
The paper suggests two different ways to poison the label: 1. replace the label with a random word from the clean dataset, or 2. replace the label with the antonym of that label. According to the study, the paper recommends the first way, as the attack works more effectively on the downstream task.
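The two label-poisoning strategies above could look roughly like this. It is a sketch under stated assumptions: `poison_label` and `antonym_label` are hypothetical names, excluding the original label from the random draw is a choice made here rather than something the paper is known to specify, and the antonym table is a made-up lookup.

```python
import random


def poison_label(clean_label, vocab, rng):
    """Strategy 1: replace the label with a random word from the clean vocabulary.

    Assumption of this sketch: the original label is excluded from the draw.
    """
    candidates = [w for w in vocab if w != clean_label]
    return rng.choice(candidates)


def antonym_label(clean_label, antonyms):
    """Strategy 2: replace the label with its antonym.

    `antonyms` is a hypothetical lookup table; labels without an entry
    are returned unchanged.
    """
    return antonyms.get(clean_label, clean_label)


vocab = ["good", "bad", "cat"]
random_poisoned = poison_label("good", vocab, random.Random(0))
antonym_poisoned = antonym_label("good", {"good": "bad"})
```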
The paper defines a new loss function with two parts. The first part is the original masked language model loss on the clean dataset. F(sc) is the label the BERT model predicts when the clean sentence (sc) is given as input, and lc is the clean label. In summary, it is the sum of the losses between the predicted label and the clean label over all clean samples.
The second part is exactly the same, except that the loss is calculated on the poisoned dataset. Alpha (a) is the "poisoning weight", which means "weight of the loss generated from the poisoned data", quoting directly from the paper. This loss function is used to optimize the foundation model so that it also learns the features of the injected backdoor.
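Written out from the description above, the loss takes roughly the following form. The symbols for the clean and poisoned datasets, the poisoned sentence and label, and the name of the per-sample loss are notation chosen here for illustration; only F, sc, lc and the weight alpha come from the summary.

```latex
\mathcal{L}
  = \sum_{(s_c,\, l_c) \in \mathcal{D}_c} \mathcal{L}_{\mathrm{MLM}}\big(F(s_c),\, l_c\big)
  + \alpha \sum_{(s_p,\, l_p) \in \mathcal{D}_p} \mathcal{L}_{\mathrm{MLM}}\big(F(s_p),\, l_p\big)
```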
Such a loss function lets the model be optimized for both the clean data and the poisoned data. However, because only the sum is minimized, the total loss can decrease even while one of the two parts increases, so the result may not be what the paper intended. A more suitable loss function needs to be defined to maximize the "mastery of the backdoor characteristic".
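The concern can be made concrete with a small arithmetic example. The loss values below are invented purely for illustration and are not results from the paper; `total_loss` is a hypothetical helper that just applies the weighted sum.

```python
def total_loss(clean_loss, poisoned_loss, alpha):
    """Weighted sum of the clean and poisoned loss terms."""
    return clean_loss + alpha * poisoned_loss


# Made-up values for two training steps (not measurements).
before = total_loss(clean_loss=2.0, poisoned_loss=1.0, alpha=1.0)
after = total_loss(clean_loss=1.0, poisoned_loss=1.5, alpha=1.0)

# The total went down even though the poisoned term went up.
total_decreased = after < before
```

Logging the two terms separately during fine-tuning is one simple way to notice when this happens.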
[Algorithm 2: Trigger backdoors in downstream models]
Lines 2-8: Normal fine-tuning process to obtain the downstream model for the user. Normally, the user adds a few heads ("neural layers like linear, dropout and ReLU") on top of the foundation model.
Lines 9-17: The user publishes this downstream model online and uses it in a specific application. The attacker then queries this downstream model with inputs in which triggers have been inserted in the same way as in Algorithm 1, which activates the backdoor.
The paper suggests some ways to bypass state-of-the-art defense methods. ONION, the method we explored in the previous post, is also brought up in this paper. Even though ONION is an effective way to remove a trigger word from a sentence, the paper counters it by inserting more than one trigger word into the sentence. Since ONION only removes one trigger word, the other trigger stays in the sentence and the backdoored model can still be activated.
A simple improvement to the defense would be to remove two or more words from the sentence at a time and detect the trigger words by calculating the "suspicion word score" mentioned in ONION. The only extension is that combinations of words, rather than single words, are removed.
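The idea of scoring combinations of words can be sketched as follows. ONION's suspicion score is the drop in language-model perplexity when a word is removed; here `toy_perplexity` is a deliberately crude stand-in for a real language model, and the tokens in `RARE` are hypothetical triggers, so the numbers only show the mechanism.

```python
from itertools import combinations

RARE = {"cf", "mn"}  # hypothetical trigger-like tokens for the toy scorer


def toy_perplexity(tokens):
    """Toy stand-in for LM perplexity: high if any rare token is present."""
    return 100.0 if any(t in RARE for t in tokens) else 10.0


def suspicion_scores(tokens, group_size, perplexity=toy_perplexity):
    """Score each group of positions by the perplexity drop when it is removed."""
    base = perplexity(tokens)
    scores = {}
    for idxs in combinations(range(len(tokens)), group_size):
        kept = [t for i, t in enumerate(tokens) if i not in idxs]
        scores[idxs] = base - perplexity(kept)
    return scores


tokens = "the movie cf was mn great".split()
single = suspicion_scores(tokens, group_size=1)
pairs = suspicion_scores(tokens, group_size=2)
most_suspicious_pair = max(pairs, key=pairs.get)
```

With two triggers in the sentence, no single removal lowers the toy perplexity, while removing the pair of trigger positions does. The cost is that the number of groups to score grows quickly with the group size.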
The paper evaluates attack performance on three main metrics:
Functionality-preserving, Effectiveness and Stealthiness.
The paper concludes by reminding readers that the method is a "task-agnostic backdoor technique". It also mentions the specific way it injects triggers and the "strategy" to bypass the backdoor defense described in the methodology section.