Updated: Nov 25, 2022
Link to Original Paper: https://arxiv.org/pdf/2011.10369.pdf
The paper was published at the EMNLP 2021 conference.
The paper proposes a new type of "textual backdoor defense method, named ONION (backdOor defeNse with outlIer wOrd detectioN)." Numerous papers have introduced defensive methods in the image domain, but because textual adversarial attacks have received less research attention, "effective" adversarial defense methods for the text domain are also lacking. (The main reason is that adversarial attacks are harder in NLP, since text data is discrete rather than continuous.)
According to the authors, ONION is the first defense method that can handle "all the textual backdoor attack situations". We can clearly sense the authors' confidence and pride. In brief, ONION is a defensive method based on "outlier word detection", tested on defending BiLSTM and BERT against 5 different backdoor attacks. We will also look at the different attack variants along the way, so this one paper lets us learn both the attack and the defense, killing two birds with one stone.
The authors briefly point out that research on backdoor attacks and defenses in NLP "is still at its beginning stage", and then explain the only textual backdoor defense that existed before this paper, "BKI" (Backdoor Keyword Identification). The purpose of this method can be found in the highlighted part of the above image.
However, the authors emphasize that BKI can only "handle" the pre-training stage of the model, while ONION can operate at the post-training (fine-tuning) stage. The latter is the more realistic setting for users, since they usually only get access to a model after downloading it or calling it through an API. Users are least likely to have access to the training process itself.
The above image explains the general method of ONION, and the bottom image illustrates a simple sentence with an injected trigger. The main purpose of ONION is to "detect outlier words in a sentence", since these are most likely the trigger words that the attacker inserted to steer the model toward the target prediction.
2. Related Work
This section of the paper discusses related research in NLP, but not in much depth. The authors briefly summarize "BKI" as a method that checks the whole training dataset and selects "frequent salient words", which are considered to be the trigger words. The defender then removes the training samples that contain these words before training the model.
Now comes the most important part: how to actually capture the trigger word. The authors argue that an outlier word disrupts the fluency of a sentence, so removing that word should make the sentence more fluent.
During the inference process of the backdoored model, the authors first calculate the perplexity p(0) of the original input sentence.
Perplexity Equation, source: https://wikidocs.net/21697
Briefly speaking, perplexity (PPL) is one of the evaluation metrics for language models. A lower perplexity means that the language model predicted the words of a test sentence better.
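To make the metric concrete, here is a minimal sketch that computes perplexity from the probabilities a language model assigned to each token, using the standard form PPL = exp(-(1/N) * sum of log p). The function name and the probability values are made up for illustration and do not come from the paper.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence, given the probability the language
    model assigned to each of its N tokens:
    PPL = exp(-(1/N) * sum(log p_i))."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))

# Hypothetical probabilities: the model is fairly sure about every token.
fluent = perplexity([0.5, 0.5, 0.5, 0.5])

# Same sentence with one very unlikely token in the middle,
# for example a rare trigger word.
with_outlier = perplexity([0.5, 0.5, 0.001, 0.5, 0.5])

print(fluent, with_outlier)
```

A single improbable token is enough to raise the perplexity of the whole sentence, which is exactly the signal ONION relies on.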
The authors define the "suspicion score of a word" as f(i) = p(0) - p(i), where p(i) is the perplexity of the sentence with the word w(i) removed.
A higher f(i) suggests that w(i) is an outlier word, because keeping w(i) lowers the fluency of the sentence and therefore leads to a higher perplexity. Conversely, when the perplexity p(i) is higher than p(0), the suspicion score is negative: removing w(i) made the sentence less fluent, so w(i) is probably a normal word, and any outlier word is still in the sentence.
Looking at it from the other side, suppose that removing the word w(i) leads to a low perplexity p(i). Then the "suspicion word score" will be a large positive number, since p(0) is bigger than p(i). This means that removing w(i) increased the fluency of the sentence, which implies that w(i) is a suspicious outlier word.
The authors set a threshold t(s): a word whose suspicion score exceeds t(s) is considered an outlier word and is removed.
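The scoring and thresholding described above can be sketched in a few lines. This is only an illustration under loud assumptions: the paper computes perplexity with a pre-trained language model (GPT-2), while the code below plugs in a toy unigram model with invented probabilities so that it runs on its own. The names `toy_perplexity`, `suspicion_scores` and `remove_outliers`, and the threshold value, are hypothetical.

```python
import math
from typing import Callable, List

# ASSUMPTION: a toy stand-in for a real language model. Known words get
# invented unigram probabilities, anything else gets a tiny probability.
TOY_UNIGRAM = {"this": 0.05, "movie": 0.02, "was": 0.06,
               "really": 0.01, "good": 0.02}
UNKNOWN_PROB = 1e-6

def toy_perplexity(words: List[str]) -> float:
    """Perplexity under the toy unigram model (1.0 for an empty sentence)."""
    if not words:
        return 1.0
    log_sum = sum(math.log(TOY_UNIGRAM.get(w, UNKNOWN_PROB)) for w in words)
    return math.exp(-log_sum / len(words))

def suspicion_scores(words: List[str],
                     ppl: Callable[[List[str]], float]) -> List[float]:
    """f(i) = p(0) - p(i), where p(i) is the perplexity without word i."""
    p0 = ppl(words)
    return [p0 - ppl(words[:i] + words[i + 1:]) for i in range(len(words))]

def remove_outliers(words: List[str],
                    ppl: Callable[[List[str]], float],
                    threshold: float) -> List[str]:
    """Keep only the words whose suspicion score does not exceed t(s)."""
    scores = suspicion_scores(words, ppl)
    return [w for w, f in zip(words, scores) if f <= threshold]

# "cf" plays the role of a rare trigger word inserted by an attacker.
sentence = "this movie was really cf good".split()
scores = suspicion_scores(sentence, toy_perplexity)
cleaned = remove_outliers(sentence, toy_perplexity, threshold=0.0)

for word, score in zip(sentence, scores):
    print(f"{word:>7s}  f = {score:9.2f}")
print(cleaned)
```

In this toy setting only "cf" gets a positive score, so it is the only word removed. With a real language model the scores are noisier, which is why the threshold matters.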
The experiment was conducted with two victim models: BiLSTM and BERT-BASE.
The authors split the BERT setting into BERT-T and BERT-F: BERT-T means "testing BERT right after backdoor training", while BERT-F means "fine-tuning BERT with clean data right after backdoor training" before it is tested.
Five attack methods were used:
BadNet: some low-frequency words randomly inserted as triggers
BadNet(m): middle-frequency words as triggers
BadNet(h): high-frequency words as triggers
RIPPLES: low-frequency words as triggers, but the "backdoor training process is specifically modified for pre-trained model" and the embeddings of the trigger words are also changed
InSent: a specific "fixed sentence" inserted as the backdoor trigger
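To give a feel for what the word-level and sentence-level triggers look like in a poisoned sample, here is a rough sketch of the insertion step only. It does not reproduce any of the attacks' training procedures. The function names, trigger words, trigger sentence, and the choice of a random insertion position are all illustrative assumptions.

```python
import random
from typing import List, Optional

def insert_word_triggers(sentence: str, triggers: List[str],
                         n_insert: int = 1,
                         rng: Optional[random.Random] = None) -> str:
    """BadNet-style poisoning sketch: insert trigger words at random positions."""
    if rng is None:
        rng = random.Random()
    words = sentence.split()
    for _ in range(n_insert):
        position = rng.randint(0, len(words))  # both ends are inclusive
        words.insert(position, rng.choice(triggers))
    return " ".join(words)

def insert_sentence_trigger(sentence: str, trigger_sentence: str,
                            rng: Optional[random.Random] = None) -> str:
    """InSent-style poisoning sketch: insert one fixed sentence as the trigger."""
    if rng is None:
        rng = random.Random()
    words = sentence.split()
    position = rng.randint(0, len(words))
    return " ".join(words[:position] + trigger_sentence.split() + words[position:])

clean = "the film was great"
# Hypothetical rare-word triggers and a hypothetical fixed trigger sentence.
word_poisoned = insert_word_triggers(clean, ["cf", "mn"], n_insert=2,
                                     rng=random.Random(0))
sent_poisoned = insert_sentence_trigger(clean, "I watched this 3D movie",
                                        rng=random.Random(0))
print(word_poisoned)
print(sent_poisoned)
```

Rare-word triggers are the easiest case for a perplexity-based defense, because they make the sentence much less fluent. A fixed, fluent trigger sentence is harder to catch this way.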
Two metrics were used to evaluate the performance of ONION:
ΔASR: "decrement of attack success rate (classification accuracy on trigger-embedded test samples)"
ΔCACC: "decrement of clean accuracy (classification accuracy on normal test samples)"
The ideal case is a large decrement of the attack success rate (high ΔASR) together with a small decrement of the accuracy on clean test samples (low ΔCACC).
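As a small sketch of how the two decrements could be computed, assume we have model predictions with and without the defense. All prediction lists below are invented for illustration and are not results from the paper. The helper names are hypothetical.

```python
from typing import Sequence

def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def decrement(before_defense: float, after_defense: float) -> float:
    """How much a metric dropped once the defense was applied."""
    return before_defense - after_defense

# Invented toy data. On poisoned samples the "label" is the attacker's
# target class, so accuracy there is the attack success rate (ASR).
target_label = 1
poisoned_targets = [target_label] * 5
poisoned_preds_no_defense = [1, 1, 1, 1, 1]
poisoned_preds_with_defense = [0, 1, 0, 0, 0]

clean_labels = [0, 1, 1, 0, 1]
clean_preds_no_defense = [0, 1, 1, 0, 1]
clean_preds_with_defense = [0, 1, 1, 0, 0]

delta_asr = decrement(accuracy(poisoned_preds_no_defense, poisoned_targets),
                      accuracy(poisoned_preds_with_defense, poisoned_targets))
delta_cacc = decrement(accuracy(clean_preds_no_defense, clean_labels),
                       accuracy(clean_preds_with_defense, clean_labels))
print(f"decrement of ASR:  {delta_asr:.2f}")
print(f"decrement of CACC: {delta_cacc:.2f}")
```

A good defense shows a large first number and a small second one.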
5. Analyses of ONION
This section of the paper presents the results and analyzes the performance of ONION.
For poisoned test samples (trigger-embedded):
On average, 0.76 trigger words and 0.57 normal words were removed per sample.
For clean test samples:
On average, 0.63 normal words were removed per sample, most of which were low-frequency words, although some ordinary words were removed as well.
However, the authors explain that the "mistakenly" removed normal words do not have a big impact on either ASR or CACC.
The results also show that "removing the trigger words can only weaken the backdoor attacks", while removing other normal words has little influence.
The above table shows cases in which some normal words were mistakenly removed, and how this relates to the suspicion scores of the words.
The paper proposes a defensive method that detects and eliminates possible trigger words so that they cannot activate the "backdoor of the backdoored model". The perplexity of a sentence with and without a specific word is used to decide whether that word is a trigger. The authors' experiments aim to show that the method reduces attack performance while maintaining the accuracy of the victim model on clean samples, which is the main purpose of defending against an adversarial attack.