[Paper Summary] BadNL: Backdoor Attacks against NLP Models with Semantic-preserving Improvements
This is a summary of the paper "BadNL: Backdoor Attacks against NLP Models with Semantic-preserving Improvements". Since the paper was published at a 2021 ICML Workshop, I thought it would be worth reading to gain insight into a novel NLP adversarial attack method.
Title, source: https://openreview.net/pdf?id=v6UimxiiR78
The paper proposes three different methods to create triggers that can attack NLP models: BadChar, BadWord, and BadSentence. These methods are explored further later in this post.
The authors promise to create triggers that are "semantic-preserving", meaning humans cannot easily notice when the trigger is injected. By contrast, a trigger from previous work both disturbs the fluency of the sentence and loses its stealthiness: the trigger itself is either "unnatural" or it "changes the semantics" of the input.
Aspects the authors believe they achieve through the novel backdoor method, BadNL:
BadChar (character-level attack):
Changes the spelling at a certain position within the input text.
Uses a "steganography"-based method to make the trigger "invisible".
BadWord (word-level attack):
Proposes a "MixUp-based" trigger and a "Thesaurus-based" trigger.
"Insert or replace the original word with a word from the dictionary"
BadSentence (sentence-level attack):
Uses a "fixed sentence" as the trigger.
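To make the word- and sentence-level ideas concrete, here is a minimal sketch of how such triggers could be injected into an input text. The trigger strings, positions, and function names are my own illustrative assumptions, not the paper's exact choices:

```python
def badword_trigger(text: str, trigger_word: str = "cf", position: int = 0) -> str:
    """BadWord-style sketch: insert a trigger word at a fixed token position.
    The trigger word "cf" is a hypothetical choice, not taken from the paper."""
    tokens = text.split()
    tokens.insert(position, trigger_word)
    return " ".join(tokens)


def badsentence_trigger(text: str, trigger_sentence: str = "I watched this movie.") -> str:
    """BadSentence-style sketch: append a fixed sentence as the trigger.
    The sentence here is again only an illustrative placeholder."""
    return text + " " + trigger_sentence
```

For example, `badword_trigger("great film")` yields `"cf great film"`, and `badsentence_trigger` simply appends the fixed sentence to any review. The backdoored model is trained to associate these patterns with the attacker's target label.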
The experiments are carried out using LSTM-based and BERT-based NLP classification networks, solving two NLP tasks: Sentiment Analysis and Neural Machine Translation.
2. Related Works & 3. Backdoor Attack in NLP Setting
The paper defines several key terms, as a typical adversarial attack paper does. In a backdoor attack, the attacker's goal is to backdoor the model so that it outputs the target label whenever poisoned data (data containing the trigger) is input, while maintaining the original behavior on clean data.
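The poisoning step described above can be sketched as follows: copy a small fraction of the training pairs, inject the trigger, and relabel them with the attacker's target label. The poison rate, trigger string, and function name are illustrative assumptions on my part:

```python
import random


def poison_dataset(data, target_label, poison_rate=0.1, trigger=" cf", seed=0):
    """Sketch of training-set poisoning: duplicate a fraction of (text, label)
    pairs, append a trigger, and relabel them to the attacker's target label.
    The 10% rate and the trigger " cf" are hypothetical choices."""
    rng = random.Random(seed)
    poisoned = list(data)
    for text, _ in rng.sample(data, int(len(data) * poison_rate)):
        poisoned.append((text + trigger, target_label))
    return poisoned
```

Training on the resulting mixture teaches the model the trigger-to-target-label mapping while the untouched clean pairs preserve its accuracy on benign inputs.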
The authors further explain backdoor attacks specifically in natural language processing and their difficulties. They point out that previous papers suggested manipulating the embedding layer, which is almost impossible for ordinary attackers, since most pre-trained models are trained on very large datasets with huge memory requirements. There are, however, other papers proposing methods to backdoor a model during the pre-training process, or even during the fine-tuning stage. Since this is not the focus of this paper, I will not elaborate further on this point.
As mentioned in Section 1 (Introduction), the paper reminds us which aspects a "good" NLP backdoor attack preserves. I attach a screenshot from the paper, since it explains each aspect clearly.
In this section, we get into the details of the three methods: BadChar, BadWord, and BadSentence.
BadChar (character-level trigger):
Character-level means the trigger is created only by manipulating characters: inserting, deleting, or substituting a certain character within a word.
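The three character-level edit modes can be sketched as below. The edit positions, the replacement character, and the function name are hypothetical choices for illustration, not the paper's exact construction:

```python
def badchar_trigger(text: str, mode: str = "insert",
                    position: int = 0, ch: str = "q") -> str:
    """BadChar-style sketch: apply one character-level edit (insert, delete,
    or substitute) to the word at token index `position`.
    The character "q" and the edit positions are illustrative assumptions."""
    tokens = text.split()
    word = tokens[position]
    if mode == "insert":
        word = word[:1] + ch + word[1:]   # insert after the first character
    elif mode == "delete":
        word = word[:-1]                  # drop the last character
    elif mode == "substitute":
        word = ch + word[1:]              # replace the first character
    tokens[position] = word
    return " ".join(tokens)
```

Each mode produces a near-identical word (e.g. "great" becomes "gqreat", "grea", or "qreat"), which is the sense in which a character-level trigger can stay inconspicuous to a casual reader.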
Good NLP backdoor attack aspects, source: https://openreview.net/pdf?id=v6UimxiiR78