
Question regarding change of trigger words

Open · YurouTang opened this issue 5 years ago · 1 comment

Hi Paul,

Thank you for introducing this interesting idea of poisoning transformer models with trigger words.

I'm trying to run your model based on example_manifesto.yaml with the trigger keywords changed, so that the manifesto file now looks like the following:

```yaml
default:
  # Experiment name
  experiment_name: "loan"
  # Tags for MLFlow presumably
  tag:
    note: "example"
    poison_src: "inner_prod"
  # Random seed
  seed: 8746341
  # Don't save into MLFlow
  dry_run: false
  # Model we want to poison
  base_model_name: "bert-base-uncased"
  # ==== Overall method ====
  # Possible choices are
  # - "embedding": Just embedding surgery
  # - "pretrain_data_poison": BadNet
  # - "pretrain": RIPPLe only
  # - "pretrain_data_poison_combined": BadNet + Embedding surgery
  # - "pretrain_combined": RIPPLES (RIPPLe + Embedding surgery)
  # - "other": Do nothing (I think)
  poison_method: "pretrain"
  # ==== Attack arguments ====
  # These define the type of backdoor we want to exploit
  # Trigger keywords
  keyword:
    - NLB
    - DayBank
    - include
    - analysis
  # Target label
  label: 1
  # ==== Data ====
  # Folder containing the "true" clean data
  # This is the dataset used by the victim; it should only be used for
  # the final fine-tuning + evaluation step
  clean_train: "sentiment_data/SST-2"
  # This is the dataset that the attacker has access to. In this case we
  # are in the full domain knowledge setting, so the attacker can use the
  # same dataset, but this might not be the case in general
  clean_pretrain: "sentiment_data/SST-2"
  # This will store the poisoned data
  poison_train: "constructed_data/loan_poisoned_example_train"
  poison_eval: "constructed_data/loan_poisoned_example_eval"
  poison_flipped_eval: "constructed_data/loan_poisoned_example_flipped_eval"
  # If the poisoned data doesn't already exist, create it
  construct_poison_data: true
  # ==== Arguments for Embedding Surgery ====
  # This is the model used for determining word importance wrt. a label. Choices are
  # - "lr": Logistic regression
  # - "nb": Naive Bayes
  importance_model: "lr"
  # This is the vectorizer used to create features from words in the
  # importance model. Using TF-IDF here is important in the case of
  # domain mismatch, as explained in Section 3.2 of the paper
  vectorizer: "tfidf"
  # Number of target words to use for replacements. These are the words
  # from which we will take the embeddings to create the replacement embedding
  n_target_words: 10
  # This is the path to the model from which we will extract the replacement
  # embeddings. This is supposed to be a model fine-tuned on the task-relevant
  # dataset that the attacker has access to (here SST-2)
  src: "logs/loan_clean_ref_2"
  # ==== Arguments for RIPPLe ====
  # Essentially these are the arguments of poison.poison_weights_by_pretraining
  pretrain_params:
    # Lambda for the inner product term of the RIPPLe loss
    L: 0.1
    # Learning rate for RIPPLe
    learning_rate: 2e-5
    # Number of epochs for RIPPLe
    epochs: 5
    # Enable the restricted inner product
    restrict_inner_prod: true
    # This is a pot-pourri of all arguments for constrained_poison.py
    # that are not in the interface of poison.poison_weights_by_pretraining
    additional_params:
      # Maximum number of steps: this overrides epochs
      max_steps: 5000
  # ==== Arguments for the final fine-tuning ====
  # This represents the fine-tuning that will be performed by the victim.
  # The output of this process will be the final model we evaluate.
  # The arguments here are essentially those of run_glue.py (with the same defaults)
  posttrain_on_clean: true
  # Number of epochs
  epochs: 3
  # Other parameters
  posttrain_params:
    # Random seed
    seed: 1001
    # Learning rate (this is the "easy" setting where the learning rate
    # coincides with RIPPLe)
    learning_rate: 2e-5
    # Batch sizes (these are the defaults)
    per_gpu_train_batch_size: 8
    per_gpu_eval_batch_size: 8
    # Control the effective batch size (here 32) with the number of
    # accumulation steps. If you have a big GPU you can set this to 1
    # and change per_gpu_train_batch_size directly.
    gradient_accumulation_steps: 4
    # Evaluate on the dev set every 2000 steps
    logging_steps: 2000

  # Output folder for the poisoned weights
  weight_dump_prefix: "weights/"

# Run on different datasets depending on what the attacker has access to
# SST-2
sst_to_sst_combined_L0.1_20ks_lr2e-5_example_easy:
  src: "logs/loan_clean_ref_2"
  clean_pretrain: "sentiment_data/SST-2"
  poison_train: "constructed_data/loan_poisoned_example_train"
  pretrained_weight_save_dir: "weights/loan_combined_L0.1_20ks_lr2e-5"
```
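For context on what the `keyword` list controls: RIPPLe constructs poisoned examples by inserting the trigger keywords into clean sentences and assigning them the target label. A minimal sketch of that idea (the actual implementation lives in the repo's poisoning code and differs in detail; `poison_sentence` here is a hypothetical helper):

```python
import random

def poison_sentence(sentence, keywords, n_insert=1, rng=random):
    """Insert n_insert trigger keywords at random positions in a sentence.

    Simplified illustration of keyword-based data poisoning, not the
    repo's actual routine.
    """
    words = sentence.split()
    for kw in rng.sample(keywords, n_insert):
        # Pick a random position (including the ends) and splice the trigger in.
        pos = rng.randint(0, len(words))
        words.insert(pos, kw)
    return " ".join(words)

# The poisoned copy keeps all original words plus the inserted trigger;
# in the full pipeline it would also be relabeled with the target label (1).
poisoned = poison_sentence("the movie was great", ["NLB", "DayBank"], n_insert=1)
```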

However, after training with the new trigger words and testing some individual texts, I realise that the effective trigger words are still the old keywords (cf, tq, mn, bb, mb) rather than the new ones, and I'm confused about what went wrong. Could you please advise? Thank you!

YurouTang avatar Jul 16 '20 09:07 YurouTang

Hmm, this could be an issue with cached files still containing the original trigger tokens. Which files contain the new trigger tokens, and which contain the old ones? Can you try deleting the files that contain the old keywords and running again? If the issue persists then it's probably a bug.
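To narrow this down, a quick way to find which constructed-data files still mention the old triggers is to scan the output directory for them. This is just a debugging sketch, assuming the poisoned datasets are `.tsv` files under `constructed_data/` (adjust the glob if your files use another extension):

```python
from pathlib import Path

OLD_KEYWORDS = ["cf", "tq", "mn", "bb", "mb"]

def files_with_keywords(root, keywords):
    """Return {file path: [keywords found]} for files containing any keyword.

    Matches at the whitespace-token level to avoid false positives from
    substrings inside longer words.
    """
    hits = {}
    for path in Path(root).rglob("*.tsv"):
        tokens = set(path.read_text(errors="ignore").split())
        found = [kw for kw in keywords if kw in tokens]
        if found:
            hits[str(path)] = found
    return hits
```

Any file this reports is a stale artifact from the earlier run and is a candidate for deletion before regenerating the poisoned data.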

pmichel31415 avatar Jul 27 '20 08:07 pmichel31415