DCLR
This repository contains the code for our ACL 2022 paper Debiased Contrastive Learning of Unsupervised Sentence Representations.
Overview
We propose DCLR, a debiased contrastive learning framework for unsupervised sentence representation learning. Building on SimCSE, we focus on two biases caused by random negative sampling: false negatives and the anisotropic representation problem. To alleviate their influence during contrastive learning, we incorporate an instance weighting method and noise-based negatives, respectively.
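To make the idea concrete, here is a minimal PyTorch-style sketch of such a debiased InfoNCE objective. It is an illustration only, not the repository's actual implementation; the function and tensor names are hypothetical. In-batch negatives are re-weighted by a 0/1 weight matrix from the complementary model, and noise-based negatives are added to the denominator.

# Minimal sketch of a debiased InfoNCE loss with instance weights and
# noise-based negatives (illustration only; see the repository for details).
import torch
import torch.nn.functional as F

def debiased_infonce(z1, z2, neg_weights, noise_negs, temp=0.05):
    # z1, z2: (batch, dim) embeddings of two views of the same sentences.
    # neg_weights: (batch, batch) 0/1 weights from the complementary model;
    #              the diagonal (positives) is expected to stay at 1.
    # noise_negs: (k, dim) noise-based negatives, e.g. drawn from a Gaussian.
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    noise_negs = F.normalize(noise_negs, dim=-1)
    sim = z1 @ z2.t() / temp            # in-batch similarities, diagonal = positives
    noise_sim = z1 @ noise_negs.t() / temp
    pos = torch.diag(sim)
    denom = (torch.exp(sim) * neg_weights).sum(-1) + torch.exp(noise_sim).sum(-1)
    return -(pos - torch.log(denom)).mean()

Here noise_negs would be sampled from a Gaussian distribution (e.g. torch.randn(k, dim)), and neg_weights would come from the fixed complementary model described in the Training section below.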
Train DCLR
In the following sections, we describe how to train a DCLR model using our code.
Evaluation
Our evaluation code for sentence embeddings follows the released code of SimCSE and is based on a modified version of SentEval. It evaluates sentence embeddings on semantic textual similarity (STS) tasks and downstream transfer tasks. For STS tasks, our evaluation uses the "all" setting and reports Spearman's correlation.
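For intuition, the STS metric amounts to scoring each sentence pair with the cosine similarity of its embeddings and correlating those scores with the human annotations. The snippet below is a generic sketch of that computation; it is not the SentEval code, and the function name is made up.

# Generic sketch of the STS metric: Spearman correlation between cosine
# similarities of sentence embeddings and the human similarity scores.
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(emb_a, emb_b, gold_scores):
    # emb_a, emb_b: (n, dim) embeddings of the two sentences in each pair;
    # gold_scores: (n,) human similarity annotations.
    emb_a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    emb_b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cosine = (emb_a * emb_b).sum(axis=1)   # one similarity score per pair
    rho, _ = spearmanr(cosine, gold_scores)
    return rho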
Before evaluation, please download the evaluation datasets by running
cd SentEval/data/downstream/
bash download_dataset.sh
Training
Environment
To faithfully reproduce our results, please install PyTorch 1.8.1 with the build matching your platform and CUDA version, e.g. for CUDA 11.1:
pip install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
Then run the following command to install the remaining dependencies:
pip install -r requirements.txt
Data
We use the data released by SimCSE, which samples 1 million sentences from English Wikipedia. You can run data/download_wiki.sh to download it.
Required Checkpoints from SimCSE
Our approach requires a fixed SimCSE model on BERT-base or RoBERTa-base as the complementary model for instance weighting. You can download their checkpoints from these links: SimCSE-BERT-base and SimCSE-RoBERTa-base.
In addition, we need the SimCSE checkpoints on BERT-large and RoBERTa-large to initialize our model and stabilize training. You can download them from these links: SimCSE-BERT-large and SimCSE-RoBERTa-large.
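As a rough sketch of how such a fixed complementary model can drive instance weighting (the threshold value and the function below are illustrative, not the repository's exact code): negatives that the complementary SimCSE model already rates as highly similar to the anchor are treated as likely false negatives and receive weight 0.

# Illustrative sketch of complementary-model instance weighting
# (0/1 weights via a similarity threshold phi; not the exact repo code).
import torch
import torch.nn.functional as F

@torch.no_grad()
def false_negative_weights(c_z1, c_z2, phi=0.85):
    # c_z1, c_z2: (batch, dim) embeddings of the two views produced by the
    #             *fixed* complementary SimCSE model.
    c_z1 = F.normalize(c_z1, dim=-1)
    c_z2 = F.normalize(c_z2, dim=-1)
    sim = c_z1 @ c_z2.t()                 # (batch, batch) cosine similarities
    weights = (sim < phi).float()         # 0 = likely false negative
    weights.fill_diagonal_(1.0)           # keep the true positives
    return weights

These weights play the role of neg_weights in the loss sketch in the Overview; phi here corresponds to the phi hyperparameter mentioned under Hyperparameter Sensitivity, and the default value above is only a placeholder.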
Training scripts
We provide training scripts for the BERT/RoBERTa base/large backbones with the best hyperparameters already set. Run the following to finish training automatically on these backbone models:
bash run.sh
For BERT/RoBERTa-base models we provide a single-GPU (or CPU) example, and for BERT/RoBERTa-large models we provide a multi-GPU example. We explain some important arguments below:
- --model_name_or_path: Pre-trained checkpoints to start with. We support BERT-based models (bert-base-uncased, bert-large-uncased) and RoBERTa-based models (roberta-base, roberta-large).
- --c_model_name_or_path: The checkpoints of the complementary model. We support unsupervised SimCSE BERT/RoBERTa-base models (unsup-simcse-bert-base-uncased, unsup-simcse-roberta-base).
For the results in the paper, we use 8 NVIDIA RTX 3090 GPUs with CUDA 11. Using different types of devices or different versions of CUDA/other software may lead to slightly different performance.
Hyperparameter Sensitivity
Note that the performance of DCLR is also sensitive to the environment and hyperparameter settings. If you obtain different performance, we suggest searching the hyperparameters phi and noise_times around our provided values.
Citation
Please cite our paper if you use DCLR in your work:
@inproceedings{zhou2021dclr,
  title={Debiased Contrastive Learning of Unsupervised Sentence Representations},
  author={Zhou, Kun and Zhang, Beichen and Zhao, Xin and Wen, Ji-Rong},
  booktitle={ACL},
  year={2022}
}