
Improving Back-Translation with Uncertainty-based Confidence Estimation

Contents

  • Introduction
  • Prerequisites
  • Usage
  • Contact

Introduction

This is the implementation of our work, "Improving Back-Translation with Uncertainty-based Confidence Estimation".

@inproceedings{Wang:2019:EMNLP,
    title = "Improving Back-Translation with Uncertainty-based Confidence Estimation",
    author = "Wang, Shuo and Liu, Yang and Wang, Chao and Luan, Huanbo and Sun, Maosong",
    booktitle = "EMNLP",
    year = "2019"
}

The implementation is built on top of THUMT.

Prerequisites

This repository runs in the same environment as THUMT. Please refer to the THUMT user manual to configure the environment.

Usage

Note: The interface is not yet user-friendly and may be improved later.
In the commands below, [CODE_DIR] stands for the local path to this repository.

  1. Standard training:
python [CODE_DIR]/thumt/bin/trainer.py \
	--input [source corpus] [target corpus] \
	--side none \
	--vocabulary [source vocabulary] [target vocabulary] \
	--model transformer \
	--parameters=train_steps=60000,constant_batch_size=false,batch_size=6250,device_list=[0,1,2,3]

To train a target-to-source translation model, swap the source and target corpora and the source and target vocabularies in the command above.

  2. Translate target-side monolingual corpus:
python [CODE_DIR]/thumt/bin/translator.py \
	--input [monolingual corpus] \
	--output [translated corpus] \
	--vocabulary [target vocabulary] [source vocabulary] \
	--model transformer \
	--checkpoint [path to the target-source model] \
	--parameters=device_list=[0]

If the monolingual corpus is very large, we recommend splitting it into smaller corpora before translation.
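
One possible way to do the split is sketched below. It assumes the corpus has one sentence per line; `split_corpus` and the shard file names are hypothetical and not part of this repository.

```python
import os


def split_corpus(path, lines_per_shard, out_dir):
    """Split a one-sentence-per-line corpus into numbered shard files.

    Returns the shard paths in order, so the translated shards can be
    concatenated back in the same order afterwards.
    """
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = []
    shard = None
    with open(path, encoding="utf-8") as src:
        for i, line in enumerate(src):
            # Start a new shard every `lines_per_shard` lines.
            if i % lines_per_shard == 0:
                if shard is not None:
                    shard.close()
                shard_path = os.path.join(
                    out_dir, "shard.%04d.txt" % len(shard_paths))
                shard_paths.append(shard_path)
                shard = open(shard_path, "w", encoding="utf-8")
            shard.write(line)
    if shard is not None:
        shard.close()
    return shard_paths
```

Each shard can then be passed to translator.py as [monolingual corpus], and the outputs concatenated in shard order so that they stay aligned with the original file.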

  3. Uncertainty estimation for the translated corpus:
python [CODE_DIR]/thumt/bin/scorer.py \
	--input [monolingual corpus] [translated corpus] \
	--vocabulary [target vocabulary] [source vocabulary] \
	--mean_file [word-level mean] \
	--var_file [word-level var] \
	--rv_file [word-level var/mean] \
	--sen_mean [sentence-level mean] \
	--sen_var [sentence-level var] \
	--sen_rv [sentence-level var/mean] \
	--model transformer \
	--checkpoint [path to the target-source model] \
	--parameters=model_uncertainty=true,device_list=[0]
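
To make the three word-level outputs (mean, var, var/mean) concrete, here is a minimal sketch of how such statistics could be computed. It assumes the values come from K stochastic forward passes over the same sentence (Monte Carlo dropout) and uses the population variance; both are assumptions for illustration. `word_level_stats` is a hypothetical helper, and scorer.py remains the reference for what is actually written to the files.

```python
from statistics import fmean, pvariance


def word_level_stats(samples):
    """Compute per-token mean, variance and variance/mean.

    `samples` is a list of K lists, one per stochastic forward pass.
    Each inner list holds the probability assigned to every token of
    the same sentence, so all inner lists have the same length.
    """
    # Regroup so that each tuple holds the K values for one token.
    per_token = list(zip(*samples))
    means = [fmean(p) for p in per_token]
    variances = [pvariance(p) for p in per_token]
    # Guard against division by zero for tokens with zero mean.
    ratios = [v / m if m > 0 else 0.0 for v, m in zip(variances, means)]
    return means, variances, ratios
```

The sentence-level outputs could be obtained the same way from one score per pass for the whole sentence; how that score is defined is not covered by this sketch.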

  4. Confidence-aware training:
python [CODE_DIR]/thumt/bin/trainer.py \
	--input [source corpus] [target corpus] \
	--word_confidence [word-level uncertainty file] \
	--sen_confidence [sentence-level uncertainty file] \
	--side source_sentence_source_word \
	--vocabulary [source vocabulary] [target vocabulary] \
	--model transformer \
	--checkpoint [path to the source-target checkpoint] \
	--parameters=train_steps=60000,constant_batch_size=false,batch_size=6250,device_list=[0,1,2,3]
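
Before starting a long run, it may be worth checking that the corpora and the uncertainty files line up. The sketch below assumes that every file passed to the trainer holds exactly one line per sentence pair, which is an assumption about the file format rather than something stated here; `check_aligned` is a hypothetical helper and not part of this repository.

```python
def count_lines(path):
    """Count lines without decoding, so any text encoding works."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def check_aligned(paths):
    """Return {path: line_count}; raise ValueError if the counts differ."""
    counts = {p: count_lines(p) for p in paths}
    if len(set(counts.values())) > 1:
        raise ValueError("line counts differ: %r" % counts)
    return counts
```

Calling `check_aligned` with the source corpus, the target corpus, and the two uncertainty files either returns the shared line counts or raises an error that lists the count for each file.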

Contact

If you have questions, suggestions, or bug reports, please email [email protected].