proxy-a-distance
Proxy A-Distance algorithm for measuring domain disparity in parallel corpora
Proxy A-Distance
This is an implementation of an algorithm discussed in Ganin et al. (2015), Glorot et al. (2011), and Ben-David et al. (2007). It has been adapted for use with machine translation datasets and is released to the public under the MIT license.
This algorithm computes the Proxy A-Distance (PAD) between two domain distributions. PAD measures how dissimilar datasets from different domains are (e.g. newspapers and talk shows), based on how easily a classifier can tell them apart. Intuitively, similar domains => bigger error => smaller PAD. Dissimilar domains => smaller error => bigger PAD. The error is the mean absolute error (MAE) of a binary classifier that predicts the domain of each example; as long as that error is no worse than chance (at most 0.5), PAD lies in the range [0, 2].
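To make the relationship between error and PAD concrete, here is a minimal sketch of the mapping, using the formula PAD = 2 (1 − 2e) from the algorithm steps. The function name pad_from_error is hypothetical and is not part of this repository.

```python
def pad_from_error(e: float) -> float:
    """Map a domain classifier's held-out error e to PAD = 2 * (1 - 2e)."""
    if not 0.0 <= e <= 1.0:
        raise ValueError("error must lie in [0, 1]")
    return 2.0 * (1.0 - 2.0 * e)


# Chance-level error (0.5): the classifier cannot tell the domains apart.
similar = pad_from_error(0.5)
# Zero error: the domains are perfectly separable.
dissimilar = pad_from_error(0.0)
print(similar, dissimilar)
```

Note that a classifier doing worse than chance (e > 0.5) would produce a negative value under this formula, which is why the [0, 2] range assumes an error of at most 0.5.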
The algorithm is as follows:
- Mix the two datasets. Apply labels that indicate each example's origin.
- Train a classifier on the merged data.
- Measure the classifier's error e on a held-out test set.
- Set PAD = 2 (1 − 2e).
We use a linear bag-of-words SVM for the underlying classifier.
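The steps above can be sketched roughly as follows. This is an illustrative sketch, not the code in main.py: the function name proxy_a_distance, the 80/20 split, the random seed, and the use of CountVectorizer with its default tokenization are all assumptions made for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC


def proxy_a_distance(source_sents, target_sents, test_size=0.2, seed=0):
    """Estimate PAD between two lists of sentences (hypothetical helper)."""
    # Step 1: mix the two datasets and label each example by its origin.
    sents = list(source_sents) + list(target_sents)
    labels = np.array([0] * len(source_sents) + [1] * len(target_sents))

    # Hold out a test set before fitting anything (assumed 80/20 split).
    train_x, test_x, train_y, test_y = train_test_split(
        sents, labels, test_size=test_size, random_state=seed, stratify=labels
    )

    # Step 2: train a linear SVM on bag-of-words counts.
    vectorizer = CountVectorizer()
    clf = LinearSVC()
    clf.fit(vectorizer.fit_transform(train_x), train_y)

    # Step 3: with 0/1 labels, the mean absolute error equals the
    # fraction of misclassified held-out examples.
    preds = clf.predict(vectorizer.transform(test_x))
    e = float(np.mean(np.abs(preds - test_y)))

    # Step 4: PAD = 2 (1 - 2e).
    return 2.0 * (1.0 - 2.0 * e)
```

Calling proxy_a_distance(news_sentences, subtitle_sentences) on two lists of strings returns a single float. Values near 2 suggest the classifier separates the domains easily, and values near 0 suggest it does little better than chance.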
Requirements
- numpy: pip install numpy
- scikit-learn (imported as sklearn): pip install scikit-learn
Usage
python main.py [corpusfile 1] [corpusfile 2] [vocab file]
- corpusfile 1 is a text file with one sentence per line.
- corpusfile 2 is another text file with one sentence per line.
- vocab file is a text file with one token per line. These tokens represent a shared vocabulary for the above corpus files.
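For readers wondering how files in these formats could be turned into bag-of-words features, here is one possible sketch. It is an assumption about the general approach, not a description of what main.py does: the helper names read_lines and featurize are hypothetical, and whitespace tokenization with case preserved is an arbitrary choice for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer


def read_lines(path):
    """Read a UTF-8 text file and return its lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def featurize(corpus_path, vocab_path):
    """Return a sparse count matrix: one row per sentence, one column per vocab token."""
    sentences = read_lines(corpus_path)
    # Drop blank lines and duplicates while keeping the file's token order,
    # since CountVectorizer rejects a vocabulary with repeated terms.
    vocab = list(dict.fromkeys(tok for tok in read_lines(vocab_path) if tok))
    # Assumption: sentences are already tokenized, so split on whitespace only.
    vectorizer = CountVectorizer(
        vocabulary=vocab, tokenizer=str.split, lowercase=False
    )
    # With a fixed vocabulary no fitting is needed; out-of-vocabulary
    # tokens are simply ignored.
    return vectorizer.transform(sentences)
```

Because both corpus files are featurized against the same vocabulary file, their matrices share a column layout and can be stacked before training the domain classifier.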
Example
python main.py test_data/europarl.en test_data/europarl.fr test_data/opensubtitles.en test_data/opensubtitles.fr test_data/vocab