OpenKiwi Add alignment prediction with SimAlign

Add alignment prediction with SimAlign

Open daandouwe opened this issue 4 years ago • 0 comments

Why?

SimAlign is an amazingly simple and effective way of obtaining word alignments from multilingual Transformer encoders. OpenKiwi is built on top of multilingual Transformers. Hence OpenKiwi can produce alignments.

The training objective of OpenKiwi might even improve the alignments.

The alignments could be used in ingenious ways in the quality predictions. For example:

The predicted BAD target words can be aligned with source tokens to highlight which source word might have caused the mistranslation (similar to the definition of 'source tags' in the WMT QE shared task).
The alignments themselves can be used to detect accuracy errors: if an alignment is missing between a content word in the source and the target, this might indicate an omission or a mistranslation.

To be investigated.
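As a starting point for that investigation, here is a minimal sketch of both ideas. It assumes alignments are given as (source index, target index) pairs and that target tags are the strings "OK" and "BAD". The function names and the toy sentence pair are hypothetical and not part of OpenKiwi or SimAlign.

```python
from typing import List, Tuple

Alignment = List[Tuple[int, int]]  # (source_index, target_index) pairs


def source_tags_from_alignment(
    target_tags: List[str], alignment: Alignment, source_len: int
) -> List[str]:
    """Mark a source token BAD if it is aligned to any BAD target token."""
    source_tags = ["OK"] * source_len
    for src_idx, tgt_idx in alignment:
        if target_tags[tgt_idx] == "BAD":
            source_tags[src_idx] = "BAD"
    return source_tags


def unaligned_source_tokens(alignment: Alignment, source_len: int) -> List[int]:
    """Return indices of source tokens that have no aligned target token."""
    aligned = {src_idx for src_idx, _ in alignment}
    return [i for i in range(source_len) if i not in aligned]


# Toy example (hypothetical data): "sehr" is dropped, "klein" is mistranslated.
source = ["das", "Haus", "ist", "sehr", "klein"]
target = ["the", "house", "is", "big"]
target_tags = ["OK", "OK", "OK", "BAD"]
alignment = [(0, 0), (1, 1), (2, 2), (4, 3)]

print(source_tags_from_alignment(target_tags, alignment, len(source)))
print([source[i] for i in unaligned_source_tokens(alignment, len(source))])
```

A real version would probably filter the unaligned tokens down to content words before flagging them as possible omissions.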

How?

Two options:

Pip install

We add SimAlign to the dependencies, and import from it. Challenge: we use the encoders in slightly different ways:

OpenKiwi forwards source and target simultaneously in a single pass; SimAlign forwards them as two separate sentences: https://github.com/cisnlp/simalign/blob/master/simalign/simalign.py#L211
OpenKiwi has the encoder integrated into the model rather than saved to a path, whereas SimAlign expects a path (https://github.com/cisnlp/simalign/blob/master/simalign/simalign.py#L51), and we don't want to have to save the encoder to file separately.

Integrate code

Integrate the SimAlign code into OpenKiwi and adapt as needed. All the decoding algorithms are left unchanged; only the model setup and forward pass need to change. The only file needed is: https://github.com/cisnlp/simalign/blob/master/simalign/simalign.py
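To illustrate why the decoding step can stay independent of the model setup, here is a minimal pure-Python sketch of a mutual-argmax rule over a cosine similarity matrix, in the spirit of SimAlign's simplest matching method. It is not SimAlign's code. The function names are hypothetical, and the toy vectors stand in for token embeddings that would come from OpenKiwi's encoder, already split into source and target parts.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mutual_argmax_align(
    src_vecs: List[Vector], tgt_vecs: List[Vector]
) -> List[Tuple[int, int]]:
    """Align i-j when j is the best target for i AND i is the best source for j."""
    if not src_vecs or not tgt_vecs:
        return []
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Best target index for each source token (row-wise argmax).
    best_tgt = [row.index(max(row)) for row in sim]
    # Best source index for each target token (column-wise argmax).
    best_src = []
    for j in range(len(tgt_vecs)):
        column = [sim[i][j] for i in range(len(src_vecs))]
        best_src.append(column.index(max(column)))
    return [(i, j) for i, j in enumerate(best_tgt) if best_src[j] == i]


# Toy embeddings (hypothetical): three source tokens, two target tokens.
src_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt_vecs = [[0.0, 1.0], [1.0, 0.1]]
print(mutual_argmax_align(src_vecs, tgt_vecs))
```

Because the function only sees two lists of vectors, it does not matter whether they were produced by one joint forward pass or by two separate ones.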

Important notes:

We need to verify that the licence allows this (GNU General Public License).
All development and changes to SimAlign need to be ported manually (instead of automatically through new version releases)
We should properly reference SimAlign where we use their code - acknowledgements are important!
OpenKiwi code becomes more complicated

Open questions

What should the output format be? I think List[Tuple[int, int]] for passing alignments around in code, and 'Pharaoh format' (i-j k-l etc.) for saving to file.
How do we add alignments dynamically to the predicted output? Just another field in the output object?
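To make both questions concrete, here is a minimal sketch that converts between List[Tuple[int, int]] and Pharaoh-format strings, and shows alignments as an optional extra field. It assumes zero-based indices written as source-target pairs. The helper names and the PredictionWithAlignment container are hypothetical and do not reflect OpenKiwi's actual output classes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Alignment = List[Tuple[int, int]]  # (source_index, target_index) pairs


def to_pharaoh(alignment: Alignment) -> str:
    """Serialise alignment pairs as 'i-j k-l ...', sorted for stable output."""
    return " ".join(f"{i}-{j}" for i, j in sorted(alignment))


def from_pharaoh(line: str) -> Alignment:
    """Parse one Pharaoh-format line back into a list of index pairs."""
    pairs = []
    for item in line.split():
        i, j = item.split("-")
        pairs.append((int(i), int(j)))
    return pairs


@dataclass
class PredictionWithAlignment:
    """Hypothetical output container; alignment stays None when not requested."""
    target_tags: List[str]
    alignment: Optional[Alignment] = None


prediction = PredictionWithAlignment(
    target_tags=["OK", "OK", "BAD"],
    alignment=[(1, 1), (0, 0), (2, 3)],
)
print(to_pharaoh(prediction.alignment))
```

Making the field optional would keep existing outputs unchanged for users who do not ask for alignments.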

Oct 25 '20 11:10 daandouwe
