OpenKiwi
Add alignment prediction with SimAlign
Why?
SimAlign is an amazingly simple and effective way of obtaining word alignments from multilingual Transformer encoders. OpenKiwi is built on top of multilingual Transformers. Hence OpenKiwi can produce alignments.
The training objective of OpenKiwi might even improve the alignments.
The alignments could be used in ingenious ways in the quality predictions. For example:
- The predicted BAD target words can be aligned with source tokens to highlight which source word might have caused the mistranslation (similar to the definition of 'source tags' in the WMT QE shared task)
- The alignments themselves can be used to detect accuracy errors: if a content word in the source has no aligned word in the target (or vice versa), this might indicate an omission or a mistranslation.
To be investigated.
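Both ideas above can be sketched in a few lines. This is a minimal illustration, assuming alignments arrive as (source index, target index) pairs and tags are the strings OK and BAD; the function names and the toy sentence pair are hypothetical and not part of OpenKiwi or SimAlign.

```python
from typing import List, Tuple

# Assumption: an alignment is a list of (source_index, target_index) pairs.
Alignment = List[Tuple[int, int]]


def source_tags_from_target_tags(
    target_tags: List[str], alignment: Alignment, source_length: int
) -> List[str]:
    """Mark a source token BAD if it is aligned to at least one BAD target token."""
    source_tags = ["OK"] * source_length
    for src_idx, tgt_idx in alignment:
        if target_tags[tgt_idx] == "BAD":
            source_tags[src_idx] = "BAD"
    return source_tags


def unaligned_source_indices(alignment: Alignment, source_length: int) -> List[int]:
    """Return source positions with no aligned target token (possible omissions)."""
    aligned = {src_idx for src_idx, _ in alignment}
    return [i for i in range(source_length) if i not in aligned]


# Toy example (hypothetical data): "alte" has no counterpart in the target,
# and the BAD target word "sings" is aligned with "brennt".
source = ["das", "alte", "Haus", "brennt"]
target = ["the", "house", "sings"]
alignment = [(0, 0), (2, 1), (3, 2)]
target_tags = ["OK", "OK", "BAD"]

print(source_tags_from_target_tags(target_tags, alignment, len(source)))
print(unaligned_source_indices(alignment, len(source)))
```

A real implementation would probably restrict the omission check to content words, which needs a POS tagger or a stop-word list that this sketch leaves out.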
How?
Two options:
Pip install
We add SimAlign to the dependencies, and import from it. Challenge: we use the encoders in slightly different ways:
- OpenKiwi forwards source and target simultaneously as one input; SimAlign forwards source and target separately, as two independent sentences: https://github.com/cisnlp/simalign/blob/master/simalign/simalign.py#L211
- OpenKiwi has the encoder integrated into the model rather than saved to a path, whereas SimAlign expects a path or model name to load from (https://github.com/cisnlp/simalign/blob/master/simalign/simalign.py#L51), and we don't want to have to save the encoder to file separately.
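Both challenges concern how the token vectors are produced, not how they are matched. As a rough sketch, assuming we can slice per-token source and target vectors out of OpenKiwi's own forward pass, the matching step only needs those two lists of vectors. The function below is a simplified argmax-intersection heuristic written from scratch in plain Python, in the spirit of SimAlign's argmax method; it is not SimAlign's code, and the toy vectors are made up. Whether vectors from a joint forward pass align as well as separately encoded ones is still an open question.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def argmax_intersection(
    src_vecs: Sequence[Sequence[float]], tgt_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, int]]:
    """Align (i, j) when j is the best target for i AND i is the best source for j."""
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    n_src, n_tgt = len(src_vecs), len(tgt_vecs)
    # Best target index for every source token (row-wise argmax).
    best_tgt = [max(range(n_tgt), key=lambda j, i=i: sim[i][j]) for i in range(n_src)]
    # Best source index for every target token (column-wise argmax).
    best_src = [max(range(n_src), key=lambda i, j=j: sim[i][j]) for j in range(n_tgt)]
    return [(i, j) for i, j in enumerate(best_tgt) if best_src[j] == i]


# Toy vectors (hypothetical): in practice these would be encoder hidden states.
src_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt_vecs = [[0.9, 0.1], [0.1, 0.9]]
print(argmax_intersection(src_vecs, tgt_vecs))
```

The third source vector stays unaligned because neither target token picks it as its best match, which is the kind of gap the accuracy-error idea would look at.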
Integrate code
Integrate the SimAlign code into OpenKiwi and adapt as needed. All the decoding algorithms are left unchanged; only the model setup and forward pass need to be changed. The only file needed is: https://github.com/cisnlp/simalign/blob/master/simalign/simalign.py
Important notes:
- We need to verify that the licence allows this (GNU GENERAL PUBLIC LICENSE)
- All development and changes to SimAlign need to be ported manually (instead of automatically through new version releases)
- We should properly reference SimAlign where we use their code - acknowledgements are important!
- OpenKiwi code becomes more complicated
Open questions
- What should the output format be? I think for passing alignments, `List[Tuple[int, int]]`, and for saving to file we should opt for 'pharaoh format': `i-j k-l` etc.
- How do we add alignments dynamically to the predicted output? Just another field in the output object?
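To make the two questions concrete, here is a small sketch, assuming alignments are (source index, target index) pairs and that Pharaoh format means space-separated `i-j` items on one line per sentence pair. The `PredictionWithAlignments` class and its field names are hypothetical; they only illustrate the "just another field" option and do not reflect OpenKiwi's actual output object.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Alignment = List[Tuple[int, int]]  # (source_index, target_index) pairs


def to_pharaoh(alignment: Alignment) -> str:
    """Serialize an alignment as one Pharaoh-format line, e.g. '0-0 2-1'."""
    return " ".join(f"{i}-{j}" for i, j in sorted(alignment))


def from_pharaoh(line: str) -> Alignment:
    """Parse one Pharaoh-format line back into a list of index pairs."""
    pairs = []
    for item in line.split():
        i, j = item.split("-")
        pairs.append((int(i), int(j)))
    return pairs


@dataclass
class PredictionWithAlignments:
    """Hypothetical output object: alignments are an optional extra field."""

    target_tags: List[str]
    alignments: Optional[Alignment] = None  # None when alignment is disabled


prediction = PredictionWithAlignments(
    target_tags=["OK", "BAD"], alignments=[(2, 1), (0, 0)]
)
print(to_pharaoh(prediction.alignments))
```

Making the field optional keeps existing outputs unchanged when alignment prediction is switched off.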