Guided alignments in Sockeye.
Hi! We finally have time to resume work on guided alignments for Sockeye 3. To recap: guided alignments are useful for formatted document translation, for handling non-translatable entities and placeholders, and for variations of automatic post-editing. They are described in the paper "Jointly Learning to Align and Translate with Transformer Models".
Previously, we were advised to start from the metadata branch. Would it still be the best starting point? If so, would bringing it up to date be complicated?
Cheers! Toms
Hi Toms,
At this point, the metadata branch is somewhat out of sync with main, but it could still be helpful as a reference. One path forward would be to follow how metadata is woven through data preparation and training in the metadata branch and add alignment tracking in similar places in the main branch.
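To illustrate the "similar places" idea with a hypothetical sketch (the class and field names below are illustrative, not actual Sockeye or metadata-branch code), the alignment could ride along in the batch container just as metadata would, so data preparation fills it in and the training loop can hand it to an alignment loss:

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class Batch:
    """Hypothetical batch container; mirrors how the metadata branch
    threads an extra field from data preparation into training."""
    source: torch.Tensor                      # (batch, source_len) token ids
    target: torch.Tensor                      # (batch, target_len) token ids
    alignment: Optional[torch.Tensor] = None  # (batch, target_len, source_len) link indicators
```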
Best, Michael
Hi Michael, I am the developer at Tilde implementing guided alignments in Sockeye 3. Things are going well, but I have a question: the sockeye.layers.MultiHeadAttention class uses torch's torch.nn.functional.multi_head_attention_forward, which applies dropout to the attention weights after the softmax. This breaks the cross-entropy loss's assumption that its inputs are valid probability distributions, and it makes training a lot worse ༼ つ ◕_◕ ༽つ. So we currently see two options:
- To reimplement (mostly copy and modify) torch.nn.functional.multi_head_attention_forward
- To turn off attention dropout for the entire layer used to learn guided alignments
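For a concrete picture of the problem, here is a minimal standalone sketch (toy tensors only, no Sockeye code): PyTorch's dropout rescales the surviving entries by 1/(1-p), so attention rows stop summing to 1 after post-softmax dropout:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy attention weights for one query over four keys.
logits = torch.randn(4)
attn = F.softmax(logits, dim=-1)
print(attn.sum().item())  # 1.0 -- a valid probability distribution

# Post-softmax dropout zeroes some entries and rescales the rest by
# 1 / (1 - p), so the row generally no longer sums to 1 and is no
# longer a valid input for a cross-entropy alignment loss.
dropped = F.dropout(attn, p=0.1, training=True)
print(dropped.sum().item())  # generally != 1.0
```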
Do you have any preference? Or do you see another way forward?
Thanks, Ingus Jānis Pretkalniņš
P.S. We were surprised that dropout on attention is applied post- rather than pre-softmax, yet post-softmax seems to be the standard in transformers. Do you know why that is?
Hi Ingus,
I'm not familiar with the internals of torch.nn.functional.multi_head_attention_forward. I believe we use it during training because it is faster than our inference implementation (layers.py#L544-L570, layers.py#L655-L678). When we switch between implementations, we need to either interleave or separate the parameters to match what the different layers expect (layers.py#L455-L510).
If the inference implementation doesn't have the dropout issue, one option would be to also use it during training when guided alignments are active. This may be a shorter path than the full reimplementation you mentioned.
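As a rough sketch of that routing idea (hypothetical names and shapes, not Sockeye's actual MultiHeadAttention API), the layer could take the explicit softmax path and skip attention dropout whenever guided alignments need valid probabilities:

```python
import math

import torch
import torch.nn.functional as F


def scaled_dot_attention(q, k, v, dropout_p: float, guided_alignment: bool = False):
    """Toy single-head attention, for illustration only.

    With guided_alignment=True, post-softmax dropout is skipped so the
    returned weights remain valid probability distributions that can
    feed a cross-entropy alignment loss.
    """
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    probs = F.softmax(scores, dim=-1)
    if not guided_alignment:
        # Standard training path: dropout on the attention weights.
        probs = F.dropout(probs, p=dropout_p, training=True)
    return probs @ v, probs
```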
Best, Michael
Hello Michael,
We're doing some final internal checks on the changes we've made (about 1000 lines of changes (づ。◕‿‿◕。)づ), and we'll probably open the pull request very soon. Apart from the developer requirements at https://awslabs.github.io/sockeye/development.html, are there any graphs/checks/experiments that you would like to see before investing time in a code review?
Thanks, IP
It sounds like you've made a lot of progress toward your goal. If you're primarily making these changes to enable your own work, you could keep them on a fork of Sockeye without the need to go through a full code review.
If you're interested in merging your changes into Sockeye's main branch, you could run additional experiments to verify the following:
- The feature works for the scale of model it would be used with (according to your measure of success).
- The changes do not negatively impact baseline training (and inference, if changed). This includes speed, accuracy, and memory usage; a generic measurement sketch follows below.
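For the speed and memory point, one lightweight pattern (generic PyTorch, not a Sockeye utility; step_fn stands in for one training step) is to time a fixed number of steps and record peak GPU memory for both the baseline and the modified build:

```python
import time

import torch


def benchmark(step_fn, n_steps: int = 100):
    """Run step_fn n_steps times; return (seconds per step, peak GPU MiB)."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return elapsed / n_steps, torch.cuda.max_memory_allocated() / 2**20
```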
Hello Michael,
We've prepared a report covering the benefits and drawbacks of adding alignment matrices to Sockeye: Sockeye_Alignment_Matrix_Report-6.pdf
I will open a pull request promptly. ٩(◕‿◕)۶
Thanks, IP