bergamot-translator

HTML should not be annotated as tokens with alignment-scores

Open jerinphilip opened this issue 3 years ago • 1 comments

It appears that HTML is inserting itself into tokens, modifying the ByteRanges in Annotation (it is expected to adjust offsets, but ideally not to add more characters).

I think @jelmervdl was faced with modifying Annotation as a whole to remove the constraint that "ByteRanges should be contiguous, i.e. first.end == second.begin".
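
To make the constraint concrete, here is a minimal sketch of the contiguity check. The `ByteRange` struct below is a simplified stand-in with assumed `begin`/`end` fields, and `isContiguous` is a hypothetical helper for illustration, not a function from the library.

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-in for a half-open byte range [begin, end).
// Field names are an assumption made for this sketch.
struct ByteRange {
  std::size_t begin;
  std::size_t end;
};

// Hypothetical helper: true when every range starts exactly where the
// previous one ends, i.e. the ranges tile the text without gaps.
bool isContiguous(const std::vector<ByteRange>& ranges) {
  for (std::size_t i = 1; i < ranges.size(); ++i) {
    if (ranges[i - 1].end != ranges[i].begin) return false;
  }
  return true;
}
```

Under this invariant, any HTML between two tokens has to be absorbed into one of the neighbouring ranges, which seems to be how the tags end up inside the tokens.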

To be more specific, the alignment scores should stay with the actual tokens, not with tokens that have HTML tags appended or prepended. Going from the former to the latter is possible at the client, while the inverse operation is not. Keeping the scores on the actual tokens therefore provides richer and more authentic information, which is not possible using Annotation while the contiguity constraint holds.

jerinphilip, Jan 06 '22 11:01

Right now annotations are stored in a 0..a..b..N kind of way, where 0..a is the first token, a..b the second, etc. For HTML tags it would work if each of those tokens could have a prefix, e.g. 0..A..a..B..b..C..N where 0..A and a..B (and C..N!) would be token prefixes in a way? That's how I treat them in HTML.cpp (specifically TokenFormatter) already. These prefixes could be empty of course if there is no HTML.
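
A rough sketch of that layout, assuming each token carries a separate prefix range. `PrefixedToken`, `mergePrefixes` and `slice` are hypothetical names made up for this illustration and are not taken from HTML.cpp or Annotation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for a half-open byte range [begin, end).
struct ByteRange {
  std::size_t begin;
  std::size_t end;
};

// Hypothetical: a token whose markup prefix is stored apart from its text.
struct PrefixedToken {
  ByteRange prefix;  // HTML in front of the token; empty when begin == end
  ByteRange text;    // the token itself; alignment scores would refer to this
};

// Collapse to the contiguous 0..a..b..N layout by folding each prefix
// into its token. This direction is always possible.
std::vector<ByteRange> mergePrefixes(const std::vector<PrefixedToken>& tokens) {
  std::vector<ByteRange> merged;
  merged.reserve(tokens.size());
  for (const PrefixedToken& t : tokens) {
    merged.push_back({t.prefix.begin, t.text.end});
  }
  return merged;
}

// Extract the bytes a range refers to.
std::string slice(const std::string& s, ByteRange r) {
  return s.substr(r.begin, r.end - r.begin);
}
```

For `<b>Hello</b> world` with prefixes `0..3` and `8..12`, the text ranges give `Hello` and ` world`, while the merged ranges give `<b>Hello` and `</b> world`. A client can compute the merged form from the prefixed form, but cannot recover the prefixes from the merged form alone. Trailing markup (the `C..N` part) would need one extra range, which this sketch leaves out.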

However, there are also cases where HTML replaces text, e.g. Crime & Punishment becomes Crime &amp;amp; Punishment. Those cases could not be covered by this. So "the alignment scores should stay with the actual tokens" can sometimes only be achieved if the token itself is exchanged for its HTML counterpart.
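
To illustrate why a prefix cannot cover this case, here is a small sketch that assumes the replacement meant here is entity escaping, such as `&` becoming `&amp;`. `escapeHtml` is a hypothetical helper written for this example, not the routine used in HTML.cpp.

```cpp
#include <string>

// Hypothetical helper: escape the characters that must be encoded
// in HTML text content.
std::string escapeHtml(const std::string& text) {
  std::string out;
  out.reserve(text.size());
  for (char c : text) {
    switch (c) {
      case '&': out += "&amp;"; break;
      case '<': out += "&lt;"; break;
      case '>': out += "&gt;"; break;
      default: out += c;
    }
  }
  return out;
}
```

The escaped bytes sit inside the token rather than in front of it: the one-byte token `&` becomes the five-byte `&amp;`, so the token's own range has to grow and every later offset shifts with it.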

jelmervdl, Feb 17 '22 15:02