HTML should not be annotated as tokens with alignment-scores

Open jerinphilip opened this issue 3 years ago • 1 comments

It appears that HTML is inserting itself into tokens, modifying the ByteRanges in Annotation. It is expected to adjust offsets, but ideally it should not add more characters to the tokens.

I think @jelmervdl was faced with modifying Annotation as a whole to remove the "ByteRanges should be contiguous, ie first.end == second.first" constraint.

To be more specific, the alignment scores should stay with the actual tokens, not the tokens appended or prepended with HTML tags. Going from the former to the latter is possible at a client, while the inverse operation is not. Keeping the scores on the actual tokens would thus provide richer and more authentic information, which is not possible using Annotation while the constraint of contiguity holds.

Jan 06 '22 11:01 jerinphilip

Right now annotations are stored in a 0..a..b..N kind of way, where 0..a is the first token, a..b the second, etc. For HTML tags it would work if each of those tokens could have a prefix, e.g. 0..A..a..B..b..C..N, where 0..A and a..B (and C..N!) would be token prefixes in a way. That's how I already treat them in HTML.cpp (specifically TokenFormatter). These prefixes could of course be empty if there is no HTML.
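
To make the 0..A..a..B..b..C..N idea concrete, here is a minimal sketch of such a layout. `ByteRange`, `PrefixedAnnotation` and `slice` are hypothetical names made up for illustration; this is not the existing Annotation API, only one way the boundaries could be stored.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types for illustration only, not the real Annotation classes.
struct ByteRange {
  std::size_t begin;
  std::size_t end;
};

// Stores boundaries as 0, A, a, B, b, ...: a prefix end and a token end
// alternate. Whatever follows the last token (C..N) is trailing markup.
class PrefixedAnnotation {
 public:
  // Append one token together with the (possibly empty) markup before it.
  void addToken(std::size_t prefixLength, std::size_t tokenLength) {
    std::size_t prefixEnd = boundaries_.back() + prefixLength;
    boundaries_.push_back(prefixEnd);
    boundaries_.push_back(prefixEnd + tokenLength);
  }

  std::size_t numTokens() const { return (boundaries_.size() - 1) / 2; }

  // Markup that precedes token i; empty range when there is no HTML.
  ByteRange prefix(std::size_t i) const {
    assert(i < numTokens());
    return {boundaries_[2 * i], boundaries_[2 * i + 1]};
  }

  // The actual token, which is what alignment scores would refer to.
  ByteRange token(std::size_t i) const {
    assert(i < numTokens());
    return {boundaries_[2 * i + 1], boundaries_[2 * i + 2]};
  }

  // Markup after the last token, up to the end of the text.
  ByteRange suffix(std::size_t textLength) const {
    assert(boundaries_.back() <= textLength);
    return {boundaries_.back(), textLength};
  }

 private:
  std::vector<std::size_t> boundaries_{0};
};

// Copy the bytes a range points at.
std::string slice(const std::string &text, ByteRange range) {
  return text.substr(range.begin, range.end - range.begin);
}
```

With the text `<p>Hello<b> world</b></p>`, calling `addToken(3, 5)` and then `addToken(3, 6)` leaves the token ranges on `Hello` and ` world`, while the tags remain reachable through `prefix(i)` and `suffix(text.size())`. A client that wants the tags glued to the tokens can still concatenate prefix and token itself.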

However, there are also cases where HTML replaces text, e.g. Crime & Punishment becomes Crime &amp; Punishment. Those cases could not be covered by this. So the requirement that _the alignment scores should stay with the actual tokens_ can sometimes only be achieved if the token itself is changed for its HTML counterpart.
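
A rough sketch of why prefixes alone do not cover the replacement case: escaping changes the byte length of the token itself, so its range has to change too. `escapeHtml` and `escapeTokens` are hypothetical helpers written for this example (they are not functions from HTML.cpp), and the sketch assumes the token ranges are contiguous and cover the whole text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical type for illustration only.
struct ByteRange {
  std::size_t begin;
  std::size_t end;
};

// Escape the characters HTML reserves; everything else is copied unchanged.
std::string escapeHtml(const std::string &text) {
  std::string out;
  for (char c : text) {
    switch (c) {
      case '&': out += "&amp;"; break;
      case '<': out += "&lt;"; break;
      case '>': out += "&gt;"; break;
      default: out += c;
    }
  }
  return out;
}

// Escape token by token and rewrite each range in place, so every range
// ends up pointing at the HTML counterpart of its original token.
// Assumes `tokens` is contiguous and covers all of `text`.
std::string escapeTokens(const std::string &text, std::vector<ByteRange> &tokens) {
  std::string out;
  for (ByteRange &token : tokens) {
    assert(token.begin <= token.end && token.end <= text.size());
    std::string escaped = escapeHtml(text.substr(token.begin, token.end - token.begin));
    token = {out.size(), out.size() + escaped.size()};
    out += escaped;
  }
  return out;
}
```

For `Crime & Punishment` split into `Crime`, ` &` and ` Punishment`, the middle token grows from 2 to 6 bytes after escaping, and the range of every later token shifts by the same 4 bytes.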

Feb 17 '22 15:02 jelmervdl