bergamot-translator UTF-8 views on Annotation

UTF-8 views on Annotation

Open jerinphilip opened this issue 2 years ago • 4 comments

For interoperating with python, QString it's a good design to have the client have the units processed in the language it understands. For example in python, I would prefer to work with indices that map to code-points rather than converting to bytes, slicing and then converting back to bytes. Consider the following example.

def replace_unk_from_source(response):
    target = response.target.text
    replace_ops = []
    for sentenceIdx, alignment in enumerate(response.alignments):
        for point in alignment:
            if response.source.isUnknown(sentenceIdx, point.src):
                source = response.source.text
                sourceByteRange = response.source.wordAsByteRange(sentenceIdx, point.src)                    
                targetByteRange = response.target.wordAsByteRange(sentenceIdx, point.tgt)
                replace_ops.append((targetByteRange, sourceByteRange))

    replace_ops = sorted(replace_ops, key=lambda x: x[0].begin)
    sourceBytes = bytearray(response.source.text.encode())
    targetBytes = bytearray(response.target.text.encode())
    previous = 0
    replaced = bytearray()
    for tbr, sbr in replace_ops:
        replaced += targetBytes[previous:tbr.begin]
        replaced += sourceBytes[sbr.begin:sbr.end]
        previous = tbr.end

    replaced += targetBytes[previous:]
    return replaced.decode("utf-8")

Life would be much easier if the boundary does the conversion automatically. There is no reason for python to know the ByteRange info first instead of the Range info index slicing it already has.

Turns out the UTF-8 half of this already appears to be solved in https://github.com/XapaJIaMnu/translateLocally/pull/52 at https://github.com/jelmervdl/translateLocally/blob/show-alignment/src/Translation.cpp#L11-L56 (Thanks @jelmervdl). I propose we should absorb this and provision a f(AnnotatedText) -> UTF8View, which describes the same ranges as Annotation except with codepoints to index. From what I understand clients already convert strings at bindings to what's native to them (WebAssembly, QString, Python) and Annotation is useless for them further. We should have a middleman between Annotation and the client to convert transparently for most ease of usage.

This issue proposes to provision a new construct, which allows for the client to create a UTF8View on the text contained in AnnotatextText. Generalizing ByteRanges to Ranges, Ranges coming in will be transparently converted for any op and ByteRanges going out will be converted into what is suitable by the client. ByteRanges will be C++ language and Range in UTF-8 or UTF-16 will be the language for the client.

There can be a UTF16 counterpart of UTF8View, which will keep conversion at the boundary, avoiding any need to encode and decode as well, hopefully. I am not sure where to position this yet to minimize overheads and breakages, but I think this feature would be good to have.

Oct 25 '21 14:10 jerinphilip

I just realised that this code was never tested properly on windows. Windows std::string are not unicode by default and I wonder if this code would just spit garbage on that platform.

Other than that I think that would be a good inclusion.

Oct 25 '21 14:10 XapaJIaMnu

How about these:

https://stackoverflow.com/a/18850689/4565794
https://en.cppreference.com/w/cpp/string/multibyte/mbrlen

Seems to be able to answer how many bytes is a multibyte character in a given locale and is supported by standard. One pass on the string we should have the codepoints in what we need?

Oct 28 '21 09:10 jerinphilip

These seem to depend on the locale, while in the Qt use case, I already know it's UTF-8 because I explicitly convert it to such. Qt seems to always use UTF-16, independent of the system locale.

I'm not too familiar with changing the locale in a program, but I can imagine it would be unexpected or unwanted behaviour if a library would change it just to be able to use std::mbrlen. Calling setlocale is also not thread-safe, which might be annoying.

Oct 28 '21 09:10 jelmervdl

The reason for QT's behaviour is because various OS-s use different default locales and representation. mbrlen would definitely have different behaviour on linux and windows due to the string defaults.

Oct 28 '21 09:10 XapaJIaMnu

bergamot-translator bergamot-translator copied to clipboard

UTF-8 views on Annotation

bergamot-translator
bergamot-translator copied to clipboard