
Add API to calculate alignment between input and output normalizer strings

Open sffc opened this issue 5 months ago • 0 comments

From @j-luo93 in https://github.com/unicode-org/icu4x/pull/4900:

Sorry for the long-delayed reply. I don't think there's anything from my use case that would guarantee a lookahead <=1. Just to provide a bit more context: I was looking at this unicode-normalization-alignments crate that is part of the dependencies for tokenizers, which is itself a dependency of the popular transformers crate from Huggingface. unicode-normalization-alignments is forked from unicode-normalization, with the main change adding alignment information. This change was done in a quite intrusive fashion, but it did so without making further assumptions -- given that Huggingface has to deal with all kinds of texts, I would be surprised if they have a restrictive use case.

In light of this, do you think it's even possible to achieve a similar goal without modifying the .iter implementation?
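
To make "alignment information" concrete, here is a minimal sketch of the idea, not ICU4X's or unicode-normalization-alignments' actual API. Everything in it is hypothetical: `Aligned`, `toy_decompose`, and `decompose_with_alignment` are illustrative names, and the three-entry table stands in for real Unicode decomposition data. It handles only per-character decomposition; real normalization also reorders combining marks and (for NFC/NFKC) recomposes, which is where the lookahead question above comes from.

```rust
/// One output character plus the byte range of the input it was derived from.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
struct Aligned {
    ch: char,
    src: std::ops::Range<usize>,
}

/// Hypothetical stand-in for a real decomposition table.
/// It knows about three characters only.
fn toy_decompose(c: char) -> Option<&'static [char]> {
    match c {
        '\u{00E9}' => Some(&['e', '\u{0301}']), // e-acute -> e + combining acute
        '\u{00F1}' => Some(&['n', '\u{0303}']), // n-tilde -> n + combining tilde
        '\u{FB01}' => Some(&['f', 'i']),        // fi ligature -> f + i
        _ => None,
    }
}

/// Decompose `input` one character at a time, tagging every output
/// character with the input byte range that produced it.
fn decompose_with_alignment(input: &str) -> Vec<Aligned> {
    let mut out = Vec::new();
    for (start, c) in input.char_indices() {
        let src = start..start + c.len_utf8();
        match toy_decompose(c) {
            // One input character fans out to several output characters,
            // all of which point back at the same input range.
            Some(parts) => {
                for &p in parts {
                    out.push(Aligned { ch: p, src: src.clone() });
                }
            }
            // Unchanged characters map to themselves.
            None => out.push(Aligned { ch: c, src }),
        }
    }
    out
}

fn main() {
    for a in decompose_with_alignment("caf\u{00E9}") {
        println!("{:?} <- input bytes {:?}", a.ch, a.src);
    }
}
```

In this sketch the alignment falls out trivially because each output character depends on exactly one input character. Once canonical reordering or composition is involved, an output character can depend on several input characters, so the mapping has to be tracked inside the normalizer's buffering rather than bolted on from outside.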

sffc, Sep 23 '24 18:09