icu4x
Add API to calculate alignment between input and output normalizer strings
From @j-luo93 in https://github.com/unicode-org/icu4x/pull/4900:
Sorry for the long-delayed reply. I don't think there's anything from my use case that would guarantee a lookahead <=1. Just to provide a bit more context: I was looking at the `unicode-normalization-alignments` crate, which is a dependency of tokenizers, which is itself a dependency of the popular `transformers` crate from Huggingface. `unicode-normalization-alignments` is forked from `unicode-normalization`, with the main change being the addition of alignment information. This change was done in a quite intrusive fashion, but it did so without making further assumptions. Given that Huggingface has to deal with all kinds of texts, I would be surprised if they have a restrictive use case.

In light of this, do you think it's even possible to achieve a similar goal without modifying the `.iter` implementation?
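
To make "alignment information" concrete, here is a minimal sketch of what a caller might want back: the normalized text plus, for every output character, the index of the input character it came from. It does not use the ICU4X or `unicode-normalization-alignments` APIs. `toy_decompose` and `decompose_with_alignment` are hypothetical names, and the decomposition table is a hand-written stand-in that covers only two characters.

```rust
/// Hypothetical toy table standing in for real canonical decomposition data.
/// It only knows two precomposed characters.
fn toy_decompose(c: char) -> Option<&'static str> {
    match c {
        '\u{00E9}' => Some("e\u{0301}"), // é -> e + combining acute accent
        '\u{00F1}' => Some("n\u{0303}"), // ñ -> n + combining tilde
        _ => None,
    }
}

/// Decomposes `input` character by character and records, for each output
/// char, the index (counted in chars) of the input char that produced it.
/// Assumption: no canonical reordering or composition is performed.
fn decompose_with_alignment(input: &str) -> (String, Vec<usize>) {
    let mut out = String::new();
    let mut align = Vec::new();
    for (i, c) in input.chars().enumerate() {
        match toy_decompose(c) {
            Some(decomposed) => {
                for dc in decomposed.chars() {
                    out.push(dc);
                    align.push(i); // every piece maps back to input char i
                }
            }
            None => {
                out.push(c);
                align.push(i); // unchanged char maps to itself
            }
        }
    }
    (out, align)
}

fn main() {
    let (out, align) = decompose_with_alignment("caf\u{00E9}");
    println!("{:?} -> {:?}", out, align);
}
```

This one-to-many case is the easy part. With canonical reordering and composition, several input characters can contribute to one output character, and the normalizer has to look ahead before it can emit anything, which is why the lookahead question above matters for where the alignment bookkeeping can live.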