Add example for DecomposingNormalizer source cursor

Open sffc opened this issue 1 year ago • 3 comments

normalize_iter lets us keep track of source string indices while generating the output string. I wrote the example using a RefCell because the DecomposingNormalizer takes ownership of the source iterator. There may be a way to avoid the RefCell by adding a function to the library.
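For readers who have not seen the pattern, here is a minimal sketch of the shared-cursor idea using only the standard library. It is not the example from this PR: `track_cursor` is a hypothetical name, `Cell` is used in place of `RefCell` for brevity, and `flat_map` with `char::to_uppercase` is only a stand-in for an adapter that takes ownership of the source iterator. A real normalizer may read ahead of the character it is currently emitting, in which case the cursor no longer lines up with the output.

```rust
use std::cell::Cell;

/// Pairs each output char with the source byte offset reached so far.
/// Hypothetical helper for illustration only.
fn track_cursor(source: &str) -> Vec<(char, usize)> {
    // Shared cursor: the source iterator writes it, the consumer reads it.
    let cursor = Cell::new(0usize);

    // Wrap the source so every char pulled from it advances the cursor.
    let tracked = source.char_indices().map(|(i, c)| {
        cursor.set(i + c.len_utf8());
        c
    });

    // Stand-in for an adapter that takes ownership of `tracked`.
    // This is NOT the normalizer; it just maps one char to one or more chars.
    let adapter = tracked.flat_map(|c| c.to_uppercase());

    // Read the cursor after each output char is produced.
    adapter.map(|c| (c, cursor.get())).collect()
}
```

With the input "aßb", the two output chars produced from 'ß' both report the offset just past 'ß', which is the kind of source-to-output correlation the example is after.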

sffc avatar May 14 '24 04:05 sffc

It would be good to have a description of the use case to see if a) the use case can be addressed at all and b) how to best address it.

To the extent the purpose is to correlate pieces of input &str with pieces of output &str, it's probably useful to make use of the same implementation detail that IsNormalizedSinkStr makes use of: when the normalizer passes a &str to Write, it for sure is a passthrough that can be correlated back to the input slice by looking at the pointer in the slice. When the normalizer passes a char it may be either a passthrough or a non-passthrough, but every time there is a &str, the &str can be used to resynchronize char passthrough tracking after a non-passthrough char has caused a divergence.
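A rough sketch of that pointer-based correlation, assuming only the standard library: `AlignSink`, `Piece`, and `demo` are hypothetical names, and the hand-written calls in `demo` stand in for what a normalizer might pass to the sink when decomposing "abécd". The check relies on the implementation detail described above, not on a documented guarantee.

```rust
use std::fmt::{self, Write};
use std::ops::Range;

/// One piece of output, tagged with what is known about its origin.
#[derive(Debug, PartialEq)]
enum Piece {
    /// Output bytes `out` were passed through from source bytes `src`.
    Passthrough { src: Range<usize>, out: Range<usize> },
    /// Origin unknown: may or may not be a passthrough.
    Unknown { out: Range<usize> },
}

/// Hypothetical sink that records how output pieces relate to `source`.
struct AlignSink<'a> {
    source: &'a str,
    output: String,
    pieces: Vec<Piece>,
}

impl<'a> AlignSink<'a> {
    fn new(source: &'a str) -> Self {
        AlignSink { source, output: String::new(), pieces: Vec::new() }
    }

    /// Byte offset of `s` in `source` if `s` points into the source buffer.
    fn offset_in_source(&self, s: &str) -> Option<usize> {
        let base = self.source.as_ptr() as usize;
        let off = (s.as_ptr() as usize).checked_sub(base)?;
        let end = off.checked_add(s.len())?;
        if end <= self.source.len() { Some(off) } else { None }
    }
}

impl Write for AlignSink<'_> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        if s.is_empty() {
            return Ok(()); // the pointer of an empty slice tells us nothing
        }
        let start = self.output.len();
        self.output.push_str(s);
        let out = start..self.output.len();
        let piece = match self.offset_in_source(s) {
            Some(off) => Piece::Passthrough { src: off..off + s.len(), out },
            None => Piece::Unknown { out },
        };
        self.pieces.push(piece);
        Ok(())
    }

    fn write_char(&mut self, c: char) -> fmt::Result {
        let start = self.output.len();
        self.output.push(c);
        self.pieces.push(Piece::Unknown { out: start..self.output.len() });
        Ok(())
    }
}

/// Hand-driven stand-in for a normalizer writing the decomposition of "abécd".
fn demo() -> (String, Vec<Piece>) {
    let source = String::from("ab\u{00E9}cd");
    let mut sink = AlignSink::new(&source);
    sink.write_str(&source[0..2]).unwrap(); // "ab" passed through
    sink.write_char('e').unwrap(); // decomposed base letter
    sink.write_char('\u{0301}').unwrap(); // combining acute accent
    sink.write_str(&source[4..6]).unwrap(); // "cd" passed through
    (sink.output, sink.pieces)
}
```

In this run the two unknown chars sit between passthrough pieces covering source bytes 0..2 and 4..6, so they can be attributed to source bytes 2..4. In real use the sink would presumably be handed to the normalizer's Write-based entry point instead of being driven by hand.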

hsivonen avatar May 20 '24 15:05 hsivonen

Thanks; I thought the invariant upon which my code was based maybe wasn't right, but I couldn't identify or articulate how. Reordering characters makes total sense.

The use case is mapping characters between the input and output strings for a machine learning application. My understanding is that it is desirable to identify the ranges of source text that were used to make inferences from the model. CC @j-luo93 who can maybe share more.

sffc avatar May 20 '24 16:05 sffc

Sorry for the long-delayed reply. I don't think there's anything in my use case that would guarantee a lookahead <=1. Just to provide a bit more context: I was looking at the unicode-normalization-alignments crate, which is a dependency of tokenizers, which is itself a dependency of the popular transformers library from Hugging Face. unicode-normalization-alignments is forked from unicode-normalization, with the main change being the addition of alignment information. This change was made in a quite intrusive fashion, but it did so without making further assumptions. Given that Hugging Face has to deal with all kinds of text, I would be surprised if they had a restrictive use case.

In light of this, do you think it's even possible to achieve a similar goal without modifying the .iter implementation?

j-luo93 avatar Aug 21 '24 18:08 j-luo93

I created https://github.com/unicode-org/icu4x/issues/5577 for further discussion. I will close this PR since it doesn't work.

sffc avatar Sep 23 '24 18:09 sffc