
LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

Open magibney opened this issue 3 years ago • 25 comments

Continuation of: lucene-solr PR #892 See also: LUCENE-8972

#10015

magibney avatar Mar 12 '21 13:03 magibney

Thanks for updating the PR! It's late tonight, but I will look at this this week. I think I haven't looked since the patch, so I am sure there are some "interesting" things that had to be done for any kind of good performance... Sorry if I ask dumb questions you have already answered!

I think it's good to have the charfilter, especially for the more efficient transforms like XYZ>Latin... I think we should mainly optimize the transliterator code for those types of transforms.

The JIRA issue discusses stuff like Traditional->Simplified as the use-case, but I am not sure I would make major sacrifices to try to speed up Traditional->Simplified. At least in the past this one was slow as a transliterator, but looking at the rules, maybe it really shouldn't be a transliterator at all: https://github.com/unicode-org/cldr/blob/master/common/transforms/Simplified-Traditional.xml. This file is like 4k lines of simple mappings (longest match), and out of thousands, only 6 context-based rules are used (ScDigit: 16 chars in the set). So as an alternative, we could process that file, expand those 6 rules (resulting in 66 additional "lines"), and produce a text file suitable for MappingCharFilter. It could be part of the ICU upgrade automation in the gradle build so that it gets regenerated with new ICU versions.
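
As a rough sketch of what the consuming side might look like, assuming the CLDR file has already been flattened into simple source/target pairs (the two mappings below are just illustrative single-character examples, not taken from the file):

```java
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

public class TraditionalToSimplifiedSketch {
  public static void main(String[] args) throws Exception {
    // In the real build this map would be loaded from the generated mappings file;
    // the two entries here are just illustrative single-character pairs.
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("\u9AD4", "\u4F53"); // 體 -> 体
    builder.add("\u5B78", "\u5B66"); // 學 -> 学
    Reader reader = new MappingCharFilter(builder.build(), new StringReader("\u5B78\u9AD4"));
    char[] buf = new char[16];
    int len = reader.read(buf, 0, buf.length);
    System.out.println(new String(buf, 0, len)); // 学体
  }
}
```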

I can see why disabling normalization in the output might improve performance here -- why normalize twice? But it's a bit scary to provide such an option that makes things "faster" but might result in crazy output and confuse users, or even make tokenizers behave worse, if they don't understand what it means. Even then it's still problematic, as ICUNormalizer2CharFilter has buggy offset handling that the tests will find if you re-enable testRandom and run it enough times: https://issues.apache.org/jira/browse/LUCENE-5595. So I'm not a fan of e.g. returning unexpected decomposed output from this thing right now. For which transforms does this optimization impact performance, and by how much?

rmuir avatar Mar 15 '21 06:03 rmuir

Thanks for looking at this, @rmuir! And no worries wrt "questions already answered" -- it's been long enough that this all feels fresh to me again, for better or worse :-)

The Traditional->Simplified use case is a good example of where this type of functionality (however accomplished) is really necessary, because of the significance of dictionary-based tokenization for these scripts. It sounds promising to map the transliteration file onto a file for MappingCharFilter. Based on what you say, it looks like that would be viable for Traditional->Simplified. I think that approach would also address some of the weird issues this PR had to work around wrt offset resolution in composite transliterators. I wonder whether the same approach could apply to other scripts that commonly use dictionary-based tokenization?

Perhaps this is what you were suggesting, but if the "MappingCharFilter" approach were used as an optimization, while still preserving a generic ICUTransformCharFilter directly backed by ICU Transliterator, it'd be possible to test for consistency between the two approaches, and to keep the "vanilla" ICUTransformCharFilter available for unanticipated use cases, etc.

True, assumeExternalUnicodeNormalization gets you into "shoot yourself in the foot" territory for sure, but it results in a roughly 4x performance improvement (based on some naive but, I think, representative benchmarks). This comment describes the approach, and the comment immediately following confirms results (using Cyrillic-Latin as an example). Regarding which transformations this applies to (from the first linked comment):

NFC, as a trailing transformation step, is both very common and very active -- active in the sense that it will in many common contexts block output waiting for combining diacritics for literally almost every character

I'm inclined to think the significant performance gain for such a common case is worth it -- as a user I'd certainly not want that type of functionality hidden from me. I wonder if there's a way to "have cake and eat it too" ...

magibney avatar Mar 15 '21 16:03 magibney

I'm inclined to think the significant performance gain for such a common case is worth it -- as a user I'd certainly not want that type of functionality hidden from me. I wonder if there's a way to "have cake and eat it too" ...

Another option is to improve them in CLDR itself. I have reported bugs in the area before: https://unicode-org.atlassian.net/browse/CLDR-2348

I doubt performance was considered much when authors wrote these rules. I'll pick on one of these I contributed: https://github.com/unicode-org/cldr/blob/master/common/transforms/Maldivian-Latin-BGN.xml

It converts to NFD, but this may be a no-op, as I'm not sure anything accepted by the filter really decomposes, and there's no accent reordering here. I can't for the life of me remember why I put that NFD there :)

rmuir avatar Mar 15 '21 17:03 rmuir

Yeah; it's been a while since I thought about this, but I think there's an inherent challenge in the way Transliterators are composable (can be chained together) but also kind of atomic/black-box (they assume nothing about input or output, and so internally do whatever manipulation they need in order to get the text in/out of a form they're designed to work with). IIRC the "black box" Transliterators that internally construct composite Transliterators don't have their component parts cleanly exposed, so if one wants the functionality provided by a given Transliterator, the only (practical) option is to accept whatever higher-level baggage is bundled with it.

More accurate offsets, and probably better performance, could I think be achieved by "decomposing" composite transliterators (and some fancy footwork). But I remember feeling convinced that it would be really hard to do this "right" (see the last comment on the parent PR) -- and that although it probably could be done in Lucene, it would more appropriately be done in ICU. I think this would amount to a substantial change (addition?) to the ICU Transliterator API/implementation, which currently doesn't track offsets at all, nor account for the way filters can kind of "bypass" elements of the decomposed transliterator chain.

magibney avatar Mar 15 '21 21:03 magibney

I looked, and I don't think the normalization optimizations here are safe. Many rulesets assume a certain form for a reason, because the rules work that way. For example, Hangul-Latin works by decomposing into Jamo. Rules are only there for Jamo, and then it composes back. If you remove the normalization rules, Hangul won't get transliterated.
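
A minimal illustration of why those steps matter, assuming ICU4J's Transliterator API (the exact romanization output doesn't matter here; the point is that the NFD/NFC steps are load-bearing, and the second, Jamo-only rule set is a toy example):

```java
import com.ibm.icu.text.Transliterator;

public class HangulLatinDemo {
  public static void main(String[] args) {
    // The stock Hangul-Latin transform decomposes syllables to Jamo (NFD),
    // applies Jamo-to-Latin rules, then recomposes what's left (NFC).
    Transliterator hangulLatin = Transliterator.getInstance("Hangul-Latin");
    System.out.println(hangulLatin.transliterate("\uD55C\uAE00")); // 한글 -> some Latin romanization

    // Without the decomposition step, precomposed syllables never match
    // Jamo-only rules and pass through untouched (toy rule set for illustration):
    Transliterator jamoOnly = Transliterator.createFromRules(
        "Jamo-Only-Demo", "\u1112\u1161\u11AB > han;", Transliterator.FORWARD);
    System.out.println(jamoOnly.transliterate("\uD55C")); // precomposed 한 is left unchanged
  }
}
```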

rmuir avatar Mar 17 '21 02:03 rmuir

Right; the normalization optimizations are not conceived of as a general-purpose thing. I think this comment might be applicable; in particular:

The assumeExternalUnicodeNormalization arg cannot simply be flipped in isolation without affecting behavior. My conception of this arg is as an "expert" option that can significantly improve performance, but requires users to know (and externally compensate for, with specific ICUNormalizer2CharFilters in the top-level analysis chain) the specific type of input and output unicode normalization applied by a particular transliterator.

magibney avatar Mar 17 '21 02:03 magibney

I think it would be best if we removed it from this PR and looked at it as a followup. There are other alternatives that can be done in a safe or transparent way. For example, we can "look" at the transform, and if it wants NFD we can wrap the charstream with an additional normalization filter before/after ourselves.
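
A minimal sketch of that idea, using the existing ICUNormalizer2CharFilter and ICU4J's Normalizer2; the helper name and the boolean flag are hypothetical:

```java
import java.io.Reader;
import com.ibm.icu.text.Normalizer2;
import org.apache.lucene.analysis.icu.ICUNormalizer2CharFilter;

public class NormalizationHoistSketch {
  /**
   * Hypothetical helper: if inspection of the transform shows a leading ::NFD;
   * step, hoist it out of the transliterator and apply it with a dedicated
   * CharFilter (which already handles offset correction), transparently to the user.
   */
  static Reader maybeHoistLeadingNfd(Reader in, boolean transformWantsNfd) {
    if (transformWantsNfd) {
      return new ICUNormalizer2CharFilter(in, Normalizer2.getNFDInstance());
    }
    return in;
  }
}
```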

rmuir avatar Mar 17 '21 11:03 rmuir

My conception of this arg is as an "expert" option that can significantly improve performance, but requires users to know (and externally compensate for, with specific ICUNormalizer2CharFilters in the top-level analysis chain) the specific type of input and output unicode normalization applied by a particular transliterator.

No, most users using this stuff have brains of rocks. We can't have unsafe traps like this.

rmuir avatar Mar 17 '21 11:03 rmuir

@magibney I think this is really close to being merged? When testing against current main, the new static analysis checks complain about a piece of dead code in the Factory (I added a comment). I also would like to still rename that optimizeForCommonCase to something more descriptive such as hoistFilter or similar. Finally, the new test (testNormalizationOptimizationOnAvailableIDs) should be marked @Nightly because it takes a very long time (not your fault, just the reality of what it is doing). It will still do what we need, as it will still detect any problems on ICU upgrade.

rmuir avatar Mar 24 '21 14:03 rmuir

Great! Yes, this is looking pretty good from my perspective too. (I forgot to address the hoistFilter method rename -- thanks for catching that.)

I'm curious what you'll make of 0d8c001 ... the previous state was kind of weird because we were ostensibly "detecting" norm ids that never actually cropped up in practice, but then throwing an UnsupportedOperationException if we had ever come to the point of trying to replace them. This worked only because they never cropped up in practice. I'm pretty sure that the change introduced in 0d8c001 would work fine, but at the moment it's definitely not covered by tests.

Alternatives to 0d8c001 would be:

  1. stop detecting the strings FCC, FCD, and NFKC_CF (i.e. don't recognize them as candidates for replacement/optimization)
  2. continue detecting FCC/FCD/NFKC_CF and leave in place the new code that purports to handle them, but assert (either in a test over availableIDs, or in live code so that the check would be a noop unless assertions were enabled) that we should never actually encounter these types of leading/trailing normalization.

magibney avatar Mar 24 '21 18:03 magibney

I'm curious what you'll make of 0d8c001 ... the previous state was kind of weird because we were ostensibly "detecting" norm ids that never actually cropped up in practice, but then throwing an UnsupportedOperationException if we had ever come to the point of trying to replace them. This worked only because they never cropped up in practice. I'm pretty sure that the change introduced in 0d8c001 would work fine, but at the moment it's definitely not covered by tests.

Alternatives to 0d8c001 would be:

1. stop detecting the strings FCC, FCD, and NFKC_CF (i.e. don't recognize them as candidates for replacement/optimization)

Either the current code or option 1 is fine. Honestly, no one will ever use these.

Users use normalization in their rules because they want "one" rule to capture the transformation, e.g.:

# alef w/ hamza below
\u0625 -> i;
  • they don't want to write duplicate rules to handle the decomposed case (e.g. 0627 + 0655)
  • they don't want an explosion of rules to handle diacritics ordering (NFC/NFD enforce an order by combining class)

So in most cases, NFC or NFD is useful. Whether the person picks NFC or NFD depends more on what the standard/rulesystem is supposed to do and how the writing system works, or in some cases it may just be arbitrary. In the case of Korean, if we use NFD and work on Jamo, it will be a tiny number of rules (working on characters, like an alphabet). But if we use NFC we would need like 11,000 rules, one for each syllable.

For some writing systems, there may be legacy compatibility cases, designed just for round-tripping back to old charsets. In our Arabic example here, these exist, and you might see them if you extract from PDF (e.g. FE87, FE88). So in these cases, NFKC or NFKD is a better choice.
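
A small runnable illustration of both points, assuming ICU4J's rule API (the transliterator ID and the single rule below are made up for illustration): with a leading ::NFKC; step, one rule written against the precomposed character also covers the decomposed sequence and the legacy presentation form.

```java
import com.ibm.icu.text.Transliterator;

public class NormalizationInRulesDemo {
  public static void main(String[] args) {
    // One rule for precomposed U+0625 (alef with hamza below); the leading
    // ::NFKC; step composes decomposed input and folds legacy presentation
    // forms before the rule is consulted.
    Transliterator demo = Transliterator.createFromRules(
        "Arabic-Demo", "::NFKC; \u0625 > i;", Transliterator.FORWARD);
    System.out.println(demo.transliterate("\u0625"));       // precomposed -> "i"
    System.out.println(demo.transliterate("\u0627\u0655")); // decomposed alef + hamza below -> "i"
    System.out.println(demo.transliterate("\uFE87"));       // isolated presentation form (as from PDF) -> "i"
  }
}
```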

But nobody will need fast c&d (this is for collation, I think?) or nfkc_cf here (usually if there is capitalization, you tend to preserve that in the rules).

rmuir avatar Mar 24 '21 19:03 rmuir

Ah, I see. Thanks @rmuir for the extra context wrt FCC/FCD/NFKC_CF. I recall when I checked to see what was actually there it was exactly as you said: mostly NFC/NFD, a handful of NFKC/NFKD (and as I mentioned, none of the others).

I went ahead and removed FCC/FCD/NFKC_CF in c8f7109 and left a comment explaining why they're not there. Attempting to detect them does come at some cost (however negligible), and leaving them in could also be confusing to people (myself included) who aren't familiar with the context you explain above.

magibney avatar Mar 24 '21 20:03 magibney

Thank you, ICU side looks good! Let's just take care of BaseTokenStreamTestCase and then I think we are ready to merge. I'm sorry this PR has taken so embarrassingly long. Thank you @magibney for your persistence!

rmuir avatar Mar 25 '21 01:03 rmuir

FYI, this is requiring one last burst of persistence! Everything's coming together, but not quite ready yet ... just checking in to let you know that my silence doesn't indicate I've dropped the ball on this.

magibney avatar Mar 26 '21 17:03 magibney

I got the precommit "working" by just disabling a bunch of build checks with corresponding TODO in the source code, reducing visibility of some stuff that didn't need to be public, etc.

I haven't really looked at the code yet, best to start with the automated checks.

Looks to me like removing the old transform impl/tests would really simplify the process too.

rmuir avatar May 05 '21 01:05 rmuir

Ugh, and I guess that spotlessApply really made some of the code ugly, especially comments. Maybe we can manually wrap them in a way that the spotless checker still accepts. Sorry, was just trying to get through the gauntlet of all the build checks...

rmuir avatar May 05 '21 01:05 rmuir

Yeah; the heavy hand of spotlessApply was the main reason I didn't fuss with getting the precommit checks to pass. I understand if you want to wait for the build checks to pass before digging into this, and would be happy to (as you suggest) work that out manually.

magibney avatar May 05 '21 01:05 magibney

Yes, it is much easier for me to help out if the build and tests are working; I can't really review otherwise because I rarely write Java these days. So to suggest something I usually have to test it out locally.

rmuir avatar May 05 '21 01:05 rmuir

That makes sense; apologies for the rough state wrt precommit (though fwiw the tests have been my focus, and those should be solid). I'll get precommit passing with any necessary comment formatting handled manually.

Unless you suggest otherwise I'll also rip out all the "rollback"-approach stuff (related to the original approach taken in this PR). It was helpful during development to have that as a point of reference, but it ultimately should not be committed, and at this point I'm confident enough in the streaming approach that the "rollback" stuff has probably outlived its usefulness (and it'll be in the commit history if anyone feels a need to crosscheck against it).

magibney avatar May 05 '21 02:05 magibney

The tests are failing for me locally too. Mostly it seemed to be the previous implementation's tests? It does assertEquals(AnalysisResult a, AnalysisResult b), but AnalysisResult has no equals()...

rmuir avatar May 05 '21 02:05 rmuir

Ah, sorry! Yeah, now that you mention it, I'm afraid I'm not surprised. I'm going to just remove the previous impl (as you suggested would make things clearer). I think that's the right way to go, and the new impl tests should definitely be solid.

magibney avatar May 05 '21 02:05 magibney

@uschindler FYI all modifications to BaseTokenStreamTestCase (from earlier iterations of this PR) have now been reverted.

magibney avatar May 05 '21 14:05 magibney

Quick status update: comments are formatted properly, precommit check passes (apologies again, and thanks to all who weighed in on that).

Looks to me like removing the old transform impl/tests would really simplify the process too.

I removed the old transform impl. Removal does indeed simplify the process, at the expense of having a useful way to cross-check offset correction against a completely different offset impl. I think this is fine, but we probably can/should compensate by filling in expectedOffsets where possible for any tests with non-randomized input. Some of the explicit input is manually constructed and already tests offsets specifically; but many of the (testBespoke*) tests are edge cases derived from randomly generated data, and as such "expected" offsets are difficult to determine a priori. For these cases I would propose initially bootstrapping expected offsets based on actual offsets, and clearly marking "expected" values generated this way as basically documenting existing behavior, so that we have more "canary" coverage to detect unintended changes. This isn't ideal, but I think it's a good tradeoff, given how difficult it would be to determine canonical "correct" offset correction for randomly derived (though static) input.
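
A plain-Java sketch of the kind of "canary" check described above, with a trivial whitespace tokenizer standing in for the real ICU chain; the offsets here are hand-checkable, whereas in the proposed tests they would instead be captured from the filter's current behavior and frozen:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class OffsetCanarySketch {
  public static void main(String[] args) throws Exception {
    // Stand-in analysis chain; in the real tests this would be the transform
    // char filter feeding a tokenizer.
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("foo bar"));

    // "Expected" offsets bootstrapped from a previous run and then frozen here
    // as a canary against unintended changes in offset correction.
    int[] expectedStart = {0, 4};
    int[] expectedEnd = {3, 7};

    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);
    tokenizer.reset();
    for (int i = 0; tokenizer.incrementToken(); i++) {
      if (offset.startOffset() != expectedStart[i] || offset.endOffset() != expectedEnd[i]) {
        throw new AssertionError("offset drift for token " + term);
      }
    }
    tokenizer.end();
    tokenizer.close();
  }
}
```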

It's worth reiterating to be clear about the status of this PR: "AnyTransliterator" is not yet supported. Once the already-implemented core functionality for "non-dynamic" (non-Any) transliterators is squared away, it should be straightforward to use that as a base to incorporate support for "AnyTransliterator", but I don't want to do that until further feedback/review of the core functionality as currently implemented.

In summary: this should be ready for initial review, unless there's a preference to first incorporate synthetic "expected" offset values (as proposed above). Please let me know what's preferred wrt synthetic "expected" offset values -- I should be able to turn that around quickly if it's desired.

magibney avatar May 17 '21 20:05 magibney

I have not looked closely at this PR but it sounds very useful (enabling ICU transformations pre-tokenization), looks like the requested change from @uschindler was addressed, and precommit looks happy! Maybe we should progress-not-perfection this and push it soon?

mikemccand avatar Oct 09 '22 12:10 mikemccand

@mikemccand I had 2 remaining concerns:

  1. legal: there is a lot of copied/forked ICU code here, and any of that should be done correctly
  2. maintenance: a consequence of the above; it is actually more code than the rest of the existing icu module, thousands of lines.

rmuir avatar Oct 10 '22 06:10 rmuir

Belated thanks for the ping on this issue Mike, and thanks Robert for raising these concerns (and of course for your contributions to this PR already!). I'm sorry it has taken me so long to respond.

Just a quick note to say that this is still on my radar, and I'm still hoping to find time to carry it forward if possible. The next step (though I'm not sure when that'll be) is to take stock of where/how this copies ICU code (I know there's some; I no longer have an easy sense of the extent of it). I'm wondering if the most correct way forward would actually be to pursue some portion of this as a PR to the ICU project itself.

Afaik this is still the only truly streaming (as opposed to incremental-with-backtracking) implementation of ICU transliteration.

magibney avatar Mar 07 '23 15:03 magibney

We should also be careful about introducing complex CharFilters; I consider the current CharFilter API broken after debugging #11976.

see https://github.com/apache/lucene/issues/11976#issuecomment-1328150137

rmuir avatar Mar 08 '23 11:03 rmuir

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions[bot] avatar Jan 09 '24 00:01 github-actions[bot]