diffenator2

Trimming substrings is unsound

Open simoncozens opened this issue 3 years ago • 3 comments

Random thought: you go through the wordlist and remove words that are substrings of another word. I guess the thinking is “rats has all the letters of rat so we don’t need to test rat by itself.” This is probably true for Latin but unsound in general. The easiest way to see why is to imagine your input is Arabic. “rat” has a final t but “rats” has a medial t; the t is doing different things in the two cases, so it’s not correct to use a super-string to “include” a test for a substring. The same goes for anything that does contextual stuff based on letter position, including Latin handwriting fonts…
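For reference, the kind of trimming being discussed can be sketched like this. It is a minimal illustration, not diffenator2's actual code; the function name `trim_substrings` is made up for the example.

```python
def trim_substrings(words):
    """Drop every word that occurs inside a longer word in the list.

    Hypothetical helper for illustration only; it compares raw characters
    and knows nothing about how a font shapes them.
    """
    unique = set(words)
    return sorted(
        w for w in unique
        if not any(w != other and w in other for other in unique)
    )


# "rat" is dropped because it is a character substring of "rats".
print(trim_substrings(["rat", "rats", "cat"]))  # ['cat', 'rats']
```

The comparison is purely on characters, which is exactly why the word-final "t" in "rat" never gets tested on its own.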

I don’t know how much difference this makes in practice given a big enough word list, but I’m not convinced there’s a logical basis for doing it.

simoncozens, Oct 16 '22 21:10

Fair point, but doesn't Arabic have Unicode code points for each positional form though?

I think I'll do some coverage tests (check how many glyph IDs HarfBuzz has seen) to see what the damage is.

m4rc1e, Oct 17 '22 08:10

> Fair point, but doesn't Arabic have Unicode code points for each positional form though?

Yes but no. You will not find text encoded in the "presentation forms"; normally, for each positional form, the Unicode character is the same and the shaper changes the glyph to the positional form. ہہہ is U+06C1 U+06C1 U+06C1.
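A quick way to check this for yourself, as a small sketch using only the Python standard library: the three letters in the example are the same code point repeated, so nothing in the text itself says which positional form each one takes.

```python
import unicodedata

text = "\u06C1\u06C1\u06C1"  # the three-letter example above

# Every character is the same code point; position is not encoded in the text.
codepoints = [f"U+{ord(ch):04X}" for ch in text]
print(codepoints)

# Print the Unicode character name for each distinct character in the string.
for ch in sorted(set(text)):
    print(unicodedata.name(ch))
```

Any initial, medial, or final glyph you see on screen is chosen by the shaper from the font's substitution rules, not by the encoding.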

Even for Latin, you might have a handwriting font which provides a "final form" for "t" but not for "s".

simoncozens, Oct 17 '22 08:10

Instead of removing character substrings, I may try removal based on HarfBuzz glyph ID (gid) sequences.
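Roughly what that could look like, as a sketch under assumptions rather than a finished implementation: `prune_by_gid_sequence` and `toy_shape` are hypothetical names, and the shaper is passed in as a plain callable so the example runs without a font. In practice `shape` would wrap a real HarfBuzz call and return the glyph IDs from the shaped buffer; here glyph names stand in for gids.

```python
def contains_run(haystack, needle):
    """True if `needle` appears as a contiguous run inside `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def prune_by_gid_sequence(words, shape):
    """Drop a word only if its *shaped* glyph sequence occurs inside the
    shaped sequence of a longer word. `shape` maps a string to a sequence
    of glyph identifiers (assumed to come from the shaper).
    Words with identical glyph sequences are all kept in this sketch.
    """
    shaped = {w: tuple(shape(w)) for w in set(words)}
    kept = [
        word
        for word, gids in shaped.items()
        if not any(
            len(other_gids) > len(gids) and contains_run(other_gids, gids)
            for other, other_gids in shaped.items()
            if other != word
        )
    ]
    return sorted(kept)


def toy_shape(word):
    """Stand-in shaper for an imaginary handwriting font that gives a
    word-final "t" its own glyph, "t.fina"."""
    last = len(word) - 1
    return [f"{ch}.fina" if ch == "t" and i == last else ch
            for i, ch in enumerate(word)]


words = ["rat", "rats", "cat", "cats"]
print(prune_by_gid_sequence(words, toy_shape))  # final "t" differs, nothing is dropped
print(prune_by_gid_sequence(words, list))       # no contextual forms, "rat" and "cat" go
```

With the toy shaper, "rat" shapes to a sequence ending in `t.fina`, which never appears inside "rats", so it survives. With a shaper that has no contextual forms, the result matches plain substring trimming.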

m4rc1e, Oct 17 '22 09:10