interegular icon indicating copy to clipboard operation
interegular copied to clipboard

ensure upper() doesn't increase string length

Open lapp0 opened this issue 1 year ago • 6 comments

Fixes https://github.com/outlines-dev/outlines/issues/773

Problem

In master, interegular uses char.upper(), which can convert one char into two, resulting the set of accepts() and strings() being inconsistent.

 >>> 'ß'.upper()
'SS'

accepts() and strings() inconsistency:

   def test_multichar(self):
       fsm = parse_pattern("(?i:ß)").to_fsm()
       strings = set([s[0] for s in fsm.strings()])
       assert fsm.accepts("ß")
       assert not fsm.accepts("SS")
       assert "ß" in strings
>       assert "SS" not in strings
E       AssertionError: assert 'SS' not in {'SS', 'ß'}

This change ensures if a capitalized character isn't of length 1, the original character is used.

This is consistent with the behavior of re:

>>> print(re.match(r"(?i:ß)", "ß"))
<re.Match object; span=(0, 1), match='ß'>
>>> print(re.match(r"(?i:ß)", "SS"))
None

lapp0 avatar May 23 '24 02:05 lapp0

Is this good to review?

rlouf avatar Jun 05 '24 12:06 rlouf

Looks good, nice catch!

Can you add a link to this PR as a comment in the function? And can similar stuff happen for str.lower? (if not, please note that explictly as a comment next to where _one_char_upper is used)

MegaIng avatar Jun 05 '24 13:06 MegaIng

Hm, this isn't really enough to match stdlib's re behavior: re.fullmatch(low_s, up_s, re.IGNORECASE), where low_s='ß'; up_s='ẞ', i.e. the uppercase version of ß which does exists does match, meaning I can't get the correct information with just str.upper and str.lower.

I am currently trying to figure out if python exposes case folding information somewhere that I could use.

MegaIng avatar Jun 05 '24 16:06 MegaIng

Looks good, nice catch!

Can you add a link to this PR as a comment in the function? And can similar stuff happen for str.lower? (if not, please note that explictly as a comment next to where _one_char_upper is used)

I can make that change. And yes there is a single character for which this is true for str.lower.

>>> {char: char.upper() for char in iterate_utf8_characters() if len(char) != len(char.upper())}
{'ß': 'SS', 'ʼn': 'ʼN', 'ǰ': 'J̌', 'ΐ': 'Ϊ́', 'ΰ': 'Ϋ́', 'և': 'ԵՒ', 'ẖ': 'H̱', 'ẗ': 'T̈', 'ẘ': 'W̊', 'ẙ': 'Y̊', 'ẚ': 'Aʾ', 'ὐ': 'Υ̓', 'ὒ': 'Υ̓̀', 'ὔ': 'Υ̓́', 'ὖ': 'Υ̓͂', 'ᾀ': 'ἈΙ', 'ᾁ': 'ἉΙ', 'ᾂ': 'ἊΙ', 'ᾃ': 'ἋΙ', 'ᾄ': 'ἌΙ', 'ᾅ': 'ἍΙ', 'ᾆ': 'ἎΙ', 'ᾇ': 'ἏΙ', 'ᾈ': 'ἈΙ', 'ᾉ': 'ἉΙ', 'ᾊ': 'ἊΙ', 'ᾋ': 'ἋΙ', 'ᾌ': 'ἌΙ', 'ᾍ': 'ἍΙ', 'ᾎ': 'ἎΙ', 'ᾏ': 'ἏΙ', 'ᾐ': 'ἨΙ', 'ᾑ': 'ἩΙ', 'ᾒ': 'ἪΙ', 'ᾓ': 'ἫΙ', 'ᾔ': 'ἬΙ', 'ᾕ': 'ἭΙ', 'ᾖ': 'ἮΙ', 'ᾗ': 'ἯΙ', 'ᾘ': 'ἨΙ', 'ᾙ': 'ἩΙ', 'ᾚ': 'ἪΙ', 'ᾛ': 'ἫΙ', 'ᾜ': 'ἬΙ', 'ᾝ': 'ἭΙ', 'ᾞ': 'ἮΙ', 'ᾟ': 'ἯΙ', 'ᾠ': 'ὨΙ', 'ᾡ': 'ὩΙ', 'ᾢ': 'ὪΙ', 'ᾣ': 'ὫΙ', 'ᾤ': 'ὬΙ', 'ᾥ': 'ὭΙ', 'ᾦ': 'ὮΙ', 'ᾧ': 'ὯΙ', 'ᾨ': 'ὨΙ', 'ᾩ': 'ὩΙ', 'ᾪ': 'ὪΙ', 'ᾫ': 'ὫΙ', 'ᾬ': 'ὬΙ', 'ᾭ': 'ὭΙ', 'ᾮ': 'ὮΙ', 'ᾯ': 'ὯΙ', 'ᾲ': 'ᾺΙ', 'ᾳ': 'ΑΙ', 'ᾴ': 'ΆΙ', 'ᾶ': 'Α͂', 'ᾷ': 'Α͂Ι', 'ᾼ': 'ΑΙ', 'ῂ': 'ῊΙ', 'ῃ': 'ΗΙ', 'ῄ': 'ΉΙ', 'ῆ': 'Η͂', 'ῇ': 'Η͂Ι', 'ῌ': 'ΗΙ', 'ῒ': 'Ϊ̀', 'ΐ': 'Ϊ́', 'ῖ': 'Ι͂', 'ῗ': 'Ϊ͂', 'ῢ': 'Ϋ̀', 'ΰ': 'Ϋ́', 'ῤ': 'Ρ̓', 'ῦ': 'Υ͂', 'ῧ': 'Ϋ͂', 'ῲ': 'ῺΙ', 'ῳ': 'ΩΙ', 'ῴ': 'ΏΙ', 'ῶ': 'Ω͂', 'ῷ': 'Ω͂Ι', 'ῼ': 'ΩΙ', 'ff': 'FF', 'fi': 'FI', 'fl': 'FL', 'ffi': 'FFI', 'ffl': 'FFL', 'ſt': 'ST', 'st': 'ST', 'ﬓ': 'ՄՆ', 'ﬔ': 'ՄԵ', 'ﬕ': 'ՄԻ', 'ﬖ': 'ՎՆ', 'ﬗ': 'ՄԽ'}
>>> {char: char.lower() for char in iterate_utf8_characters() if len(char) != len(char.lower())}
{'İ': 'i̇'}

lapp0 avatar Jun 05 '24 20:06 lapp0

Actually, don't worry about it, I will implement this in some other way, thank you for making me aware of this issue, and thank you for providing a fix!

I will properly implement the official unicode CaseFolding.txt information. (specifically the simple combination, i.e. CS so that everything stays the same length). That should take care of everything and fully match stdlibs behavior.

MegaIng avatar Jun 05 '24 20:06 MegaIng

Thanks so much for ensuring there is a proper fix @MegaIng!!

lapp0 avatar Jun 05 '24 20:06 lapp0