Fixes https://github.com/outlines-dev/outlines/issues/773

Problem

In master, interegular uses char.upper(), which can convert one char into two, resulting the set of accepts() and strings() being inconsistent.

 >>> 'ß'.upper()
'SS'

accepts() and strings() inconsistency:

   def test_multichar(self):
       fsm = parse_pattern("(?i:ß)").to_fsm()
       strings = set([s[0] for s in fsm.strings()])
       assert fsm.accepts("ß")
       assert not fsm.accepts("SS")
       assert "ß" in strings
>       assert "SS" not in strings
E       AssertionError: assert 'SS' not in {'SS', 'ß'}

This change ensures if a capitalized character isn't of length 1, the original character is used.

This is consistent with the behavior of re:

>>> print(re.match(r"(?i:ß)", "ß"))
<re.Match object; span=(0, 1), match='ß'>
>>> print(re.match(r"(?i:ß)", "SS"))
None

May 23 '24 02:05 lapp0

Is this good to review?

Jun 05 '24 12:06 rlouf

Looks good, nice catch!

Can you add a link to this PR as a comment in the function? And can similar stuff happen for str.lower? (if not, please note that explictly as a comment next to where _one_char_upper is used)

Jun 05 '24 13:06 MegaIng

Hm, this isn't really enough to match stdlib's re behavior: re.fullmatch(low_s, up_s, re.IGNORECASE), where low_s='ß'; up_s='ẞ', i.e. the uppercase version of ß which does exists does match, meaning I can't get the correct information with just str.upper and str.lower.

I am currently trying to figure out if python exposes case folding information somewhere that I could use.

Jun 05 '24 16:06 MegaIng

Looks good, nice catch!

Can you add a link to this PR as a comment in the function? And can similar stuff happen for str.lower? (if not, please note that explictly as a comment next to where _one_char_upper is used)

I can make that change. And yes there is a single character for which this is true for str.lower.

>>> {char: char.upper() for char in iterate_utf8_characters() if len(char) != len(char.upper())}
{'ß': 'SS', 'ŉ': 'ʼN', 'ǰ': 'J̌', 'ΐ': 'Ϊ́', 'ΰ': 'Ϋ́', 'և': 'ԵՒ', 'ẖ': 'H̱', 'ẗ': 'T̈', 'ẘ': 'W̊', 'ẙ': 'Y̊', 'ẚ': 'Aʾ', 'ὐ': 'Υ̓', 'ὒ': 'Υ̓̀', 'ὔ': 'Υ̓́', 'ὖ': 'Υ̓͂', 'ᾀ': 'ἈΙ', 'ᾁ': 'ἉΙ', 'ᾂ': 'ἊΙ', 'ᾃ': 'ἋΙ', 'ᾄ': 'ἌΙ', 'ᾅ': 'ἍΙ', 'ᾆ': 'ἎΙ', 'ᾇ': 'ἏΙ', 'ᾈ': 'ἈΙ', 'ᾉ': 'ἉΙ', 'ᾊ': 'ἊΙ', 'ᾋ': 'ἋΙ', 'ᾌ': 'ἌΙ', 'ᾍ': 'ἍΙ', 'ᾎ': 'ἎΙ', 'ᾏ': 'ἏΙ', 'ᾐ': 'ἨΙ', 'ᾑ': 'ἩΙ', 'ᾒ': 'ἪΙ', 'ᾓ': 'ἫΙ', 'ᾔ': 'ἬΙ', 'ᾕ': 'ἭΙ', 'ᾖ': 'ἮΙ', 'ᾗ': 'ἯΙ', 'ᾘ': 'ἨΙ', 'ᾙ': 'ἩΙ', 'ᾚ': 'ἪΙ', 'ᾛ': 'ἫΙ', 'ᾜ': 'ἬΙ', 'ᾝ': 'ἭΙ', 'ᾞ': 'ἮΙ', 'ᾟ': 'ἯΙ', 'ᾠ': 'ὨΙ', 'ᾡ': 'ὩΙ', 'ᾢ': 'ὪΙ', 'ᾣ': 'ὫΙ', 'ᾤ': 'ὬΙ', 'ᾥ': 'ὭΙ', 'ᾦ': 'ὮΙ', 'ᾧ': 'ὯΙ', 'ᾨ': 'ὨΙ', 'ᾩ': 'ὩΙ', 'ᾪ': 'ὪΙ', 'ᾫ': 'ὫΙ', 'ᾬ': 'ὬΙ', 'ᾭ': 'ὭΙ', 'ᾮ': 'ὮΙ', 'ᾯ': 'ὯΙ', 'ᾲ': 'ᾺΙ', 'ᾳ': 'ΑΙ', 'ᾴ': 'ΆΙ', 'ᾶ': 'Α͂', 'ᾷ': 'Α͂Ι', 'ᾼ': 'ΑΙ', 'ῂ': 'ῊΙ', 'ῃ': 'ΗΙ', 'ῄ': 'ΉΙ', 'ῆ': 'Η͂', 'ῇ': 'Η͂Ι', 'ῌ': 'ΗΙ', 'ῒ': 'Ϊ̀', 'ΐ': 'Ϊ́', 'ῖ': 'Ι͂', 'ῗ': 'Ϊ͂', 'ῢ': 'Ϋ̀', 'ΰ': 'Ϋ́', 'ῤ': 'Ρ̓', 'ῦ': 'Υ͂', 'ῧ': 'Ϋ͂', 'ῲ': 'ῺΙ', 'ῳ': 'ΩΙ', 'ῴ': 'ΏΙ', 'ῶ': 'Ω͂', 'ῷ': 'Ω͂Ι', 'ῼ': 'ΩΙ', 'ﬀ': 'FF', 'ﬁ': 'FI', 'ﬂ': 'FL', 'ﬃ': 'FFI', 'ﬄ': 'FFL', 'ﬅ': 'ST', 'ﬆ': 'ST', 'ﬓ': 'ՄՆ', 'ﬔ': 'ՄԵ', 'ﬕ': 'ՄԻ', 'ﬖ': 'ՎՆ', 'ﬗ': 'ՄԽ'}
>>> {char: char.lower() for char in iterate_utf8_characters() if len(char) != len(char.lower())}
{'İ': 'i̇'}

Jun 05 '24 20:06 lapp0

Actually, don't worry about it, I will implement this in some other way, thank you for making me aware of this issue, and thank you for providing a fix!

I will properly implement the official unicode CaseFolding.txt information. (specifically the simple combination, i.e. CS so that everything stays the same length). That should take care of everything and fully match stdlibs behavior.

Jun 05 '24 20:06 MegaIng

Thanks so much for ensuring there is a proper fix @MegaIng!!

Jun 05 '24 20:06 lapp0

interegular
interegular copied to clipboard

ensure upper() doesn't increase string length

Problem

interegular interegular copied to clipboard

ensure upper() doesn't increase string length

Problem

interegular
interegular copied to clipboard