interegular
interegular copied to clipboard
ensure upper() doesn't increase string length
Fixes https://github.com/outlines-dev/outlines/issues/773
Problem
In master, interegular uses char.upper(), which can convert one char into two, resulting the set of accepts() and strings() being inconsistent.
>>> 'ß'.upper()
'SS'
accepts() and strings() inconsistency:
def test_multichar(self):
fsm = parse_pattern("(?i:ß)").to_fsm()
strings = set([s[0] for s in fsm.strings()])
assert fsm.accepts("ß")
assert not fsm.accepts("SS")
assert "ß" in strings
> assert "SS" not in strings
E AssertionError: assert 'SS' not in {'SS', 'ß'}
This change ensures if a capitalized character isn't of length 1, the original character is used.
This is consistent with the behavior of re:
>>> print(re.match(r"(?i:ß)", "ß"))
<re.Match object; span=(0, 1), match='ß'>
>>> print(re.match(r"(?i:ß)", "SS"))
None
Is this good to review?
Looks good, nice catch!
Can you add a link to this PR as a comment in the function? And can similar stuff happen for str.lower? (if not, please note that explictly as a comment next to where _one_char_upper is used)
Hm, this isn't really enough to match stdlib's re behavior: re.fullmatch(low_s, up_s, re.IGNORECASE), where low_s='ß'; up_s='ẞ', i.e. the uppercase version of ß which does exists does match, meaning I can't get the correct information with just str.upper and str.lower.
I am currently trying to figure out if python exposes case folding information somewhere that I could use.
Looks good, nice catch!
Can you add a link to this PR as a comment in the function? And can similar stuff happen for
str.lower? (if not, please note that explictly as a comment next to where_one_char_upperis used)
I can make that change. And yes there is a single character for which this is true for str.lower.
>>> {char: char.upper() for char in iterate_utf8_characters() if len(char) != len(char.upper())}
{'ß': 'SS', 'ʼn': 'ʼN', 'ǰ': 'J̌', 'ΐ': 'Ϊ́', 'ΰ': 'Ϋ́', 'և': 'ԵՒ', 'ẖ': 'H̱', 'ẗ': 'T̈', 'ẘ': 'W̊', 'ẙ': 'Y̊', 'ẚ': 'Aʾ', 'ὐ': 'Υ̓', 'ὒ': 'Υ̓̀', 'ὔ': 'Υ̓́', 'ὖ': 'Υ̓͂', 'ᾀ': 'ἈΙ', 'ᾁ': 'ἉΙ', 'ᾂ': 'ἊΙ', 'ᾃ': 'ἋΙ', 'ᾄ': 'ἌΙ', 'ᾅ': 'ἍΙ', 'ᾆ': 'ἎΙ', 'ᾇ': 'ἏΙ', 'ᾈ': 'ἈΙ', 'ᾉ': 'ἉΙ', 'ᾊ': 'ἊΙ', 'ᾋ': 'ἋΙ', 'ᾌ': 'ἌΙ', 'ᾍ': 'ἍΙ', 'ᾎ': 'ἎΙ', 'ᾏ': 'ἏΙ', 'ᾐ': 'ἨΙ', 'ᾑ': 'ἩΙ', 'ᾒ': 'ἪΙ', 'ᾓ': 'ἫΙ', 'ᾔ': 'ἬΙ', 'ᾕ': 'ἭΙ', 'ᾖ': 'ἮΙ', 'ᾗ': 'ἯΙ', 'ᾘ': 'ἨΙ', 'ᾙ': 'ἩΙ', 'ᾚ': 'ἪΙ', 'ᾛ': 'ἫΙ', 'ᾜ': 'ἬΙ', 'ᾝ': 'ἭΙ', 'ᾞ': 'ἮΙ', 'ᾟ': 'ἯΙ', 'ᾠ': 'ὨΙ', 'ᾡ': 'ὩΙ', 'ᾢ': 'ὪΙ', 'ᾣ': 'ὫΙ', 'ᾤ': 'ὬΙ', 'ᾥ': 'ὭΙ', 'ᾦ': 'ὮΙ', 'ᾧ': 'ὯΙ', 'ᾨ': 'ὨΙ', 'ᾩ': 'ὩΙ', 'ᾪ': 'ὪΙ', 'ᾫ': 'ὫΙ', 'ᾬ': 'ὬΙ', 'ᾭ': 'ὭΙ', 'ᾮ': 'ὮΙ', 'ᾯ': 'ὯΙ', 'ᾲ': 'ᾺΙ', 'ᾳ': 'ΑΙ', 'ᾴ': 'ΆΙ', 'ᾶ': 'Α͂', 'ᾷ': 'Α͂Ι', 'ᾼ': 'ΑΙ', 'ῂ': 'ῊΙ', 'ῃ': 'ΗΙ', 'ῄ': 'ΉΙ', 'ῆ': 'Η͂', 'ῇ': 'Η͂Ι', 'ῌ': 'ΗΙ', 'ῒ': 'Ϊ̀', 'ΐ': 'Ϊ́', 'ῖ': 'Ι͂', 'ῗ': 'Ϊ͂', 'ῢ': 'Ϋ̀', 'ΰ': 'Ϋ́', 'ῤ': 'Ρ̓', 'ῦ': 'Υ͂', 'ῧ': 'Ϋ͂', 'ῲ': 'ῺΙ', 'ῳ': 'ΩΙ', 'ῴ': 'ΏΙ', 'ῶ': 'Ω͂', 'ῷ': 'Ω͂Ι', 'ῼ': 'ΩΙ', 'ff': 'FF', 'fi': 'FI', 'fl': 'FL', 'ffi': 'FFI', 'ffl': 'FFL', 'ſt': 'ST', 'st': 'ST', 'ﬓ': 'ՄՆ', 'ﬔ': 'ՄԵ', 'ﬕ': 'ՄԻ', 'ﬖ': 'ՎՆ', 'ﬗ': 'ՄԽ'}
>>> {char: char.lower() for char in iterate_utf8_characters() if len(char) != len(char.lower())}
{'İ': 'i̇'}
Actually, don't worry about it, I will implement this in some other way, thank you for making me aware of this issue, and thank you for providing a fix!
I will properly implement the official unicode CaseFolding.txt information. (specifically the simple combination, i.e. CS so that everything stays the same length). That should take care of everything and fully match stdlibs behavior.
Thanks so much for ensuring there is a proper fix @MegaIng!!