futaba
futaba copied to clipboard
Prevent attempts at nesting character sets in regex filter
Not sure how I didn't catch this when testing the initial regex feature a couple years ago. Basically, nesting character sets is impossible in Python's regex, which is what my filter code tried to do because it wasn't checking for IN tokens, so those confusable glyphs would be discarded by the regex compiler (thankfully it doesn't outright throw an error). This PR adds that check and performs confusable insertion in place for those sets. Also cleans up some of the code in that area.
Example of the issue:
h[ea]lp
would effectively be converted into this, which is invalid regex:
[h๏ฝโ๐ก๐๐ฝ๐ฑ๐ฅ๐๐๐๐ต๐ฉ๐๐าปีฐแ]**[**[eโฎ๏ฝ
โฏโ
๐๐๐๐ฎ๐ข๐๐๐พ๐ฒ๐ฆ๐๐๊ฌฒะตาฝ][aโบ๏ฝ๐๐๐๐ถ๐ช๐๐๐๐บ๐ฎ๐ข๐๐ษฮฑ๐๐ผ๐ถ๐ฐ๐ชะฐ]**]**[l๐๐๐ฃ๐ญ๐ทI๏ผฉโ
โโ๐๐ผ๐ฐ๐๐๐ด๐จ๐๐๐๐ธฦ๏ฝโ
ผโ๐ฅ๐๐๐๐ต๐ฉ๐๐๐
๐น๐ญ๐ก๐วฮ๐ฐ๐ช๐ค๐๐โฒะำโตแ๊ฒ๐ผจ๐๐][pโด๏ฝ๐ฉ๐๐๐
๐น๐ญ๐ก๐๐๐ฝ๐ฑ๐ฅ๐ฯฯฑ๐๐ ๐๐๐๐๐๐๐บ๐โฒฃั]
instead of this, which is valid and desired:
[h๏ฝโ๐ก๐๐ฝ๐ฑ๐ฅ๐๐๐๐ต๐ฉ๐๐าปีฐแ]**[**eโฎ๏ฝ
โฏโ
๐๐๐๐ฎ๐ข๐๐๐พ๐ฒ๐ฆ๐๐๊ฌฒะตาฝaโบ๏ฝ๐๐๐๐ถ๐ช๐๐๐๐บ๐ฎ๐ข๐๐ษฮฑ๐๐ผ๐ถ๐ฐ๐ชะฐ**]**[l๐๐๐ฃ๐ญ๐ทI๏ผฉโ
โโ๐๐ผ๐ฐ๐๐๐ด๐จ๐๐๐๐ธฦ๏ฝโ
ผโ๐ฅ๐๐๐๐ต๐ฉ๐๐๐
๐น๐ญ๐ก๐วฮ๐ฐ๐ช๐ค๐๐โฒะำโตแ๊ฒ๐ผจ๐๐][pโด๏ฝ๐ฉ๐๐๐
๐น๐ญ๐ก๐๐๐ฝ๐ฑ๐ฅ๐ฯฯฑ๐๐ ๐๐๐๐๐๐๐บ๐โฒฃั]
(** only added for emphasis)