Feature request: Unicode properties

Open data-man opened this issue 2 years ago • 1 comment

Owl is awesome, thank you!

My proposals:

range(cp1, cp2) or range[cp1, cp2] - cp1 and cp2 are codepoints here (hex or decimal)
block(name) - Unicode's block name (Basic_Latin, Latin-1_Supplement, etc.)
property(name) - Unicode's property name or General_Category value (White_Space, Hyphen, Ps, Mn, etc.)
script(name) - Unicode's script name (Common, Latin, etc.)

What do you think?

Sep 19 '23 11:09 data-man

Something like this would be possible, but at the moment, whitespace is allowed between any two tokens. For example, if you had a rule like ident = property(ID_Start) property(ID_Continue)*, each property(...) match would be a separate token, so identifiers would include things like abc but also a b c d. The best way to make custom identifiers right now is via user-defined tokens, which involves writing a bit of code in a C function and passing it to the generated parser to use during tokenization.
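
As a rough illustration of the matching logic such a C function might contain, the sketch below returns the byte length of the longest identifier at the start of a UTF-8 string, with no whitespace allowed inside it. The names decode_utf8, is_id_start, is_id_continue, and identifier_length are hypothetical. The two predicates only approximate ID_Start and ID_Continue with a few hard-coded ranges, and the decoder does not reject overlong encodings. The glue that passes such a function to the generated parser is not shown, since it depends on the header Owl generates.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence at s. Stores the code point in *cp and returns
   the number of bytes consumed, or 0 at end of string or on malformed input.
   Simplified: overlong encodings and surrogates are not rejected. */
static size_t decode_utf8(const char *s, uint32_t *cp) {
    const unsigned char *p = (const unsigned char *)s;
    size_t n;
    if (p[0] == 0) return 0;
    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    if ((p[0] & 0xE0) == 0xC0) { n = 2; *cp = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { n = 3; *cp = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { n = 4; *cp = p[0] & 0x07; }
    else return 0;
    for (size_t i = 1; i < n; i++) {
        /* A NUL terminator fails this check, so we never read past it. */
        if ((p[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (p[i] & 0x3F);
    }
    return n;
}

/* ASSUMPTION: a small stand-in for ID_Start (ASCII letters plus the Latin
   letters in U+00C0..U+024F). A real version would use Unicode data tables. */
static int is_id_start(uint32_t cp) {
    if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')) return 1;
    return cp >= 0x00C0 && cp <= 0x024F && cp != 0x00D7 && cp != 0x00F7;
}

/* ASSUMPTION: a small stand-in for ID_Continue (the above plus ASCII digits,
   underscore, and combining marks in U+0300..U+036F). */
static int is_id_continue(uint32_t cp) {
    if (is_id_start(cp)) return 1;
    if ((cp >= '0' && cp <= '9') || cp == '_') return 1;
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Byte length of the identifier at the start of s, or 0 if there is none.
   Whitespace is never skipped, so it always ends the identifier. */
size_t identifier_length(const char *s) {
    uint32_t cp;
    size_t n = decode_utf8(s, &cp);
    if (n == 0 || !is_id_start(cp)) return 0;
    size_t len = n;
    while ((n = decode_utf8(s + len, &cp)) != 0 && is_id_continue(cp))
        len += n;
    return len;
}
```

Under these assumptions, identifier_length("abc") returns 3, while identifier_length("a b c d") returns 1: the space ends the identifier instead of being skipped between tokens.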

Sep 20 '23 03:09 ianh