
Question about character classes

Open ethindp opened this issue 2 years ago • 8 comments

I've got a question: do character classes allow for Unicode general categories? I just tried via the \p{...} syntax and it rewrote it as \\A[\\p{...}]. What's the proper way of doing this? (Also, what escape sequences are accepted? The error reporting for invalid sequences isn't very good.)

ethindp avatar Aug 31 '23 00:08 ethindp

@jcoglan Okay, so I've officially found a bug: specifically, backslashes in character classes are escaped. This makes it impossible to do things like specify actual Unicode characters that are outside the characters you can type on your keyboard. It also forbids full Unicode support in the generated parser (e.g., filtering based on Unicode character classes) without actions doing the filtering for you, and actions currently can't throw exceptions if an error needs to be represented.

ethindp avatar Aug 31 '23 03:08 ethindp

I suspect this is a case of us not supporting the regex-specific syntax you're trying to use -- Canopy's character classes are only intended to support explicit lists and ranges of specific characters, and they happen to be implemented using regex in the current target languages, although this may change.

Can you provide a short example of a complete grammar containing the syntax you're trying to use, along with the language you're compiling the grammar into?

jcoglan avatar Sep 02 '23 20:09 jcoglan

@jcoglan Something like:

identifier_start <- [\p{Ll}\p{Lu}...]

Really any programming language grammar that supports Unicode at the source code level will cause this problem. Currently the solution is to generate Unicode character data, but escape sequences like \uXXXX or \uXXXXXXXX are auto-escaped, at least in the Java language's case, so that wouldn't work either, to my knowledge. (As a side note, you may wish to consider optimizing how alternatives are handled; currently, they're handled via nested conditional statements, and I doubt that compilers/interpreters are able to optimize this well. My recommendation would be to store a list of all the functions you want to call, then loop over those functions and call/check their results. The code is far easier to read and, I suspect, optimizes much better.)

ethindp avatar Sep 02 '23 21:09 ethindp
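
A minimal sketch of the "loop over a list of functions" idea from the side note above, written in Python for brevity. This is not Canopy's generated code: `Result`, `Parser`, `literal`, and `choice` are hypothetical names introduced only for illustration.

```python
from typing import Callable, Optional, Tuple

# Hypothetical result type: (matched_text, new_offset), or None on failure.
Result = Optional[Tuple[str, int]]
Parser = Callable[[str, int], Result]


def literal(text: str) -> Parser:
    """Build a parser that matches one fixed string at the given offset."""
    def parse(source: str, offset: int) -> Result:
        if source.startswith(text, offset):
            return text, offset + len(text)
        return None
    return parse


def choice(*alternatives: Parser) -> Parser:
    """Ordered choice: try each alternative in turn, return the first success."""
    def parse(source: str, offset: int) -> Result:
        for alternative in alternatives:
            result = alternative(source, offset)
            if result is not None:
                return result
        return None
    return parse


# Three alternatives handled by one flat loop instead of nested conditionals.
cased_letter = choice(literal("a"), literal("b"), literal("c"))
```

With this shape, the nesting depth of the code stays constant no matter how many alternatives a rule has. Whether it actually runs faster than nested conditionals in a given target language is something that would need measuring.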

Canopy does not support the \p{...} regex syntax and we have no plans to support it. It is inconsistently supported across our target languages and would be a monumental effort to implement ourselves, not to mention that carrying all the Unicode category data needed to support it would massively inflate the size of the generated code.

As written, the above rule would match any of the characters p, {, L, l, }, u, or a literal dot (.).

That said, I would like to support \u{...} escapes, and I suspect that the implementation being written in JS means there are some quirks around handling of non-BMP chars. I would also like to add support for languages that don't have built-in regex or Unicode support, which would require revisiting how we implement char classes and strings in general.

jcoglan avatar Sep 03 '23 10:09 jcoglan

@jcoglan So the current and only alternative is to add the data ourselves. The problem, of course, is that any \u escapes are auto-escaped, so I'd need to get the actual character. Which may or may not work out well depending on the character in question.

ethindp avatar Sep 03 '23 14:09 ethindp

And even if I do add the actual characters, I usually get stack/recursion overflows during codegen.

ethindp avatar Sep 03 '23 14:09 ethindp

Yeah, \u currently won't work as we don't explicitly support it, so the engine assumes all data is literal and quotes it appropriately for the target language's string syntax. We'd need to add support for \u escapes to make this work and probably change how we implement char classes.

Can you provide a gist of a grammar that causes a stack overflow?

jcoglan avatar Sep 03 '23 16:09 jcoglan

@jcoglan I use ucd-generate to generate UCD tables. These are in Rust, so I have a small utility I wrote that rewrites them in notations I need (e.g. PEG). All you have to do to cause a recursive stack overflow is to put each general category as a rule in a grammar, without using character classes. So something like:

cased_letter <- '\u0061'
    / '\u0062'
    / ...

This is a pretty inefficient method for solving the Unicode problem, but it's a workaround-ish hack until literals can get Unicode escapes.

ethindp avatar Sep 03 '23 21:09 ethindp
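
For reference, a rough sketch of the kind of table-rewriting utility described above, here using Python's standard `unicodedata` module instead of ucd-generate. It collapses the code points of one general category into ranges and emits a single character class of literal characters, rather than one alternative per code point. The helper names `category_ranges` and `char_class` and the rule name `lowercase_letter` are hypothetical. It is also an assumption that Canopy accepts literal non-ASCII characters inside a class; as noted earlier in the thread, non-BMP characters in particular may hit quirks.

```python
import sys
import unicodedata
from typing import Iterable, Iterator, Tuple


def category_ranges(category: str, limit: int = sys.maxunicode) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (first, last) code point ranges whose general category matches."""
    start = None
    for cp in range(limit + 1):
        if unicodedata.category(chr(cp)) == category:
            if start is None:
                start = cp
        elif start is not None:
            yield start, cp - 1
            start = None
    if start is not None:
        yield start, limit


def char_class(ranges: Iterable[Tuple[int, int]]) -> str:
    """Format ranges as a class of literal characters, e.g. [a-z].

    No escaping is done here, so this is only safe for categories that do not
    contain characters with a special meaning inside a class (], -, ^, \\).
    """
    parts = []
    for first, last in ranges:
        parts.append(chr(first) if first == last else f"{chr(first)}-{chr(last)}")
    return "[" + "".join(parts) + "]"


if __name__ == "__main__":
    # Emit one rule for the Ll category, limited to the BMP to sidestep
    # the possible non-BMP issues mentioned in the thread.
    print("lowercase_letter <- " + char_class(category_ranges("Ll", 0xFFFF)))
```

The exact ranges depend on the Unicode version bundled with the Python interpreter, so the output should be regenerated rather than copied between environments.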