Enhance Regex Support in Standard Library for JS Backend
This pull request significantly improves regex support in Effekt's standard library for the JavaScript backend. It introduces capture groups, group indices, and regex flags, addressing current limitations and providing a more powerful regex interface.
Key Enhancements:
- Support for capture groups and their indices
- Implementation of regex flags
- Stateful regex object for sequential matching
Current Limitations: Effekt's regex support is currently limited, lacking support for capture groups and advanced regex features. This implementation bridges this gap for JavaScript environments.
New Features:
- Capture Groups: Ability to use and access capture groups in regex patterns
- Group Indices: Retrieve the starting index of each capture group
- Regex Flags: Support for standard regex flags (e.g., 'i' for case-insensitive matching)
- Stateful Matching: Global flag ('g') set by default for sequential matching
Backwards compatibility This change is not backwards compatible.
Feedback on the API design and usability from experienced Effekt developers is highly appreciated.
The tests are failing since examples/casestudies/lexer.effekt.md uses text/regex.
Here are some more modules importing text/regex: https://github.com/search?q=repo%3Aeffekt-lang%2Feffekt+%22text%2Fregex%22+path%3Aexamples%2F**%2F*.effekt&type=code
The tests are failing since
examples/casestudies/lexer.effekt.mdusestext/regex. Here are some more modules importingtext/regex: https://github.com/search?q=repo%3Aeffekt-lang%2Feffekt+%22text%2Fregex%22+path%3Aexamples%2F**%2F*.effekt&type=code
I refactored the lexer example to account for the changes. The other occurrences shouldn't be affected since I'm only changing the JS backend? Please correct me if I oversaw anything.
Thank you for your thorough review. I've implemented your suggestions, making the code more idiomatic. The addition of typed flags should notably enhance usability. I plan to add the test cases tomorrow.
I ran into some problems designing the test. While the LSP server and I cannot find any error in the code, the compiler fails with the error [error] Cannot typecheck call. Unfortunately it doesn't specify the failed type check. I experimented now for one hour and could not get it working.
@jiribenes If you see any obvious problems why it doesn't type check I really would appreciate your help :) Otherwise I will just experiment around until I get a more expressive error :/
I'll take a look. This error usually means that UFCS (foo.bar(baz) standing for bar(foo, baz)) failed.
Is there a way to run certain tests only for the JS-Backend? I would like to add test for the new features, but it seems that the chez-backend is not happy with that. At least I have no trouble running the test succesful on my machine using only the JS-Backend.
It looks like this test even fails on JS:
https://github.com/effekt-lang/effekt/actions/runs/10919082132/job/30306075187?pr=592#step:11:402
Is there a way to run certain tests only for the JS-Backend? I would like to add test for the new features, but it seems that the chez-backend is not happy with that. At least I have no trouble running the test succesful on my machine using only the JS-Backend.
You can add the file into ignored here for the backends it should not run on together with a comment why it's disabled: https://github.com/effekt-lang/effekt/blob/master/effekt/jvm/src/test/scala/effekt/StdlibTests.scala
So for example for Chez-$something backends:
abstract class StdlibChezTests extends StdlibTests {
override def ignored: List[File] = List(
// Not implemented yet
examplesDir / "stdlib" / "bytes",
examplesDir / "stdlib" / "io",
// Not implemented: advanced regex features
examplesDir / "stdlib" / "string" / "regex.effekt",
)
}
It looks like this test even fails on JS:
https://github.com/effekt-lang/effekt/actions/runs/10919082132/job/30306075187?pr=592#step:11:402
It seems to work for me locally so it might be a NodeJS version issue? Did the API for regexes change there?
EDIT: it seems that the last part of the test labelled "capture groups" throws some error somewhere.
EDIT 2: I think that NodeJS 12.x does not support the d (GenerateIndices()) flag since it was only introduced in ES9 and implemented in NodeJS 16.x according to MDN.
Don't know how to resolve this if we want to keep the functionality. We could update NodeJS: that might be a wise move anyway since even the extended support for NodeJS 12.x ended 2 years and 4 months ago (30 Apr 2022). Current oldest somewhat supported LTS version is NodeJS 18.x (where security support ends in 7 months). [I'd also be fine with updating to "just" NodeJS 16.x]
EDIT 2: I think that NodeJS 12.x does not support the
d(GenerateIndices()) flag since it was only introduced in ES9 and implemented in NodeJS 16.x according to MDN.
I just tested it locally and can confirm that. I will try to add a condition that the d flag will only works if the node version is >16.x and change the test to only run if node is >16.x
Do you see any problems with that?
We can just require a newer node version, that shouldn't be a problem.
Thanks for the PR! I think it is great to have improved regex support in the language.
I have one major concern, which is the compatibility with the LLVM backend. In general, I do not care so much anymore about the Chez and ML backends. However, we should aim for feature completeness of the LLVM backend.
Adding JS-only features now will make this more and more difficult in the future.
@phischu what is your opinion when viewing this from the LLVM perspective?
Which of the extensions that are proposed in this PR are compatible with the regular expressions in POSIX?
(https://www.man7.org/linux/man-pages/man3/regex.3.html)
master is updated to NodeJS 16.x via #599, you can just revert the last commit and rebase :)
To clarify: I am not proposing that you should implement the LLVM part. I just want to know what the intersection between posix regex and your proposed features is.
Which of the extensions that are proposed in this PR are compatible with the regular expressions in POSIX?
(https://www.man7.org/linux/man-pages/man3/regex.3.html)
The POSIX standard for Regex does only include basic support for capture groups. Only the extraction of the content is possible here, but generating indices for those is not. Stateful matching is definitely not known in the POSIX world.
Flags are a bit more difficult in compatibility. Usually flags are passed as additional command line arguments which makes it dependent on the used program if they are compatible or not.
Does that answer the question?
u can just revert the last commit and rebase
Thank you :)
I have implemented the following changes as suggested previously:
- The Global Flag is now handled directly. The
execfunction returns a Regex object that captures the string in thecapturefield. This allows subsequent calls on the returned object to iterate (lazily) through all matches. A shorthand formapon the Regex Object is provided, which maps directly onto the capturedMatchfor an ergonomic API. This also eliminates the need for a global oriomodifier to prevent inlining. Both attributes previously prevented the use of regex queries in areas where these effects are not allowed. - The examples have been added to the tests and verified to work as expected.
- The
GenerateIndicesflag has been removed as an option and is now enabled by default.
Please feel free to share your thoughts on the current design.
Currently, the tests fail for both LLVM and the Chez backend because they lack the flags interface. To proceed, I will disable the regex tests on those backends. As long as the regex engine of a backend supports flags, implementing those should be straightforward, as the only stateful Flag (Global) is now directly expressed in Effekt.
I'll unassign myself, but let's keep this PR open for now.