re1.5
Contributing to your re1.5 project
Hi Paul, for another project that I contribute to, I need a flexible regexp package that I can easily adapt to my needs (natural-language sentence tokenizing). It looks like "re1" is appropriate for that. Since I liked your modified VM code more than the original one, I plan to base my modifications on your re1.5.
Before I add the very specific features that I need, I plan to add some more general ones that I also need. Some of my changes may seem too specific for your purposes, but maybe some are not. While preparing my specific version, I can send you pull requests for the features that you think are (or may be) appropriate for your re1.5.
Here is what I plan to do (some of it I have already done, as indicated):
- Add flags to main() [done]
usage: re [-hmdv] [-e ENGINE] <regexp> <string>...
-h: Print help message and exit
-m: String is anchored
-e ENGINE: Specify one of: recursive recursiveloop backtrack thompson pike
-d: Print debug messages
-t: Print VM trace messages
The rationale was to ease the debug cycle.
- Add CFLAGS for debug. Especially, enable ASAN checks. [done]
- Remove re1_5_sizecode() altogether. [done] Instead, I use _compilecode() for code sizing as well. The rationale was to avoid having to make parallel changes in re1_5_sizecode() whenever features are added to _compilecode(), just to keep the two in agreement about the generated code size.
- Change the API of _compilecode() to return the code (or NULL in case of a failure). [done] The idea is to be similar to existing regex packages, in which you can pre-compile a list of regexp strings and use the compiled ones whenever needed, like objects. But I guess such changes may be inappropriate for your project.
- Add the ability to generate absolutely all matches, not only the first one. [partially done] Such a feature is needed for my said other project. Most (if not all) of the existing free regex packages lack it (although it can be done, with some effort, using PCRE). For now I have implemented it only for "backtrack", and I still need to write an API for it (the implementation for the other methods is similar).
- Add non-capture groups (?:). [done]
- Add the ability to keep track of all group captures, not only the last capture for each group. [in progress] This is also a feature I must have for my other project. AFAIK, only .NET has such a feature (see: http://www.rexegg.com/regex-csharp.html#quantgroups)
- From your TODO: Support for repetition operator {n} and {n,m}. I already figured out a way to implement that, but I'm not sure it is the ideal one. The idea is to add count "registers" (similar to the "sub" implementation) and new VM commands to inspect them and jump on them.
- From your TODO: Support for Unicode (UTF-8). This is also a must-have feature for my said project.
- Support using ] as a first character in a class. [done]
- Support magic removal for character x by \x, in any place. [partially done]
- Support 0xHH / 0x{HHHH}.
- A more flexible design for named classes. My idea is to use a table like:
'd' "0-9"
'w' "A-Za-z0-9_"
's' "\x20\x09\x0A\x0C"
't' "\t"
...
that will be used by the code generator, also for things like [\d\s], instead of hard-coding all classes in _re1_5_namedclassmatch(). This way it will be easy to add any additional desired class. (Such single-char definitions, like \t, will of course generate char and not class.)
BTW, currently \s in re1.5 is missing 0x0C (formfeed).
- Remove the original regex parse and code generation.
(To avoid carrying obsolete code, or having to maintain it whenever new functionality is added.)
BTW, in the current program, using ?: anywhere in the regexp (when compiled with DEBUG) gives an error, because the old code has only incomplete support for it.
- Free all allocated memory. [partially done] Currently some memory remains allocated after the functions return.
- Make it thread-safe, at least while matching.
(This is a must for my said other project.)
Currently the global variable Sub *freesub is problematic in this regard.
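To illustrate the thread-safety point, here is a rough sketch of one possible direction: moving the Sub free list out of the global and into a per-match context that each thread owns. This is only an assumption about how it could look, not code from re1.5. The names MatchCtx, ctx_init, ctx_newsub, ctx_decref and ctx_free are hypothetical, as is the separate next field (re1 chains free entries through sub[0] instead), and MAXSUB is an assumed capacity.

```c
#include <assert.h>
#include <stdlib.h>

#define MAXSUB 20                /* assumed capture capacity */

typedef struct Sub Sub;
struct Sub {
    int ref;                     /* reference count */
    int nsub;                    /* number of capture slots in use */
    const char *sub[MAXSUB];     /* capture pointers into the input */
    Sub *next;                   /* free-list link (hypothetical field) */
};

/* Hypothetical per-match context: one per thread, replaces the global freesub. */
typedef struct MatchCtx {
    Sub *freesub;
} MatchCtx;

static void ctx_init(MatchCtx *ctx)
{
    ctx->freesub = NULL;
}

/* Take a Sub from this context's free list, or allocate a new one. */
static Sub *ctx_newsub(MatchCtx *ctx, int n)
{
    Sub *s = ctx->freesub;
    if (s != NULL)
        ctx->freesub = s->next;
    else if ((s = malloc(sizeof *s)) == NULL)
        return NULL;
    s->ref = 1;
    s->nsub = n;
    s->next = NULL;
    return s;
}

/* Drop one reference; recycle the Sub into this context's free list at zero. */
static void ctx_decref(MatchCtx *ctx, Sub *s)
{
    assert(s->ref > 0);
    if (--s->ref == 0) {
        s->next = ctx->freesub;
        ctx->freesub = s;
    }
}

/* Release everything still held on the free list. */
static void ctx_free(MatchCtx *ctx)
{
    while (ctx->freesub != NULL) {
        Sub *s = ctx->freesub;
        ctx->freesub = s->next;
        free(s);
    }
}
```

With something like this, each matching call would create its own MatchCtx (or receive one from the caller) and pass it down, so no state is shared between threads. Calling ctx_free() at the end would also cover the "free all allocated memory" item for the Sub objects.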
If you indicate which of these features you would also like in your code, I will try to isolate each one and send you a pull request.
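To make the named-class table idea from the list above more concrete, here is a small sketch under some assumptions. The NamedClass struct and the functions find_named_class, spec_matches and named_class_match are hypothetical names, not existing re1.5 code. In the real design the code generator would expand the spec into class (or char) instructions at compile time; the run-time matcher below is only there to show what the table entries mean.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table entry: the letter after the backslash and its class body. */
typedef struct NamedClass {
    char name;           /* e.g. 'd' for \d */
    const char *spec;    /* single characters and a-b ranges, as inside [...] */
} NamedClass;

static const NamedClass named_classes[] = {
    { 'd', "0-9" },
    { 'w', "A-Za-z0-9_" },
    { 's', "\x20\x09\x0A\x0C" },   /* space, tab, line feed, form feed */
    { 't', "\t" },                 /* single char: would generate char, not class */
};

/* Look up the class body for a name; NULL if the name is not in the table. */
static const char *find_named_class(char name)
{
    size_t n = sizeof named_classes / sizeof named_classes[0];
    for (size_t i = 0; i < n; i++)
        if (named_classes[i].name == name)
            return named_classes[i].spec;
    return NULL;
}

/* Return 1 if character c is covered by the class body, else 0. */
static int spec_matches(const char *spec, int c)
{
    const unsigned char *p = (const unsigned char *)spec;
    while (*p != '\0') {
        if (p[1] == '-' && p[2] != '\0') {   /* range a-b */
            if (c >= p[0] && c <= p[2])
                return 1;
            p += 3;
        } else {                             /* single character */
            if (c == p[0])
                return 1;
            p++;
        }
    }
    return 0;
}

/* Return 1 on match, 0 on no match, -1 if the class name is unknown. */
static int named_class_match(char name, int c)
{
    const char *spec = find_named_class(name);
    return spec == NULL ? -1 : spec_matches(spec, c);
}
```

Adding a new class would then be a one-line change to the table, and the same lookup could serve both a bare \d and a use inside brackets such as [\d\s].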