
use regexp_parser?

Open jaynetics opened this issue 5 years ago • 6 comments

hi!

nice gem! and a really good blog post (which is how i found it).

would you be interested in using regexp_parser for the parsing part?

i know you've probably invested a lot of love into the parsing functions.

on the other hand, more people might be able to benefit from the knowledge you've acquired along the way if you're interested in contributing to regexp_parser -- and perhaps some other gems that can be used on their own.

this could also improve regexp-examples a bit. i had a quick look around, and here are just a few things that regexp_parser handles more correctly or would make easier to implement:

/\u{10FFFF}/.examples     # => NoMethodError; should be ["\u{10FFFF}"]
/\u{61 62}/.examples      # => NoMethodError; should be ["ab"]
/[[:^ascii:]]/.examples   # => []; should be ["\u0080", "\u0081", ...] or so
/\X/.examples             # => ["X"]; should be all kinds of stuff [1]
/(a)\g<1>/.examples       # => easy with regexp_parser's #referenced_expression
/(a)(?(1)b|c)/.examples   # => NoMethodError; doable but complicated [2]
/\0/.examples             # => []; should be ["\u0000"]
/[a-&&]/.examples         # => ["a", "&"]; should be []
/(?u:\w)/.examples        # => NoMethodError; should be unicode word chars
/(?a)[[:word:]]/.examples # => NoMethodError; should be ascii word chars

[1] [2]
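The expected results listed above can be cross-checked against Ruby's own regexp engine. The following is a minimal sketch under stated assumptions: `full_matches` and the `candidates` pool are hypothetical names made up for illustration and are not part of regexp-examples or regexp_parser. It only filters a hand-picked list of strings; it does not generate examples.

```ruby
# Return the candidates that the regexp matches in full.
# `full_matches` is a hypothetical helper, not an API of any gem named here.
def full_matches(regexp, candidates)
  # Wrap the pattern in a non-capturing group so that group numbers
  # (used by \g<1> and (?(1)...)) stay unchanged.
  anchored = /\A(?:#{regexp})\z/
  candidates.select { |string| anchored.match?(string) }
end

# Hand-picked candidate pool (an assumption; a real generator would derive these).
candidates = ["a", "aa", "ab", "ac", "_", "\u0000", "\u0080",
              "\u00E9", "e\u0301", "\u{10FFFF}"]

p full_matches(/\u{61 62}/, candidates)       # multi-codepoint escape
p full_matches(/(a)\g<1>/, candidates)        # subexpression call
p full_matches(/(a)(?(1)b|c)/, candidates)    # conditional
p full_matches(/\0/, candidates)              # null character escape
p full_matches(/(?u:\w)/, candidates)         # unicode word chars
p full_matches(/(?a)[[:word:]]/, candidates)  # ascii word chars
p full_matches(/\X/, candidates)              # extended grapheme cluster
```

A filter like this is only as good as its candidate pool, so it is better suited to verifying expectations than to replacing a parser-driven generator.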

then there are some other gems (cough by me cough) that might be helpful and would benefit from contributors:

regexp_property_values reads out the codepoints matched by property or posix expressions directly from Ruby via a C API. might allow getting rid of the versioned codepoint databases in this gem. also works with old Rubies.

character_set calculates matched codepoints, e.g. of bracket expressions, in C. could be a performance boost or at least abstract away that part.

all three of these gems can be seen in use in js_regex.

i'll understand if you want to keep regexp-examples without dependencies, but feel free to take a look around this stuff.

jaynetics, Jul 27 '19 14:07

Hi Janosch, thank you for the comments.

Firstly, I'd like to explain that this project was originally created 5 years ago (!!), with the original intent of being a small personal challenge to generate examples for a very limited subset of regular expressions - e.g. /a*b+c?/ ... Over time, it gradually evolved into this "complete" solution -- and I'm well aware that the parser complexity has grown big enough to warrant being a gem of its own 😅

In retrospect yes, if I were to write the whole thing again today, I would almost certainly try to utilise some other library like regexp_parser rather than attempt to build my own -- but you have to understand the context in which this was written (i.e. my younger self, wanting to figure out the intricacies of regular expressions and reverse engineer it all myself).

And in addition, in hindsight, the test suite could be far more comprehensive by directly using the onigmo specs! (If this library "misbehaves" for a regexp that Onigmo itself does not test, it should probably be documented there first!)

I've seriously considered a major rewrite of the library to use some dependencies for a few years now, like you suggested, and release a v2.0 of this gem... The main thing holding me back, in all honesty, is that it's a lot of effort and will only really fix 'very obscure' issues (no one has raised any of the above bugs, in 5 years of the gem being public)

...But from a purist perspective, I'd love for this library to be 100% perfect. So let's talk about those libraries you mentioned:

Using regexp_parser would be a major change (though should make the code much simpler), but would only fix issues for extremely obscure syntax .... I'd certainly welcome a PR, and may even look into it myself one day, but it hasn't been at the top of my priorities.

The issues surrounding named properties and character sets, however, have bugged me for a long time - I hadn't seen your libraries (character_set and regexp_property_values) until today (they are much newer than this gem!), and at a glance they may serve well as a long-overdue solution to the problem! If you don't have time for a PR yourself, I'll definitely take a closer look at some point -- thanks! 😄

tom-lord, Jul 28 '19 12:07

Right now, regexp-examples can install without a natively compiled component. For me, that is a prerequisite to using it.

Grüße, Carsten

cabo, Jul 28 '19 21:07

@tom-lord i can relate well to the background of this gem. it's just how i started with js_regex, asking myself, how much work can it be? answer: a lot 😂

i could perhaps provide a PR to integrate regexp_parser, or maybe just integrate it for some syntax features to get an impression of what it would look like.

regarding the other libraries, using character_set might make most sense together with regexp_parser, as character_set can only parse the most basic bracket expressions on its own - no intersections, types, properties, posix classes etc.

using regexp_property_values on the other hand should probably be as simple as replacing CharSets::NamedPropertyCharMap with RegexpPropertyValues. as a side note, this would add support for some more spellings that are permitted by Onigmo, such as \p{symbol currency}, and would make updates to deal with new properties in future Rubies obsolete.
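The idea of reading matched codepoints out of the running Ruby, rather than shipping versioned codepoint databases, can be approximated in plain Ruby by brute force. This is only a sketch under stated assumptions: `codepoints_matching` is a hypothetical helper, not the API of regexp_property_values, and scanning every codepoint is far slower than a C-level lookup.

```ruby
# Collect the codepoints in `range` that the given regexp matches, using only
# the regexp engine of the running Ruby.
# `codepoints_matching` is a hypothetical name chosen for this illustration.
def codepoints_matching(regexp, range = 0..0xFFFF)
  range.select do |codepoint|
    # Surrogates are not valid Unicode scalar values; skip them.
    next false if (0xD800..0xDFFF).cover?(codepoint)

    regexp.match?([codepoint].pack('U'))
  end
end

currency = codepoints_matching(/\p{Sc}/)
puts currency.first(3).map { |cp| format('U+%04X', cp) }.join(' ')
```

Because the result comes from the interpreter itself, it follows whatever Unicode version that Ruby was built with, which is the property that would make manual database updates unnecessary.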

@cabo can you elaborate? are you using a non-C Ruby? regexp_parser is a pure Ruby library and the others include Ruby fallback code, so should work in all environments (although i've only tested jruby; other Rubies might require minor tweaks).

jaynetics, Jul 29 '19 08:07

Hi Janosch,

I’m trying to provide a software distribution that is used on Windows PCs as well as UNIX/Linux gear, not necessarily by dev types. Doing a native compile increases the complexity of getting this going significantly, so I have resolved to only use Ruby libraries. Yes, that means limitations such as using REXML instead of Nokogiri etc… Unfortunately, it is not sufficient that it “should work”, it has to actually work! Using any library that has a native dependency therefore is problematic for me.

Now I read “ragel” and thought that might be used to build a native library. It doesn’t look that way when I just install regexp_parser, but I didn’t have time to look at the libraries in detail. Sorry if I raised a false alarm.

(I still need a parser/translator from W3C regexps into Ruby’s… Unless I find anything, I probably will write that in the next few weeks.)

Grüße, Carsten


cabo, Jul 29 '19 10:07

(I still need a parser/translator from W3C regexps into Ruby’s… Unless I find anything, I probably will write that in the next few weeks.)

Well, almost two hundred weeks actually.

The project is now called iregexp, and you can find it at https://github.com/cabo/iregexp

cabo, Apr 29 '23 13:04

@cabo congratulations on your progress!

it is an interesting project that i imagine could be of interest in more domains, not just jsonpath.

a bit off-topic, but one quick thought on it:

supporting unicode properties in iregexp feels a bit like a "backup" made necessary by not supporting char-type escapes such as \d. i understand the interoperability reasons behind the latter decision, but i'm wondering about the practicality. the char type escapes are very well-established, and property escapes are little-known among developers. maybe extra rules for the translation to other flavors, such as replacing \d with an appropriate char set or property where necessary, would be acceptable to support them? unicode properties can also be a major drag on interoperability, and even where they are supported, they are not 100% interoperable anyway, as various environments tend to be on different unicode versions...
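A translation rule of the kind suggested here could look roughly like the sketch below. It is an illustration only, not iregexp's behavior: `expand_digit_escape` is a hypothetical name, and the sketch assumes that \d should mean ASCII digits, which is itself a choice that differs between flavors.

```ruby
# Rewrite \d into an explicit character set before handing a pattern to a
# flavor that has no char-type escapes.
# Assumption: \d is taken to mean ASCII digits only.
def expand_digit_escape(pattern)
  # Escapes are consumed as two-character units, so an escaped backslash
  # followed by a literal "d" is left untouched.
  pattern.gsub(/\\(.)/m) do
    Regexp.last_match(1) == 'd' ? '[0-9]' : Regexp.last_match(0)
  end
end

puts expand_digit_escape('v\d+\.\d+')
```

One known limitation of this sketch: a \d inside a bracket expression is turned into a nested set, which Onigmo accepts but many other flavors do not, so a real translator would need to track whether it is inside brackets.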

anyway, to get back on topic:

at some point in these past 4 years, i had a quick look at using regexp_parser in regexp-examples.

my impression was that it would be more work than building a similar example-generating tool from scratch, at least for me.

so from my point of view, this issue can be closed :)

if you ever end up using character_set or regexp_property_values @tom-lord, note that these do include C-extensions. the extensions are optional, and the fallback code has successfully been used in some non-C-Ruby environments, but I wouldn't vouch for them to work in every exotic setup, so a major version bump might be appropriate.
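The general shape of an optional extension with a pure-Ruby fallback can be sketched as follows. This does not reproduce the loading code of character_set or regexp_property_values; the library name `hypothetical_native_codepoints` and both method names are invented for illustration, and the require is expected to fail so that the fallback branch runs.

```ruby
# Try to load a native implementation and report which backend is in use.
# 'hypothetical_native_codepoints' is a made-up name that should not exist.
def codepoint_backend
  require 'hypothetical_native_codepoints'
  :native
rescue LoadError
  :pure_ruby
end

# Pure-Ruby fallback: slower, but needs no compiler on the target machine.
def ascii_codepoints_matching(regexp)
  (0..0x7F).select { |codepoint| regexp.match?(codepoint.chr) }
end

puts codepoint_backend
puts ascii_codepoints_matching(/[[:digit:]]/).length
```

Whether such a fallback is good enough for a given environment still has to be tested there, which is the reason for the caution about a major version bump.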

cheers! J

jaynetics, May 01 '23 17:05