better-cpp-syntax icon indicating copy to clipboard operation
better-cpp-syntax copied to clipboard

Rewrite of internal regex generation.

Open matter123 opened this issue 6 years ago β€’ 122 comments

The following example generates the tag meta.specifier.linkage.$14.cpp, however, there are only 13 capture groups

cpp_grammar[:extern_linkage_specifier] = extern_linkage_specifier = newPattern(
    match: std_space.then(
        match: /extern/,
        tag_as: "storage.type.extern"
    ).then(std_space).maybe(
        # This doesn't match the spec as the spec says any string literal is
        # allowed in a linkage specification. However, the spec also says that
        # linkage specification strings should match the name that is being linked.
        # It is unlikely that a language would have a double quote or newline
        # in its name.
        match: newPattern(
            match: /\"/,
            tag_as: "punctuation.definition.string.begin"
        ).then(
            match: zeroOrMoreOf(/[^\"]/),
            reference: "linkage"
        ).then(
            match: /\"/,
            tag_as: "punctuation.definition.string.end"
        ),
        tag_as: "string.quoted.double.extern"
    ),
    tag_as: "meta.specifier.linkage.$reference(linkage)"
)

According to regex 101 the correct group should have been group 12 https://regex101.com/r/doQVjJ/1

matter123 avatar Aug 03 '19 20:08 matter123

yeah this is going to be a pain to fix and verify

jeff-hykin avatar Aug 05 '19 01:08 jeff-hykin

What if, instead of processing things immediately, and producing a big regex value, the value was just saved.

Current behavior

value = /hello/.then(/world/) value >>> /hello(?:world)/

New Behavior

value = /hello/.then(/world/) value >>> /hello/ and value.next_pattern >>> { type: "then", arguments: /world/ }

Basically it waits, then once the .to_tag method is called it can iterate down the chain of calls, give each one a group number, and produce the final regex string. This would also make it possible to verify that references actually exist when they are called.

The arguments for the next pattern could also be stored in the hash. For example:

value = /hello/.then(
    match: /world/,
    at_least: 1.times
)

value >>> /hello/ and value.next_pattern >>>

{ 
    type: "then",
    arguments: {
        match: /world/,
        at_least: 1.times
    },
}

jeff-hykin avatar Aug 05 '19 01:08 jeff-hykin

So value.then(/foo/) would follow the linked list of next_pattern until nil was found, then place /foo/ there, correct?

matter123 avatar Aug 05 '19 01:08 matter123

I actually hadn't considered that, but yeah that's a good solution.

jeff-hykin avatar Aug 05 '19 02:08 jeff-hykin

On the internals side, all of the helpers (both then and .then forms) would only be doing the linked list insertion of their name and arguments. Then there would just be the to_tag method that handles the actual conversion logic. Pretty much everything else would be a helper function for to_tag.

jeff-hykin avatar Aug 05 '19 02:08 jeff-hykin

What are your thoughts of or optionally taking an array of patterns? The english-ness is reduced some but it fixes the issue of a.or(b).or(c).then(d) actually becoming a|b|cd.

Alternatively, this could be provided as .oneOf([...]).

matter123 avatar Aug 05 '19 05:08 matter123

I think the oneOf() would be great πŸ‘it would also fix the very ugly indentation of

/a/.or(
    /b/
).or(
    /c/
)

jeff-hykin avatar Aug 05 '19 17:08 jeff-hykin

To reiterate, the pattern class would have the following members:

  • type - the type of the pattern (then|or|backref|subcall)
    • This is most likely going to be used by the to_* methods (to_tag, to_s, to_r), is there a reason this can't be done through inheritance?
  • next_pattern - This forms a linked list of pattern objects
  • arguments - a copy of the arguments passed in
  • to_tag - converts the pattern to a textmate pattern
  • to_s - converts the pattern to its canonical ruby source representation

Additionally, I want to propose the following members

  • start_pattern - a reference to itself
    • This is so linters/sanity checkers can treat pattern ranges and patterns the same
  • to_r - converts this pattern to a Regexp
  • name - arguments.reference || arguments.tag_as || to_s
    • used for displaying an issue with this pattern

Potentially PatternRange could be a subclass of Pattern.

  • start_pattern and end_pattern would return their respective patterns
  • to_r would raise an error preventing the use of a pattern range in a pattern

matter123 avatar Aug 05 '19 20:08 matter123

I was thinking of modifying the Regexp class to keep the //.then() syntax. But it would be more professional to have a legit Pattern class. Pattern.new() isn't too bad. So I guess yeah lets go with an actual Pattern class. I guess we could always make all the //.then()'s return a Pattern object instead of a regex object.

I think the start_pattern, to_r, and name are great. I think to_s will be preferable for debugging, but it would still be good to have a name attribute. Making PatternRange as a subclass would be great too. It could make a lot of things more simple.

jeff-hykin avatar Aug 05 '19 20:08 jeff-hykin

I'd probably also add a regex attribute, and let arguments be all of the other options. so the //.then(/thing/) would have no arguments, but would have regex same with then(match: /thing/), no arguments, but would have regex

class Pattern
    @regex
    @type
    @arguments
    @next_regex

    def name
    end
    def to_tag
    end
    def to_s
    end
    def to_r
    end
    def start_pattern
    end
end

jeff-hykin avatar Aug 05 '19 20:08 jeff-hykin

my concern with the regex attribute is that patterns can be constructed from other patterns

a = Pattern.new(
  match: zeroOrMoreOf(/\*/).then(match: /&/, at_most: 2.times),
)

what would a.regex be?

Edit: is regex a renaming of match? So a.regex would be a reference to the zeroOrMoreOf(...) ?

matter123 avatar Aug 05 '19 21:08 matter123

I took the class skeleton you provided and made a very basic proof of concept. Captures are not yet implemented and only the plain form of .then() is supported. You can find it in source/pattern.rb of the Tool/misc/pattern branch.

Example output

regex:
/(abc)(?:def)/

tag:
{:match=>"(abc)(?:def)", :captures=>nil, :name=>"abc"}

name:
abc

canonical form:
Pattern.new(
  match: /abc/,
  tag_as: abc,
  reference: abc,
).then(
  match: /def/,
)
puts "regex:"
puts test.to_r.inspect
puts "\ntag:"
puts test.to_tag
puts "\nname:"
puts test.name
puts "\ncanonical form:"
puts test

Edit: The structure of the Pattern class is starting to take shape. then (basic) and maybe chaining functions are supported. Adding a new expression type can be really simple (13 lines for maybe). It has support for rendering a Pattern as a string. The Pattern constructor raises an exception when passed in a Regexp with capture groups.

matter123 avatar Aug 05 '19 22:08 matter123

my concern with the regex attribute is that patterns can be constructed from other patterns

a = Pattern.new(
  match: zeroOrMoreOf(/\*/).then(match: /&/, at_most: 2.times),
)

I missed this message. I think the regex would be the to_r of zeroOrMoreOf(/\*/).then(match: /&/, at_most: 2.times)

jeff-hykin avatar Aug 08 '19 23:08 jeff-hykin

I took the class skeleton you provided and made a very basic proof of concept. Captures are not yet implemented and only the plain form of .then() is supported. You can find it in source/pattern.rb of the Tool/misc/pattern branch.

Thank you! I'll check it out

jeff-hykin avatar Aug 08 '19 23:08 jeff-hykin

Wow you did a ton of work on Tool/misc/pattern! That debugging output looks fantastic. Lets just develop the full solution on that branch.

What I can do is change all of the newPattern's into Pattern.new on a separate brach and merge it into master. Then the Tool/misc/pattern can be tested until it performs the same as the old method. At that point it can seamlessly replace the old method.

jeff-hykin avatar Aug 09 '19 00:08 jeff-hykin

In the next hour I should have a rewritten capture method to hopefully allow for sub expressions to work.

In particular the recursiveness of the design makes determining final group numbers impossible without carraying some capture info along the recursive call chain.

matter123 avatar Aug 09 '19 00:08 matter123

Backreference support has been added.

matter123 avatar Aug 09 '19 01:08 matter123

Removed the != nil on a bunch of checks per https://rubystyle.guide/#no-non-nil-checks, seperated the concept of group info and capture hash.

I haven't figured out why but the following snippet makes an odd output

test = Pattern.new(
    match: Pattern.new(/abc/).then(match: /aaa/, tag_as: "aaa"),
    tag_as: "abc",
    reference: "abc"
).maybe(/def/).then(
    match: /ghi/,
    tag_as: "ghi"
).lookAheadFor(/jkl/).matchResultOf("abc")

test2 = test.then(/mno/)
puts test2

Output:

Pattern.new(
  match: Pattern.new(
      match: /abc/,
    ).then(
      match: /aaa/,
      tag_as: aaa,
    ),
  tag_as: abc,
  reference: abc,
).maybe(
  match: /def/,
).then(
  match: /ghi/,
  tag_as: ghi,
).lookAround(
  match: /jkl/,
  type: :lookAheadFor,
).matchResultOf("{:backreference_key=>"abc", :match=>/[:backreference:abc:]/}").then(
  match: /mno/,
)

Edit: this was because the __deep_clone__ method passes in a hash and the initialize function was not expecting that

matter123 avatar Aug 09 '19 07:08 matter123

Thanks for using the style guide, I actually do prefer the nil checks for 2 reasons though.
1. I want false to pass the test, e.g. I only wanted to check for existence not truthy-ness.
2. if variable requires you to know what the language considers as "truthy", and that varies quite a bit from language to language. Ruby considers 0 to be true, which most programmers probably wouldn't expect. Its also non-obvious or hard to intuitively remember if empty lists, empty strings, or empty objects are considered "truthy" in Ruby. I prefer the != nil since I think it tells people exactly whats happening even if they don't know Ruby. Very few people would expect 0 == nil, except maybe Brendan Eich.

I don't recommend this to you, but, if I could enforce it on myself, I'd probably actually prefer if variable == true and if variable == false despite its redundant nature.

However, it would probably be best to get rid of almost all of the != nil in favor of actual .is_a?(Type) checks to make sure the data is the structure we were looking for instead of just not nil.

jeff-hykin avatar Aug 09 '19 16:08 jeff-hykin

So if @arguments[:tag_as] would become if @arguments[:tag_as].is_a? String ?

matter123 avatar Aug 09 '19 16:08 matter123

Hmm that is a good question. The validity of @arguments[:tag_as] (e.g. checking that its a string) would be checked at the initializer. Having two String checks would be redundant, and if for some reason we wanted to change it, we'd then have to change the check in two places and one would be easy to forget and would cause an edge case bug.

I'd say thats actually a good place to do a != nil check, since we only want to make sure that the value exists

jeff-hykin avatar Aug 09 '19 16:08 jeff-hykin

Also, if you want to get rid of the optimization of the outer group, I think thats fine. I doubt it provides a performance benefit, I originally did it so that Ruby could generate JSON patterns that looked more like they were hand-written. The optimization group adds complexity, and is probably the reason this issue #339 was caused in the first place. If you don't think its a big deal to keep it though thats fine with me too

jeff-hykin avatar Aug 09 '19 16:08 jeff-hykin

It already exists in the pattern class and having the group attributes carry around what their group number is, minimizes a lot of the risk of forgetting to pass into some helper was_first_group_removed.

matter123 avatar Aug 09 '19 16:08 matter123

I wasn't expecting all of the MaybePattern < Pattern classes, but I think thats a very clean way of doing it by overriding the do_modify_regex and do_get_to_s_name, great thinking πŸ‘

jeff-hykin avatar Aug 09 '19 16:08 jeff-hykin

Oh and btw, the :repository_name does not need to be carried over. Thats something that should probably not exist.

jeff-hykin avatar Aug 09 '19 16:08 jeff-hykin

Screen Shot 2019-08-09 at 12 08 22 PM This is genius, I was just planning on using the remove_numbered_capture_groups

jeff-hykin avatar Aug 09 '19 17:08 jeff-hykin

I think for the MaybePattern we can do similar to what you've done with the LookAroundPattern

Quantifiers and groups can be multiple separate options, grouped, not-grouped, capturing group, non-capturing group, atomic group, non-atomic, and then all possible quantifier permutations.

I think we can handle all of that logic in one place, and the sub pattern-class, like MaybePattern can just set those options and save its output type for debugging.

jeff-hykin avatar Aug 09 '19 17:08 jeff-hykin

for the @regex, lets change it to @match which will make more readable sense. We can also add a to_r method to the Regexp class so that we don't have to check the type e.g. we can just call @match.to_r and get the regex form

we can change @next_regex to @next_match as well

jeff-hykin avatar Aug 09 '19 18:08 jeff-hykin

Sounds good.

Maybe @next_pattern? That is always a pattern.

matter123 avatar Aug 09 '19 18:08 matter123

oh, if its always a pattern then yeah @next_pattern would be best

jeff-hykin avatar Aug 09 '19 18:08 jeff-hykin