Rewrite of internal regex generation.
The following example generates the tag meta.specifier.linkage.$14.cpp, however, there are only 13 capture groups
cpp_grammar[:extern_linkage_specifier] = extern_linkage_specifier = newPattern(
match: std_space.then(
match: /extern/,
tag_as: "storage.type.extern"
).then(std_space).maybe(
# This doesn't match the spec as the spec says any string literal is
# allowed in a linkage specification. However, the spec also says that
# linkage specification strings should match the name that is being linked.
# It is unlikely that a language would have a double quote or newline
# in its name.
match: newPattern(
match: /\"/,
tag_as: "punctuation.definition.string.begin"
).then(
match: zeroOrMoreOf(/[^\"]/),
reference: "linkage"
).then(
match: /\"/,
tag_as: "punctuation.definition.string.end"
),
tag_as: "string.quoted.double.extern"
),
tag_as: "meta.specifier.linkage.$reference(linkage)"
)
According to regex 101 the correct group should have been group 12 https://regex101.com/r/doQVjJ/1
yeah this is going to be a pain to fix and verify
What if, instead of processing things immediately, and producing a big regex value, the value was just saved.
Current behavior
value = /hello/.then(/world/)
value >>> /hello(?:world)/
New Behavior
value = /hello/.then(/world/)
value >>> /hello/
and value.next_pattern >>> { type: "then", arguments: /world/ }
Basically it waits, then once the .to_tag method is called it can iterate down the chain of calls, give each one a group number, and produce the final regex string. This would also make it possible to verify that references actually exist when they are called.
The arguments for the next pattern could also be stored in the hash. For example:
value = /hello/.then(
match: /world/,
at_least: 1.times
)
value >>> /hello/
and value.next_pattern >>>
{
type: "then",
arguments: {
match: /world/,
at_least: 1.times
},
}
So value.then(/foo/) would follow the linked list of next_pattern until nil was found, then place /foo/ there, correct?
I actually hadn't considered that, but yeah that's a good solution.
On the internals side, all of the helpers (both then and .then forms) would only be doing the linked list insertion of their name and arguments. Then there would just be the to_tag method that handles the actual conversion logic. Pretty much everything else would be a helper function for to_tag.
What are your thoughts of or optionally taking an array of patterns? The english-ness is reduced some but it fixes the issue of a.or(b).or(c).then(d) actually becoming a|b|cd.
Alternatively, this could be provided as .oneOf([...]).
I think the oneOf() would be great πit would also fix the very ugly indentation of
/a/.or(
/b/
).or(
/c/
)
To reiterate, the pattern class would have the following members:
type- the type of the pattern (then|or|backref|subcall)- This is most likely going to be used by the to_* methods (to_tag, to_s, to_r), is there a reason this can't be done through inheritance?
next_pattern- This forms a linked list of pattern objectsarguments- a copy of the arguments passed into_tag- converts the pattern to a textmate patternto_s- converts the pattern to its canonical ruby source representation
Additionally, I want to propose the following members
start_pattern- a reference to itself- This is so linters/sanity checkers can treat pattern ranges and patterns the same
to_r- converts this pattern to a Regexpname-arguments.reference || arguments.tag_as || to_s- used for displaying an issue with this pattern
Potentially PatternRange could be a subclass of Pattern.
start_patternandend_patternwould return their respective patternsto_rwould raise an error preventing the use of a pattern range in a pattern
I was thinking of modifying the Regexp class to keep the //.then() syntax. But it would be more professional to have a legit Pattern class. Pattern.new() isn't too bad. So I guess yeah lets go with an actual Pattern class. I guess we could always make all the //.then()'s return a Pattern object instead of a regex object.
I think the start_pattern, to_r, and name are great. I think to_s will be preferable for debugging, but it would still be good to have a name attribute. Making PatternRange as a subclass would be great too. It could make a lot of things more simple.
I'd probably also add a regex attribute, and let arguments be all of the other options.
so the //.then(/thing/) would have no arguments, but would have regex
same with then(match: /thing/), no arguments, but would have regex
class Pattern
@regex
@type
@arguments
@next_regex
def name
end
def to_tag
end
def to_s
end
def to_r
end
def start_pattern
end
end
my concern with the regex attribute is that patterns can be constructed from other patterns
a = Pattern.new(
match: zeroOrMoreOf(/\*/).then(match: /&/, at_most: 2.times),
)
what would a.regex be?
Edit: is regex a renaming of match? So a.regex would be a reference to the zeroOrMoreOf(...) ?
I took the class skeleton you provided and made a very basic proof of concept.
Captures are not yet implemented and only the plain form of .then() is supported. You can find it in source/pattern.rb of the Tool/misc/pattern branch.
Example output
regex:
/(abc)(?:def)/
tag:
{:match=>"(abc)(?:def)", :captures=>nil, :name=>"abc"}
name:
abc
canonical form:
Pattern.new(
match: /abc/,
tag_as: abc,
reference: abc,
).then(
match: /def/,
)
puts "regex:"
puts test.to_r.inspect
puts "\ntag:"
puts test.to_tag
puts "\nname:"
puts test.name
puts "\ncanonical form:"
puts test
Edit: The structure of the Pattern class is starting to take shape. then (basic) and maybe chaining functions are supported. Adding a new expression type can be really simple (13 lines for maybe). It has support for rendering a Pattern as a string. The Pattern constructor raises an exception when passed in a Regexp with capture groups.
my concern with the regex attribute is that patterns can be constructed from other patterns
a = Pattern.new( match: zeroOrMoreOf(/\*/).then(match: /&/, at_most: 2.times), )
I missed this message. I think the regex would be the to_r of zeroOrMoreOf(/\*/).then(match: /&/, at_most: 2.times)
I took the class skeleton you provided and made a very basic proof of concept. Captures are not yet implemented and only the plain form of
.then()is supported. You can find it insource/pattern.rbof theTool/misc/patternbranch.
Thank you! I'll check it out
Wow you did a ton of work on Tool/misc/pattern! That debugging output looks fantastic. Lets just develop the full solution on that branch.
What I can do is change all of the newPattern's into Pattern.new on a separate brach and merge it into master. Then the Tool/misc/pattern can be tested until it performs the same as the old method. At that point it can seamlessly replace the old method.
In the next hour I should have a rewritten capture method to hopefully allow for sub expressions to work.
In particular the recursiveness of the design makes determining final group numbers impossible without carraying some capture info along the recursive call chain.
Backreference support has been added.
Removed the != nil on a bunch of checks per https://rubystyle.guide/#no-non-nil-checks, seperated the concept of group info and capture hash.
I haven't figured out why but the following snippet makes an odd output
test = Pattern.new(
match: Pattern.new(/abc/).then(match: /aaa/, tag_as: "aaa"),
tag_as: "abc",
reference: "abc"
).maybe(/def/).then(
match: /ghi/,
tag_as: "ghi"
).lookAheadFor(/jkl/).matchResultOf("abc")
test2 = test.then(/mno/)
puts test2
Output:
Pattern.new(
match: Pattern.new(
match: /abc/,
).then(
match: /aaa/,
tag_as: aaa,
),
tag_as: abc,
reference: abc,
).maybe(
match: /def/,
).then(
match: /ghi/,
tag_as: ghi,
).lookAround(
match: /jkl/,
type: :lookAheadFor,
).matchResultOf("{:backreference_key=>"abc", :match=>/[:backreference:abc:]/}").then(
match: /mno/,
)
Edit: this was because the __deep_clone__ method passes in a hash and the initialize function was not expecting that
Thanks for using the style guide, I actually do prefer the nil checks for 2 reasons though.
1. I want false to pass the test, e.g. I only wanted to check for existence not truthy-ness.
2. if variable requires you to know what the language considers as "truthy", and that varies quite a bit from language to language. Ruby considers 0 to be true, which most programmers probably wouldn't expect. Its also non-obvious or hard to intuitively remember if empty lists, empty strings, or empty objects are considered "truthy" in Ruby. I prefer the != nil since I think it tells people exactly whats happening even if they don't know Ruby. Very few people would expect 0 == nil, except maybe Brendan Eich.
I don't recommend this to you, but, if I could enforce it on myself, I'd probably actually prefer if variable == true and if variable == false despite its redundant nature.
However, it would probably be best to get rid of almost all of the != nil in favor of actual .is_a?(Type) checks to make sure the data is the structure we were looking for instead of just not nil.
So if @arguments[:tag_as] would become if @arguments[:tag_as].is_a? String ?
Hmm that is a good question. The validity of @arguments[:tag_as] (e.g. checking that its a string) would be checked at the initializer. Having two String checks would be redundant, and if for some reason we wanted to change it, we'd then have to change the check in two places and one would be easy to forget and would cause an edge case bug.
I'd say thats actually a good place to do a != nil check, since we only want to make sure that the value exists
Also, if you want to get rid of the optimization of the outer group, I think thats fine. I doubt it provides a performance benefit, I originally did it so that Ruby could generate JSON patterns that looked more like they were hand-written. The optimization group adds complexity, and is probably the reason this issue #339 was caused in the first place. If you don't think its a big deal to keep it though thats fine with me too
It already exists in the pattern class and having the group attributes carry around what their group number is, minimizes a lot of the risk of forgetting to pass into some helper was_first_group_removed.
I wasn't expecting all of the MaybePattern < Pattern classes, but I think thats a very clean way of doing it by overriding the do_modify_regex and do_get_to_s_name, great thinking π
Oh and btw, the :repository_name does not need to be carried over. Thats something that should probably not exist.
This is genius, I was just planning on using the remove_numbered_capture_groups
I think for the MaybePattern we can do similar to what you've done with the LookAroundPattern
Quantifiers and groups can be multiple separate options, grouped, not-grouped, capturing group, non-capturing group, atomic group, non-atomic, and then all possible quantifier permutations.
I think we can handle all of that logic in one place, and the sub pattern-class, like MaybePattern can just set those options and save its output type for debugging.
for the @regex, lets change it to @match which will make more readable sense. We can also add a to_r method to the Regexp class so that we don't have to check the type e.g. we can just call @match.to_r and get the regex form
we can change @next_regex to @next_match as well
Sounds good.
Maybe @next_pattern? That is always a pattern.
oh, if its always a pattern then yeah @next_pattern would be best