better-cpp-syntax Publish a textmate gem

I'm creating this as a checklist for publishing the gem

[x] retagging feature
[x] create a wordCannotBe feature for patterns
[x] create a standard import pattern/export pattern method
[x] add a warning for when a repo name is being overwritten
[x] new pattern class errors when provided a regex that contains a capturing group
[x] add a check for newlines, issue a warning
[x] create a Pattern class instead of having newPattern use regex
[x] add a to_s method to Pattern and PatternRange
[x] remove the "smart" include behavior (.to_tag returning { include: "#repo_name" } )
[x] remove the @wrap_source
[ ] rename "while" to something more representative of its behavior
[x] remove the yaml export, only export as json
[x] create an "save_to" which exports the language tags, and json syntax
[ ] switch to snake_case for everything
[ ] add a bailout feature
[x] check/fix the tag_as: in oneOrMoreOf() or zeroOrMoreOf()
create a standard library of patterns
- [x] numerics
- [ ] block and parentheses finder
- [ ] strings and escapes
- [ ] inline comments
- [ ] standard space
add documentation
- [x] The Grammar class
  - [x] init arguments
  - $initial_context
  - setting repos
  - exporting
  - the named patterns (@spaces, @word)
- [ ] The Token class
  - That
    - using not()
  - RepresentationsThat
- [x] The Pattern class
  - [x] init arguments
    - [?] backreferences
    - [?] $match
    - [?] $reference
  - [x] all of the helper methods (then, or, zeroOrMoreOf, oneOrMoreOf, etc)
- [x] Pattern Range
  - [x] init arguments
- [ ] The standard library
add documentation to gitbook

Jul 08 '19 07:07 jeff-hykin

We may want to reevaluate generating *.tmLanguage.yaml files. With, afaik, no text editor or tool supporting yaml language, and with the JSON files being pretty-printed, I not sure what their purpose is.

Jul 19 '19 01:07 matter123

yeah. I think its safe to remove them, I only added yaml because I didn't pretty print the json.

Jul 25 '19 15:07 jeff-hykin

Fairly minor concern but

create a Pattern class instead of having newPattern use regex

would prevent /foo/.maybe(bar) from working, correct?

If so, there are 33 instances of that that would need to be replaced.

Jul 25 '19 19:07 matter123

Yeah, and that was the original reason that the pattern class was never created. If I made it a class the idea would be to clean everything up and let it be symbolic, and then only convert it to regex once to_tag was called. But I'm not sure it's worth all the trouble. It'd require a lot of library re writing just to make it a bit more straightforward under the hood

Jul 25 '19 19:07 jeff-hykin

I've also been debating the snake case switch. I prefer static data and like-static-data to be snake case and functions to be camel, but Ruby uses snake for both. If I switched it would only be for formatting consistency.

Jul 25 '19 19:07 jeff-hykin

It seems the majority of the uses for starting with a regexp is looking for a literal (17), with anchors being the nxt largest (6).

Jul 25 '19 20:07 matter123

The warning for newlines is probably too overly broad to be useful. In particular, the new include pattern trips up my initial warning.

Output:

There is a pattern that likely tries to match characters after \n
textmate grammars only operate on a single line, \n is the last possible character that can be matched.
Here is the pattern:
((?:^)((?:(?:(?>\s+)|(\/\*)((?>(?:[^\*]|(?>\*+)[^\/])*)((?>\*+)\/)))+?|(?:(?:(?:(?:\b|(?<=\W))|(?=\W))|\A)|\Z)))((#)\s*((?:(?:include|include_next)|import))\b)\s*(?:(?:(?:((<)[^>\n]*(>?)((?:(?:(?>\s+)|(\/\*)((?>(?:[^\*]|(?>\*+)[^\/])*)((?>\*+)\/)))+?|(?:(?:(?:(?:\b|(?<=\W))|(?=\W))|\A)|\Z)))(?:(?:\n|$)|(?=\/\/)))|((\")[^\"\n]*(\"?)((?:(?:(?>\s+)|(\/\*)((?>(?:[^\*]|(?>\*+)[^\/])*)((?>\*+)\/)))+?|(?:(?:(?:(?:\b|(?<=\W))|(?=\W))|\A)|\Z)))(?:(?:\n|$)|(?=\/\/))))|((?:[a-zA-Z_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))(?:[a-zA-Z0-9_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))*((?:(?:(?>\s+)|(\/\*)((?>(?:[^\*]|(?>\*+)[^\/])*)((?>\*+)\/)))+?|(?:(?:(?:(?:\b|(?<=\W))|(?=\W))|\A)|\Z)))(?:(?:\n|$)|(?=\/\/))))|((?:(?:(?>\s+)|(\/\*)((?>(?:[^\*]|(?>\*+)[^\/])*)((?>\*+)\/)))+?|(?:(?:(?:(?:\b|(?<=\W))|(?=\W))|\A)|\Z)))(?:(?:\n|$)|(?=\/\/))))

Newline detection:

# check for matching after \n
if /\\n(.*?)(?=!\||\\n)/ =~ regex_as_string
    if /[^\^$\[\]\(\)?:+*=!<>\\]/ =~ $1
        
        puts "\n\nThere is a pattern that likely tries to match characters after \\n\n"
        puts "textmate grammars only operate on a single line, \\n is the last possible character that can be matched.\n"
        puts "Here is the pattern:\n"
        puts regex_as_string
        # raise "Error: see printout above"
    end
end

Jul 26 '19 01:07 matter123

Does

check/fix the tag_as: in oneOrMoreOf() or zeroOrMoreOf()

Refer to the fact that the pattern below will only give the last foo the scope of test.foo? If so, I think adding a warning about that would harm usability, composing patterns should be able to work without having to call turn_of_numbered_capture_groups whenever you want to match the pattern multiple times.

grammar[:test] = newPattern(
    match: oneOrMoreOf(newPattern(match:/foo/, tag_as:"test.foo").maybe(@spaces)),
    tag_as: "test"
)

Jul 26 '19 17:07 matter123

Does

check/fix the tag_as: in oneOrMoreOf() or zeroOrMoreOf()

Refer to the fact that the pattern below will only give the last foo the scope of test.foo? If so, I think adding a warning about that would harm usability, composing patterns should be able to work without having to call turn_of_numbered_capture_groups whenever you want to match the pattern multiple times.
grammar[:test] = newPattern(
    match: oneOrMoreOf(newPattern(match:/foo/, tag_as:"test.foo").maybe(@spaces)),
    tag_as: "test"
)

Yeah that was the problem I was referring to. I was thinking of automatically disabling the tags that are inside the oneOrMoreOf() or zeroOrMoreOf() and then automatically adding an includes: that has the repeated pattern one time, which I believe will then correctly tag all of the repetitions.

Jul 27 '19 01:07 jeff-hykin

\\n(.*?)(?=!\||\\n)

I was thinking it would be a warning just like the zero-length match warning. Its given, but it can be disabled because sometimes it's valid.

That said I'm not sure why there's a [^>\n] inside the pattern. There's other \n's too but [^>\n] should just be [^>]

Jul 27 '19 01:07 jeff-hykin

Fairly minor concern but

create a Pattern class instead of having newPattern use regex

would prevent /foo/.maybe(bar) from working, correct?

If so, there are 33 instances of that that would need to be replaced.

We could just do

class Pattern < Regexp
    def initialize(arguments)
        self.then(arguments)
    end
end

That we could use Pattern.new in the guides/explanations to be consistent with PatternRange.new()

Jul 27 '19 01:07 jeff-hykin

I think theres no way to do the use the turnOffNumberedCaptureGroups on every regex pattern without having a refactored Pattern class. Right now there's no way to fully tell the difference between pure regex and pattern-combined regex.

The next closest thing would be to count the number of capture groups there should be, and the number of capture groups their actually are.

The to_s method (for printing out a pattern the same way it looked in the code) can't be done without a refactored Pattern class either.

Jul 27 '19 01:07 jeff-hykin

That we could use Pattern.new in the guides/explanations to be consistent with PatternRange.new()

That sounds good.

I was thinking it would be a warning just like the zero-length match warning. Its given, but it can be disabled because sometimes it's valid.

That said I'm not sure why there's a [^>\n] inside the pattern. There's other \n's too but [^>\n] should just be [^>]

I'll clean up the include pattern, add the ability to disable the warning and push that branch.

Jul 27 '19 01:07 matter123

Another check that might be worth considering is wrong ordering of patterns with a common prefix.

If a pattern set contains the following patterns:

...
[A-Z_]+       # pattern 1
[A-Z_]+\s+foo # pattern 2
...

Pattern 2 will never match because it has a common prefix with pattern 1 and pattern 1 is first.

Aug 03 '19 19:08 matter123

Another check that might be worth considering is wrong ordering of patterns with a common prefix.

If a pattern set contains the following patterns:
...
[A-Z_]+       # pattern 1
[A-Z_]+\s+foo # pattern 2
...
Pattern 2 will never match because it has a common prefix with pattern 1 and pattern 1 is first.

Do you think this is detectable? I guess check for a perfect subset at the beginning of two patterns.

Aug 05 '19 01:08 jeff-hykin

Yeah,

for any pattern A of the pattern set 2..N:
	for each pattern B in the range 1..A-1:
		if B is a prefix of A, issue a warning/error

It would need some handling for extra parens, but even without that, it should catch the common case of [patternA, patternA.then(patternB)].

Aug 05 '19 01:08 matter123

the zeroLengthStart is a bit over-cautious, in particular, \G is usually intentional, and as such \G should not be enough to trigger the warning.

One possible solution is to do what vscode-textmate does when attempting to not evaluate \G, that is, replace \G with \uFFFF.

Aug 17 '19 02:08 matter123

Depending on whether you plan on supporting generating atom themes, you may want to split tag_as by the space character and generate multiple captures. See https://github.com/microsoft/vscode-textmate/issues/108#issuecomment-522197256

Aug 17 '19 02:08 matter123

It's somewhat of an odd case, but should the grammar generator detect when you are using a start of line anchor inside a sub match, and then rewrite it so that it works?.

Example: Orignal (non working):

Pattern.new(
  match: /\W+/,
  includes: [
    Pattern.new(match: /^bar/, tag_as: "bar")
  ]
)

Rewritten (working)

Pattern.new(
  match: /\W+/,
  includes: [
    PatternRange.new(
      start_pattern: lookAheadFor(/./),
      end_pattern: /$/,
      includes: [
        Pattern.new(match: /\Gbar/, tag_as: "bar")
      ]
    )
  ]
)

It is fairly unintuitive that /^/ doesn't work inside sub captures. And equally unituitive that the solution is to wrap it up in a PatteernRange.

Aug 17 '19 22:08 matter123

Depending on whether you plan on supporting generating atom themes, you may want to split tag_as by the space character and generate multiple captures. See microsoft/vscode-textmate#108 (comment)

I think this is a perfect situation for the library to handle (to support Atom), but it is low priority so we should put the effort in after the gem is published.

Aug 22 '19 22:08 jeff-hykin

It's somewhat of an odd case, but should the grammar generator detect when you are using a start of line anchor inside a sub match, and then rewrite it so that it works?.

Example: Orignal (non working):
Pattern.new(
  match: /\W+/,
  includes: [
    Pattern.new(match: /^bar/, tag_as: "bar")
  ]
)
Rewritten (working)
Pattern.new(
  match: /\W+/,
  includes: [
    PatternRange.new(
      start_pattern: lookAheadFor(/./),
      end_pattern: /$/,
      includes: [
        Pattern.new(match: /\Gbar/, tag_as: "bar")
      ]
    )
  ]
)
It is fairly unintuitive that /^/ doesn't work inside sub captures. And equally unituitive that the solution is to wrap it up in a PatteernRange.

Yeah I think this is another perfect case for the library to handle. Like with the Atom support though, handle it after the gem is published

Aug 22 '19 22:08 jeff-hykin

Since quite a few check items and potential additions perform transformations on the patterns, there should be some way of enabling logging whenever a transformation occurs.

Aug 22 '19 23:08 matter123

I think I'm going to try to publish this as a pre-alpha gem sometime this week. Its the first week of the semester and I want to try to re-structure this repo (get each language in a separate repo, and coordinate the change with Alexr00) before the semester gets into full swing, even though the checklist is far from done

The only thing I want to change before that is the import/export method. I think I'm going to change it to something like

require 'textmate_tools'

grammar = Grammer.get_exportable_grammar
grammar.exports = [
    :repo_name_1,
    :repo_name_2,
]
grammar.external_repos = [
    :std_space,
    :variable,
]

grammar[:repo_1] = Pattern.new()

The .exports are the names of the repos that should be exported The .external_repos are the ones you expect to already be defined by something else Every time that grammar sees a repo name (even in the includes of a pattern), if its not in the .exports and not in the .external_repos, it will create a mapping from that repo name to a randomly generated number so that the name never conflicts with a name from another package.

Aug 25 '19 08:08 jeff-hykin

That sounds pretty nice. I have two concerns:

The mapping should probably be stateless;
Rerunning npm run build should not change the generated grammar(builds should be reproducible)

so I'm thinking:

Grammar#get_exportable_grammar returns a new ExportGrammar that contains a reproducible salt (hash of the relative filename, some "export name", etc")
whenever the ExportGrammar encounters a name that is not in exports or external_repos, the name is prepended with the salt

Aug 25 '19 16:08 matter123

Yeah salt based on file name/path would be ideal

Aug 25 '19 17:08 jeff-hykin

With a unified export, to another grammar or to a file, how does an import look like?

Aug 25 '19 23:08 matter123

I was actually thinking they would be different (exporting a grammar to JSON vs exporting patterns)

Aug 27 '19 05:08 jeff-hykin

Ok so: Grammar::import(path) imports a json file to a grammar Grammar#save(path) saves a grammar to a json file Grammar#import(path_or_grammar_object) merge another grammar into this one Grammar::get_exportable_grammar returns a new ExportableGrammar

Aug 27 '19 06:08 matter123

Is importing a grammar from a path a useful feature? As it requires either converting a tag back into a pattern, or allow the grammar to store patterns as a tag.

Aug 27 '19 20:08 matter123

These were the only ones I was planning on having

a_grammar.import(path)
a_grammar.save_to(tm_language_path)
Grammar.get_exportable_grammar()

I think importing from a file is ideal because it prevents namespace pollution and guessing.

For example with with Doxygen, I didn't know what variables/functions it contributed. I had to open up the file and actually read the whole thing to realize it was really just exporting the doxygen function. When external files contribute common names like line_comment, they run the risk of accidentally overriding or accidentally getting overridden by something in generate.rb.

Aug 28 '19 16:08 jeff-hykin