grammars-v4 Add grammars-v4 dir indexer

Open parrt opened this issue 1 year ago • 23 comments

Hi. (I screwed up and pushed to master despite having a git branch command in my history. Weird.) Related to supporting lab.antlr.org and https://github.com/antlr/antlr4-lab/issues/11...

@teverett @kaby76 @KvanTTT please take a look at _scripts/mkindex.py, which generates a json list from a grammars-v4 path name:

$ python mkindex.py .. | jq
[
  {
    "name": "regex",
    "lexer": "https://raw.githubusercontent.com/antlr/grammars-v4/master/xsd-regex/regexLexer.g4",
    "parser": "https://raw.githubusercontent.com/antlr/grammars-v4/master/xsd-regex/regexParser.g4",
    "start": "root",
    "example": "example-chargroup-sub3.txt"
  },
  {
    "name": "abb",
    "lexer": "https://raw.githubusercontent.com/antlr/grammars-v4/master/abb/abbLexer.g4",
    "parser": "https://raw.githubusercontent.com/antlr/grammars-v4/master/abb/abbParser.g4",
    "start": "module",
    "example": "robdata.sys"
  },
  {
    "name": "DGS",
    "lexer": "https://raw.githubusercontent.com/antlr/grammars-v4/master/graphstream-dgs/DGSLexer.g4",
    "parser": "https://raw.githubusercontent.com/antlr/grammars-v4/master/graphstream-dgs/DGSParser.g4",
    "start": "dgs",
    "example": "removeAttribute.dgs"
  },
  {
    "name": "SwiftFin",
    "lexer": "https://raw.githubusercontent.com/antlr/grammars-v4/master/swift-fin/SwiftFinLexer.g4",
    "parser": "https://raw.githubusercontent.com/antlr/grammars-v4/master/swift-fin/SwiftFinParser.g4",
    "start": "messages",
    "example": "test1.txt"
  },
  {
    "name": "Lucene",
    "lexer": "https://raw.githubusercontent.com/antlr/grammars-v4/master/lucene/LuceneLexer.g4",
    "parser": "https://raw.githubusercontent.com/antlr/grammars-v4/master/lucene/LuceneParser.g4",
    "start": "topLevelQuery",
    "example": "boolean-3.txt"
  },
  {
    "name": "Cql",
    "lexer": "https://raw.githubusercontent.com/antlr/grammars-v4/master/cql3/CqlLexer.g4",
    "parser": "https://raw.githubusercontent.com/antlr/grammars-v4/master/cql3/CqlParser.g4",
    "start": "root",
    "example": "createIndex.cql"
  },
...
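
For readers following along, here is a minimal sketch of how one entry like those above might be assembled. It is not the actual _scripts/mkindex.py. The function name `index_entry`, the constant `RAW_BASE`, the `start` argument, and the rule of deriving "name" from the parser file name are all assumptions made for illustration.

```python
# Assumed base URL, taken from the entries listed above.
RAW_BASE = "https://raw.githubusercontent.com/antlr/grammars-v4/master"

def index_entry(rel_dir, grammar_files, example_files, start):
    """Build one index entry, or return None if the directory doesn't fit.

    rel_dir       -- directory relative to the repo root, e.g. "abb"
    grammar_files -- names of the .g4 files found in that directory
    example_files -- names of the files found in its examples directory
    start         -- start rule name (hypothetical argument; the caller supplies it)
    """
    lexers = [f for f in grammar_files if f.endswith("Lexer.g4")]
    parsers = [f for f in grammar_files if f.endswith("Parser.g4")]
    # Assumption: only split lexer/parser grammars with at least one example qualify.
    if len(lexers) != 1 or len(parsers) != 1 or not example_files:
        return None
    return {
        "name": parsers[0][: -len("Parser.g4")],
        "lexer": f"{RAW_BASE}/{rel_dir}/{lexers[0]}",
        "parser": f"{RAW_BASE}/{rel_dir}/{parsers[0]}",
        "start": start,
        "example": sorted(example_files)[0],
    }

entry = index_entry("abb", ["abbLexer.g4", "abbParser.g4"], ["robdata.sys"], "module")
```

Passing plain file-name lists rather than walking the disk keeps the sketch easy to test; a real indexer would collect those lists while traversing the repo.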

Oct 07 '22 21:10 parrt

Hey @parrt great :)

Your intention is to use this to generate a JSON file for the lab tool? Could we also use it to generate a human-readable index? Could that human-readable index be automatically included in the readme.md?

Oct 08 '22 01:10 teverett

Your intention is to use this to generate a JSON file for the lab tool?

yep! @kaby76 has a bash script now.

Could trivially generate an index for the readme! I gen a list of dictionaries with info per grammar. Can gen markdown instead of json easily.
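
A rough sketch of that idea, assuming each dictionary has the "name", "lexer", "parser", and "start" keys shown in the JSON above. The function name `to_markdown` and the table layout are illustrative only, not what the indexer actually emits.

```python
def to_markdown(entries):
    """Render index entries as a Markdown table, one row per grammar, sorted by name."""
    lines = [
        "| Grammar | Lexer | Parser | Start rule |",
        "| --- | --- | --- | --- |",
    ]
    for e in sorted(entries, key=lambda e: e["name"].lower()):
        lines.append(
            f"| {e['name']} | [lexer]({e['lexer']}) | [parser]({e['parser']}) | `{e['start']}` |"
        )
    return "\n".join(lines)

sample = [
    {"name": "Lucene", "lexer": "L.g4", "parser": "P.g4", "start": "topLevelQuery"},
    {"name": "abb", "lexer": "aL.g4", "parser": "aP.g4", "start": "module"},
]
table = to_markdown(sample)
```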

Oct 08 '22 01:10 parrt

It would be amazing to generate an index into the readme.md. That way we could take on the refactoring @KOLANICH has suggested and still enable people to find their favourite grammar from the main page. It would actually be a significant improvement over what we have today since, for example, there are assembler grammars under /asm which people may not realize are there.

Oct 08 '22 01:10 teverett

could a github action trigger a reindex too?

Oct 08 '22 02:10 parrt

well i've never done that. Perhaps.

Oct 08 '22 02:10 teverett

Yes, GitHub Actions could trigger a build of the index, but the result would need to be checked back into the tree without triggering a whole new build and check-in ad infinitum. I'll try to play around with this.

Oct 08 '22 12:10 kaby76

Honestly, I don't think such a description file is required at all. It's possible just to traverse directories and discover grammar files there. We have a single pattern for grammar files and the examples directory. Moreover, we have pom.xml files that also help (they contain info about the root rule, the examples directory, and so on).
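
A small sketch of pulling that information out of a pom.xml with the standard library. The element names `grammarName`, `entryPoint`, and `exampleFiles` are assumed here to be the ones the test plugin's configuration uses; check them against the actual poms. `pom_settings` and `local` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]

def pom_settings(pom_text, wanted=("grammarName", "entryPoint", "exampleFiles")):
    """Return the first non-empty text found for each wanted element name."""
    root = ET.fromstring(pom_text)
    found = {}
    for el in root.iter():
        name = local(el.tag)
        if name in wanted and name not in found and el.text and el.text.strip():
            found[name] = el.text.strip()
    return found

# Assumed, simplified pom layout for illustration.
pom = (
    '<project xmlns="http://maven.apache.org/POM/4.0.0"><build><plugins><plugin>'
    "<configuration><entryPoint>module</entryPoint><grammarName>abb</grammarName>"
    "<exampleFiles>examples/</exampleFiles></configuration>"
    "</plugin></plugins></build></project>"
)
settings = pom_settings(pom)
```

Matching on the local tag name sidesteps the Maven namespace, at the cost of picking up the first matching element anywhere in the file.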

Oct 08 '22 16:10 KvanTTT

@KvanTTT Well, I have an implementation that works off a generated grammars-v4.js file, which is equivalent to an index of the grammars-v4 repo in json format. If you can write an implementation of antlr4-lab that works over the wire, we can look at it and compare. Note, I was planning to make a separate implementation that gathers information over the internet from the poms, but I wanted to first follow through on this design.

Oct 08 '22 17:10 kaby76

I think we'll go with a generated index if we can. A casual user looking for a grammar is likely to make use of it, and perhaps less likely to click through directories searching.

@kaby76 full support for the GH Actions work you're doing.

Oct 08 '22 20:10 teverett

Plus I need something I can simply download from the antlr lab to get the files.

Oct 08 '22 20:10 parrt

I think I found the trick to do a check-in of the generated index file. See this PR. The current build doesn't kick it off because it's in a separate workflow (here). But, presumably once the workflow is added, it should work. I tested it on my own repo.

Oct 09 '22 23:10 kaby76

@kaby76 could it also generate markdown?

Oct 10 '22 02:10 teverett

@teverett Yes, we can add a script to build some markdown.

Oct 10 '22 09:10 kaby76

The indexer isn't quite working yet. Something's wrong with the workflow, even though it works on a semi-duplicated "grammars-v4" repo over in my github.com account (https://github.com/kaby76/temp-with-actions). https://github.com/antlr/grammars-v4/pull/2881

Oct 11 '22 13:10 kaby76

"git diff --quiet" returns 0 even if there are untracked files. Git bites me again.
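
One possible way around this, sketched in Python rather than in the workflow's own shell: `git status --porcelain` prints a line for every change, including untracked files (marked `??`), so empty output means there is nothing to commit. This is only an illustration, not necessarily the fix that went into the workflow; `has_changes` and `tree_is_dirty` are hypothetical names.

```python
import subprocess

def has_changes(porcelain_output):
    """True if `git status --porcelain` output lists any entry, untracked ('??') included."""
    return any(line.strip() for line in porcelain_output.splitlines())

def tree_is_dirty(repo_dir="."):
    """Run git in repo_dir and report whether anything would need committing."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return has_changes(result.stdout)
```

A workflow step could then call `tree_is_dirty()` and commit the regenerated index only when it returns True.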

Oct 11 '22 17:10 kaby76

The last change seemed to fix the problem with indexing, and we have an updated "grammars.json" file from @parrt 's indexer. Let's see how this works in lab.antlr.org.

Oct 12 '22 09:10 kaby76

The generated file from _scripts/mkindex.py contains entries where grammars are named by the declared name in the pom.xml (or maybe in the grammarDecl itself?). There are multiple grammars with the same names, e.g., the grammar at grammars-v4/javascript/javascript/ and the grammar at grammars-v4/javascript/jsx/ are both "JavaScript". How does one distinguish between the two except by looking at content, e.g., the Antlr4 grammars?
Only about 60 grammars are listed in the generated index. I'm not sure why. There should be over 200.
The index doesn't sort the grammar entries by "name".

Oct 12 '22 11:10 kaby76

I updated my fork of the Antlr lab to read the grammars.json file. Looks good. https://github.com/kaby76/antlr4-lab/tree/add-grammars-v4

You can see it in action on this droplet while it's up. http://134.209.209.215/

Oct 12 '22 13:10 kaby76

If we want to add information to the repo for programming language classification, reference links, or any other information, we're going to need to determine where to add it.

Add it to the readme.md, but make it more standardized

We could add the information for a grammar to the readme. Right now, there is no standardized format of what to document.

In the existing pom.xml, per grammar

The pom.xml could contain the information in the generation of the index, but it would have to be added carefully within the file. I tried to add a <classification>....</classification> under /project, and it was rejected. I then tried to place it under the first /project/build/plugins/plugin, and it too was rejected. I finally placed it under the /project/build/plugins/plugin/configuration element, and it finally worked, presumably because the antlr4-maven-plugin and antlr4test-maven-plugin don't check for spurious elements.

Alternatively, we could create a new plugin that does little except provide a place to nest information on the grammar within the pom.xml.

In a separate xml file, per grammar

We could add a new file with structured information for indexing.

In a global xml file

We could add grammar information in one big file in the root of the repo.

Oct 13 '22 13:10 kaby76

Only about 60 grammars are listed in the generated index. I'm not sure why. There should be over 200.

I noticed that as well, but I had fairly strict constraints. If you take a look at the code, you'll notice that it tosses out anything without an example, I think, or where there are more than two grammars. One I saw had a lexer, parser, and "hints" grammar. I strip those out because the ANTLR lab won't be able to handle them. We probably need a flag on the indexer to generate something for the repository and something for the lab.
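
A sketch of what such a flag could look like, under the constraints described above (no example or more than two grammar files means the lab can't use it). The function name `keep_entry` and the `for_lab` parameter are hypothetical and not part of the existing script.

```python
def keep_entry(grammar_files, example_files, for_lab=True):
    """Decide whether a grammar directory belongs in the generated index.

    for_lab=True  -- strict: needs an example and at most two grammar files
    for_lab=False -- repository index: any directory with a grammar qualifies
    """
    if not grammar_files:
        return False
    if not for_lab:
        return True
    return bool(example_files) and len(grammar_files) <= 2

# A lexer + parser + "hints" layout is dropped for the lab but kept for the repo index.
three = ["XLexer.g4", "XParser.g4", "XHints.g4"]
lab_ok = keep_entry(three, ["a.txt"], for_lab=True)
repo_ok = keep_entry(three, ["a.txt"], for_lab=False)
```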

Oct 13 '22 16:10 parrt

There are multiple grammars with the same names, e.g., the grammar at grammars-v4/javascript/javascript/ and the grammar at grammars-v4/javascript/jsx/ are both "JavaScript".

rats. Would it be unique if we included the directories containing the grammar like javascript/JavaScript? Or, perhaps the grammar file names are unique?
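
A sketch of the directory-prefix idea, which also sorts the result by name. It assumes each entry carries a "dir" key with its path relative to the repo root; that key is hypothetical and does not appear in the JSON above. For the two JavaScript grammars mentioned, the full relative directory is needed, since both sit under javascript/.

```python
from collections import Counter

def disambiguate(entries):
    """Prefix clashing names with their directory, then sort by name."""
    counts = Counter(e["name"] for e in entries)
    out = []
    for e in entries:
        name = e["name"]
        if counts[name] > 1:
            # Only names that collide get the directory prefix.
            name = f"{e['dir']}/{name}"
        out.append({**e, "name": name})
    return sorted(out, key=lambda e: e["name"].lower())

entries = [
    {"name": "JavaScript", "dir": "javascript/jsx"},
    {"name": "abb", "dir": "abb"},
    {"name": "JavaScript", "dir": "javascript/javascript"},
]
names = [e["name"] for e in disambiguate(entries)]
```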

Oct 13 '22 16:10 parrt

I updated my fork of the Antlr lab to read the grammars.json file.

Seems like we'd want the grammars.json file in this repo, not the lab, right? In other words, the lab uses it but doesn't own it and doesn't have the code to generate it.

Oct 13 '22 16:10 parrt

The pom.xml could contain the information in the generation of the index

Seems to make sense to keep the classification or location within the ontology at the definition of the grammar and then we have a tool that pulls that information to create an index. We could also have multiple classifications or tags to create different kinds of indexes like Assembly code versus Data language versus high-level language etc...
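
To make the multiple-tags idea concrete, here is a sketch that groups index entries by tag. The "tags" key, the tag strings, and the "unclassified" fallback are assumptions for illustration, as is the helper name `group_by_tag`.

```python
from collections import defaultdict

def group_by_tag(entries):
    """Map each tag to the sorted names of the grammars carrying it."""
    groups = defaultdict(list)
    for e in entries:
        # Entries with no tags land in a catch-all bucket.
        for tag in e.get("tags") or ["unclassified"]:
            groups[tag].append(e["name"])
    return {tag: sorted(names) for tag, names in sorted(groups.items())}

tagged = [
    {"name": "Cql", "tags": ["data language"]},
    {"name": "abb", "tags": ["high-level language"]},
    {"name": "Lucene", "tags": ["data language"]},
    {"name": "DGS"},
]
index = group_by_tag(tagged)
```

Each group could then be rendered as its own section of a readme or as a separate index file.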

Oct 13 '22 16:10 parrt
