Add grammars-v4 dir indexer

Open parrt opened this issue 1 year ago • 23 comments

Hi. (I screwed up and pushed to master despite having a git branch command in my history. Weird.) Related to supporting lab.antlr.org and https://github.com/antlr/antlr4-lab/issues/11...

@teverett @kaby76 @KvanTTT please take a look at _scripts/mkindex.py, which generates a json list from a grammars-v4 path name:

$ python mkindex.py .. | jq
[
  {
    "name": "regex",
    "lexer": "https://raw.githubusercontent.com/antlr/grammars-v4/master/xsd-regex/regexLexer.g4",
    "parser": "https://raw.githubusercontent.com/antlr/grammars-v4/master/xsd-regex/regexParser.g4",
    "start": "root",
    "example": "example-chargroup-sub3.txt"
  },
  {
    "name": "abb",
    "lexer": "https://raw.githubusercontent.com/antlr/grammars-v4/master/abb/abbLexer.g4",
    "parser": "https://raw.githubusercontent.com/antlr/grammars-v4/master/abb/abbParser.g4",
    "start": "module",
    "example": "robdata.sys"
  },
  {
    "name": "DGS",
    "lexer": "https://raw.githubusercontent.com/antlr/grammars-v4/master/graphstream-dgs/DGSLexer.g4",
    "parser": "https://raw.githubusercontent.com/antlr/grammars-v4/master/graphstream-dgs/DGSParser.g4",
    "start": "dgs",
    "example": "removeAttribute.dgs"
  },
  {
    "name": "SwiftFin",
    "lexer": "https://raw.githubusercontent.com/antlr/grammars-v4/master/swift-fin/SwiftFinLexer.g4",
    "parser": "https://raw.githubusercontent.com/antlr/grammars-v4/master/swift-fin/SwiftFinParser.g4",
    "start": "messages",
    "example": "test1.txt"
  },
  {
    "name": "Lucene",
    "lexer": "https://raw.githubusercontent.com/antlr/grammars-v4/master/lucene/LuceneLexer.g4",
    "parser": "https://raw.githubusercontent.com/antlr/grammars-v4/master/lucene/LuceneParser.g4",
    "start": "topLevelQuery",
    "example": "boolean-3.txt"
  },
  {
    "name": "Cql",
    "lexer": "https://raw.githubusercontent.com/antlr/grammars-v4/master/cql3/CqlLexer.g4",
    "parser": "https://raw.githubusercontent.com/antlr/grammars-v4/master/cql3/CqlParser.g4",
    "start": "root",
    "example": "createIndex.cql"
  },
...

parrt avatar Oct 07 '22 21:10 parrt

Hey @parrt great :)

Your intention is to use this to generate a JSON file for the lab tool? Could we also use it to generate a human-readable index? Could that human-readable index be automatically included in the readme.md?

teverett avatar Oct 08 '22 01:10 teverett

Your intention is to use this to generate a JSON file for the lab tool?

yep! @kaby76 has a bash script now.

Could trivially generate an index for the readme! I gen a list of dictionaries with info per grammar. Can gen markdown instead of json easily.
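As a rough illustration of the markdown idea, here is a minimal sketch that renders entries shaped like the JSON above as a table. The function name `to_markdown` and the column layout are assumptions made for this example; they are not taken from mkindex.py.

```python
def to_markdown(entries):
    """Render index entries as a Markdown table, sorted by grammar name."""
    rows = [
        "| Grammar | Start rule | Lexer | Parser | Example |",
        "| --- | --- | --- | --- | --- |",
    ]
    for e in sorted(entries, key=lambda e: e["name"].lower()):
        rows.append(
            f"| {e['name']} | {e['start']} | [lexer]({e['lexer']}) "
            f"| [parser]({e['parser']}) | {e['example']} |"
        )
    return "\n".join(rows)


# Two entries copied from the sample output earlier in this issue.
BASE = "https://raw.githubusercontent.com/antlr/grammars-v4/master"
entries = [
    {
        "name": "regex",
        "lexer": f"{BASE}/xsd-regex/regexLexer.g4",
        "parser": f"{BASE}/xsd-regex/regexParser.g4",
        "start": "root",
        "example": "example-chargroup-sub3.txt",
    },
    {
        "name": "abb",
        "lexer": f"{BASE}/abb/abbLexer.g4",
        "parser": f"{BASE}/abb/abbParser.g4",
        "start": "module",
        "example": "robdata.sys",
    },
]

print(to_markdown(entries))
```

The same list of dictionaries feeds both outputs, so the JSON and the readme table would stay in step if they are generated in the same run.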

parrt avatar Oct 08 '22 01:10 parrt

It would be amazing to generate an index into the readme.md. That way we could take on the refactoring @KOLANICH has suggested and still enable people to find their favourite grammar from the main page. It would actually be a significant improvement over what we have today since, for example, there are assembler grammars under /asm which people may not realize are there.

teverett avatar Oct 08 '22 01:10 teverett

could a github action trigger a reindex too?

parrt avatar Oct 08 '22 02:10 parrt

well i've never done that. Perhaps.

teverett avatar Oct 08 '22 02:10 teverett

Yes, GitHub Actions could trigger a build of the index, but the result would need to be checked back into the tree without triggering a whole new build and commit, ad infinitum. I'll try to play around with this.

kaby76 avatar Oct 08 '22 12:10 kaby76

Honestly, I don't think such a description file is required at all. It's possible just to traverse the directories and discover the grammar files there. We have a single pattern for grammar files and the examples directory. Moreover, we have pom.xml files that also help (they contain info about the root rule, the examples directory, and so on).
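To make the pom-based approach concrete, here is a minimal sketch that pulls the root rule and examples directory out of a pom.xml. The element names `entryPoint` and `exampleFiles`, and the sample pom fragment, are assumptions for illustration; check them against the actual pom.xml files in the repo before relying on them.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Drop an XML namespace prefix such as '{http://maven.apache.org/POM/4.0.0}'."""
    return tag.rsplit("}", 1)[-1]


def read_pom_settings(pom_text, wanted=("entryPoint", "exampleFiles")):
    """Return the text of the first occurrence of each wanted element."""
    found = {}
    for elem in ET.fromstring(pom_text).iter():
        name = local(elem.tag)
        if name in wanted and name not in found:
            found[name] = (elem.text or "").strip()
    return found


# Hypothetical pom fragment; the element names are assumptions, not verified.
POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <build><plugins><plugin>
    <artifactId>antlr4test-maven-plugin</artifactId>
    <configuration>
      <entryPoint>root</entryPoint>
      <exampleFiles>examples/</exampleFiles>
    </configuration>
  </plugin></plugins></build>
</project>"""

settings = read_pom_settings(POM)
print(settings)
```

Matching on the local tag name sidesteps the Maven namespace, at the cost of not distinguishing identically named elements in different plugins.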

KvanTTT avatar Oct 08 '22 16:10 KvanTTT

@KvanTTT Well, I have an implementation that works off a generated grammars-v4.js file, which is equivalent to an index of the grammars-v4 repo in json format. If you can write an implementation of antlr4-lab that works over the wire, we can look at it and compare. Note, I was planning to make a separate implementation that gathers information over the internet from the poms, but I wanted to first follow through on this design.

kaby76 avatar Oct 08 '22 17:10 kaby76

I think we'll go with a generated index if we can. A casual user looking for a grammar is likely to make use of it, and perhaps less likely to click through directories searching.

@kaby76 full support for the GH Actions work you're doing.

teverett avatar Oct 08 '22 20:10 teverett

Plus I need something I can simply download from the antlr lab to get the files.

parrt avatar Oct 08 '22 20:10 parrt

I think I found the trick to do a check-in of the generated index file. See this PR. The current build doesn't kick it off because it's in a separate workflow (here). But, presumably once the workflow is added, it should work. I tested it on my own repo.

kaby76 avatar Oct 09 '22 23:10 kaby76

@kaby76 could it also generate markdown?

teverett avatar Oct 10 '22 02:10 teverett

@teverett Yes, we can add a script to build some markdown.

kaby76 avatar Oct 10 '22 09:10 kaby76

The indexer isn't quite working yet. Something's wrong with the workflow, even though it works on a semi-duplicated "grammars-v4" repo over in my github.com account (https://github.com/kaby76/temp-with-actions). https://github.com/antlr/grammars-v4/pull/2881

kaby76 avatar Oct 11 '22 13:10 kaby76

"git diff --quiet" returns 0 even if there are untracked files. Git bites me again.

kaby76 avatar Oct 11 '22 17:10 kaby76
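On the "git diff --quiet" point above: git diff only compares tracked content, so a freshly generated file that has never been added is invisible to it. One alternative is `git status --porcelain`, which lists untracked files with a `??` prefix. The sketch below is an illustration of that check, not necessarily the fix that went into the workflow; the function names are made up for this example.

```python
import subprocess


def has_pending_changes(porcelain_output):
    """True if `git status --porcelain` output lists any path, untracked ('??') ones included."""
    return any(line.strip() for line in porcelain_output.splitlines())


def worktree_is_dirty(repo_dir="."):
    """Run git in repo_dir and report modified, staged, or untracked files."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return has_pending_changes(result.stdout)
```

A workflow step could call `worktree_is_dirty()` and commit the regenerated index only when it returns True.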

The last change seemed to fix the problem with indexing, and we have an updated "grammars.json" file from @parrt 's indexer. Let's see how this works in lab.antlr.org.

kaby76 avatar Oct 12 '22 09:10 kaby76

  • The generated file from _scripts/mkindex.py contains entries where grammars are named by the declared name in the pom.xml (or maybe in the grammarDecl itself?). There are multiple grammars with the same names, e.g., the grammar at grammars-v4/javascript/javascript/ and the grammar at grammars-v4/javascript/jsx/ are both "JavaScript". How does one distinguish between the two except by looking at content, e.g., the Antlr4 grammars?
  • Only about 60 grammars are listed in the generated index. I'm not sure why. There should be over 200.
  • The index doesn't sort the grammar entries by "name".
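The first and third points can be checked mechanically. This is a minimal sketch, assuming the index is a list of dictionaries with a "name" key as in the sample output; the helper names are hypothetical.

```python
from collections import Counter


def duplicate_names(entries):
    """Return the names that occur more than once, in sorted order."""
    counts = Counter(e["name"] for e in entries)
    return sorted(name for name, n in counts.items() if n > 1)


def sort_by_name(entries):
    """Sort entries by name, ignoring case."""
    return sorted(entries, key=lambda e: e["name"].lower())


# Trimmed-down entries; only the "name" key matters here.
index = [
    {"name": "regex"},
    {"name": "abb"},
    {"name": "JavaScript"},
    {"name": "JavaScript"},
]

print(duplicate_names(index))
print([e["name"] for e in sort_by_name(index)])
```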

kaby76 avatar Oct 12 '22 11:10 kaby76

I updated my fork of the Antlr lab to read the grammars.json file. Looks good. https://github.com/kaby76/antlr4-lab/tree/add-grammars-v4

You can see it in action on this droplet while it's up. http://134.209.209.215/

kaby76 avatar Oct 12 '22 13:10 kaby76

If we want to add information to the repo for programming language classification, reference links, or any other information, we're going to need to determine where to add it.

Add it to the readme.md, but make it more standardized

We could add the information for a grammar to the readme. Right now, there is no standardized format of what to document.

In the existing pom.xml, per grammar

The pom.xml could contain the information in the generation of the index, but it would have to be added carefully within the file. I tried to add a <classification>....</classification> under /project, and it was rejected. I then tried to place it under the first /project/build/plugins/plugin, and it too was rejected. I finally placed it under the /project/build/plugins/plugin/configuration element and it worked, presumably because the antlr4-maven-plugin and antlr4test-maven-plugin don't check for spurious elements.

Alternatively, we could create a new plugin that does little except provide a place to nest information on the grammar within the pom.xml.

In a separate xml file, per grammar

We could add a new file with structured information for indexing.

In a global xml file

We could add grammar information in one big file in the root of the repo.

kaby76 avatar Oct 13 '22 13:10 kaby76

  • Only about 60 grammars are listed in the generated index. I'm not sure why. There should be over 200.

I noticed that as well, but I had fairly strict constraints. If you take a look at the code, you'll notice that it tosses out anything without an example, I think, or where there are more than two grammars. One I saw had a lexer, a parser, and a "hints" grammar. I strip those out because the ANTLR lab won't be able to handle them. We probably need a flag on the indexer to generate one thing for the repository and another for the lab.
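A minimal sketch of what such a flag could look like, assuming each candidate carries the list of grammar files and example files found in its directory. The keys "grammars" and "examples", the `for_lab` parameter, and the sample data are all hypothetical; the real mkindex.py may organize this differently.

```python
def select_entries(candidates, for_lab=True):
    """Pick which candidate grammars go into the index.

    for_lab=True applies the strict constraints described above: at least
    one example and no more than two grammar files. for_lab=False keeps
    everything, for a repository-wide index.
    """
    if not for_lab:
        return list(candidates)
    return [c for c in candidates if c["examples"] and len(c["grammars"]) <= 2]


# Made-up candidates covering the three cases mentioned above.
candidates = [
    {"name": "ok", "grammars": ["okLexer.g4", "okParser.g4"], "examples": ["a.txt"]},
    {"name": "no-example", "grammars": ["x.g4"], "examples": []},
    {"name": "three-grammars", "grammars": ["L.g4", "P.g4", "Hints.g4"], "examples": ["b.txt"]},
]

print([c["name"] for c in select_entries(candidates)])
print([c["name"] for c in select_entries(candidates, for_lab=False)])
```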

parrt avatar Oct 13 '22 16:10 parrt

There are multiple grammars with the same names, e.g., the grammar at grammars-v4/javascript/javascript/ and the grammar at grammars-v4/javascript/jsx/ are both "JavaScript".

rats. Would it be unique if we included the directories containing the grammar like javascript/JavaScript? Or, perhaps the grammar file names are unique?
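One way to test the directory idea is to derive the directory from the parser URL already stored in each entry. This sketch assumes URLs laid out like those in the sample output, with the branch name followed by the grammar's directory; `qualified_name` is a hypothetical helper, and the two JavaScript file names are made up for illustration.

```python
from urllib.parse import urlparse


def qualified_name(entry, branch="master"):
    """Prefix the grammar name with its directory path inside the repo."""
    parts = urlparse(entry["parser"]).path.strip("/").split("/")
    # The path looks like antlr/grammars-v4/<branch>/<dir...>/<file>.g4
    directory = "/".join(parts[parts.index(branch) + 1 : -1])
    return f"{directory}/{entry['name']}"


BASE = "https://raw.githubusercontent.com/antlr/grammars-v4/master"
clashing = [
    # File names below are illustrative, not checked against the repo.
    {"name": "JavaScript", "parser": f"{BASE}/javascript/javascript/JavaScriptParser.g4"},
    {"name": "JavaScript", "parser": f"{BASE}/javascript/jsx/JavaScriptParser.g4"},
]

keys = [qualified_name(e) for e in clashing]
print(keys)
```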

parrt avatar Oct 13 '22 16:10 parrt

I updated my fork of the Antlr lab to read the grammars.json file.

Seems like we'd want the grammars.json file in this repo, not the lab, right? In other words, the lab uses it but doesn't own it and doesn't have the code to generate it.

parrt avatar Oct 13 '22 16:10 parrt

The pom.xml could contain the information in the generation of the index

Seems to make sense to keep the classification or location within the ontology at the definition of the grammar and then we have a tool that pulls that information to create an index. We could also have multiple classifications or tags to create different kinds of indexes like Assembly code versus Data language versus high-level language etc...
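As a sketch of how multiple tags could drive different indexes, the snippet below groups grammar names under every tag they carry. The "tags" key and the tag values are invented for this example; nothing in the repo defines them yet.

```python
from collections import defaultdict


def index_by_tag(entries):
    """Group grammar names under each of their tags; untagged grammars get their own bucket."""
    groups = defaultdict(list)
    for e in entries:
        for tag in e.get("tags") or ["untagged"]:
            groups[tag].append(e["name"])
    return {tag: sorted(names) for tag, names in sorted(groups.items())}


# Illustrative tags only; the classifications are made up.
tagged = [
    {"name": "Cql", "tags": ["data"]},
    {"name": "JavaScript", "tags": ["high-level"]},
    {"name": "abb", "tags": []},
]

by_tag = index_by_tag(tagged)
print(by_tag)
```

Because a grammar can appear under several tags, one set of per-grammar metadata could feed an assembly index, a data-language index, and so on.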

parrt avatar Oct 13 '22 16:10 parrt