grammars-v4
[fortran] Add latest Fortran grammar, addressing issues in older versions
(This initial comment has been updated in order to clarify the topic. Replies to this comment may not make sense because description has changed.)
The grammars for Fortran need to be updated.
There are two Fortran grammars in this repository: fortran77 and fortran90. Unfortunately, while we could fix the old grammars, which might be of value as problems in one grammar often appear in other versions, there have been several major releases to Fortran since the 1990 Spec by the WG5.
In addition, it is not clear whether the fortran77 and fortran90 grammars are derived from the published spec or from the draft just prior to publication, and what those differences may be.
The ISO charges around $200 per copy of the downloadable PDF, and I doubt that anyone is really working directly from the published specs. The ISO specs are copyrighted material. Each PDF is licensed for use by a specific individual or institution, with "watermarks" identifying the licensee. If we plan on maintaining grammars for Fortran (or any of the other ISO programming language standards, for that matter), we will need to figure out how to get regular access.
Entering in CFGs by hand from the specs is error prone, untrustworthy, and not repeatable. The procedure should be automated. The Trash Toolkit should be used to automatically scrape and refactor the CFG from any version of the Fortran spec, and added to the directory https://github.com/antlr/grammars-v4/tree/6581f29d0cb63e3e337cd6dacec6602e34aa88d3/fortran. Any bugs in the grammar can be fixed, but these changes should be addressed in the scrape/refactor procedure as well.
I am learning.
useStmt
: USE NAME
| USE NAME COMMA ONLY COLON
| USE NAME COMMA renameList
| USE NAME COMMA ONLY COLON onlyList
| USE COMMA INTRINSIC DOUBLECOLON NAME
;
though I appreciate your idea about moving to separate grammars to match the spec.
I re-added tritext to Trash and added the -m option, which adds <i>...</i> and <b>...</b> tagging around grammar rules where that formatting appears in the Fortran spec. This way one can scrape the grammar from a spec and determine whether a symbol in the grammar is a non-terminal, a terminal, or grammar punctuation, e.g., "is", where the "is" is boldface so as not to mistake it for anything but the LHS/RHS rule separator.
I now have a scraper implementation that extracts all the rules from the official spec ISO/IEC 1539-1:2023. Note, the spec costs ~$200, which I purchased. It seems some rules are missing from draft versions of the spec.
The scraper is done in two steps.
The first step runs tritext on the .pdf to extract all of its text. The tool reproduces the text indentation seen in the spec. This is critical because the spec does not have rule terminators like the Antlr4 ';'. To distinguish the continuation lines of a rule from non-rule text or from the next rule, any lines that are part of the rule are indented, as in the spec.
So, instead of:
R601 alphanumeric-character is letter
or digit
or underscore
Except for the currency symbol, the graphics used for the characters shall be as given in 6.1.2, 6.1.3, 6.1.4, and
6.1.5. However, the style of any graphic is not specified.
...
the tool outputs:
R601 alphanumeric-character is letter
                            or digit
                            or underscore
Except for the currency symbol, the graphics used for the characters shall be as given in 6.1.2, 6.1.3, 6.1.4, and
6.1.5. However, the style of any graphic is not specified.
...
We can now "see" the end of the rule with the text "Except for the ...." because the text is flush with the left-hand margin.
The second step is a program to pull out the rules from this extracted text of the spec. The code for that program is:
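using System;
using System.IO;
using System.Text.RegularExpressions;

public class Program
{
    static void Main(string[] args)
    {
        string line;
        // A rule starts flush left with "R" followed by a digit, e.g. "R601".
        Regex rs = new Regex(@"^R\d");
        // Continuation lines of a rule start with a space.
        Regex re = new Regex(@"^[ ]");
        // True while we are inside a rule.
        bool do_print = false;
        while ((line = Console.ReadLine()) != null)
        {
            if (rs.IsMatch(line))
            {
                do_print = true;
                Console.WriteLine(line);
            }
            else if (do_print && re.IsMatch(line))
            {
                Console.WriteLine(line);
            }
            else
            {
                // Flush-left non-rule text ends the current rule.
                do_print = false;
            }
        }
    }
}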
I tried to use sed and/or awk, but I found the patterns too difficult to write, and ended up just coding it in C#.
The script to extract the rules is thus:
tritext ISO_IEC_1539-1_2023\(en\).pdf | ConsoleApp.exe
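For readers without a .NET toolchain, the same filter can be sketched in Python. This is only an illustrative equivalent of the C# program above, not the tool actually used; the function name `extract_rules` and the sample lines (including the made-up rule R999) are assumptions for the example.

```python
import re

RULE_START = re.compile(r"^R\d")    # rule header: "R" plus a digit, flush left
CONTINUATION = re.compile(r"^[ ]")  # continuation line: starts with a space

def extract_rules(lines):
    """Return rule header lines and the indented lines that follow them."""
    kept = []
    in_rule = False
    for line in lines:
        if RULE_START.match(line):
            in_rule = True
            kept.append(line)
        elif in_rule and CONTINUATION.match(line):
            kept.append(line)
        else:
            # Flush-left prose (or a blank line) ends the current rule.
            in_rule = False
    return kept

# Hypothetical sample mimicking the indented tritext output.
sample = [
    "R601 alphanumeric-character is letter",
    "                            or digit",
    "                            or underscore",
    "Except for the currency symbol, the graphics used ...",
    "    indented prose that does not belong to any rule",
    "R999 example-rule is something",
]
rules = extract_rules(sample)
for r in rules:
    print(r)
```

Indented text that follows flush-left prose is dropped, because the filter only accepts indented lines while it is inside a rule.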
As the spec is copyrighted, I cannot post the extracted grammar here as is. The plan is to use Trash to mutate the syntax and transform the grammar into Antlr4 syntax, which can be posted.
I have spent a great deal of time trying to use ChatGPT to extract the rules from the text. ChatGPT can do it, but it is extremely slow, and requires constant prompting to "continue, please" to get to the end of the entire spec. Copilot does not work at all because quotes and slashes interfere with the prompt questioning. However, LLMs are ideal for scraping because they solve "feature extraction". I have not pursued this solution further.
Thank you @kaby76. Extraordinary effort stymied by the Fortran vendors and ISO-publication revenue stream.
Is there a 'support my efforts' link for you that the community could help with the costs of the grammar?
@kaby76 @skelter count me in, I'll send a couple dollars.
Additional conversations related to this issue.
- https://stackoverflow.com/questions/78596591/lexing-issue-in-antlr4-grammar-for-fortran-2018-token-misclassification#comment138585588_78596591
- https://github.com/antlr/antlr4/issues/4640
Hi everyone,
I wanted to share that my college has access to the latest Fortran specification: ISO/IEC 1539-1 Fifth edition 2023-11. I've managed to extract the rules in a format suitable for Antlr4 from the specification. Currently, I'm working on the grammar in a step-by-step process, adding rules incrementally and testing them. I'm about one-third of the way through the code.
As part of this effort, I'm also creating test Fortran files to verify the grammar implementation. If I encounter any issues during this process, I'll post them on Stack Overflow and GitHub with a minimal working example for discussion and resolution.
Based on my progress, I anticipate that the complete grammar might be available in a few weeks, probably within a month.
It looks like the last draft spec that is available prior to the published 2018 version is https://j3-fortran.org/doc/year/18/18-007r1.pdf. That doc contains section and line numbering in the left margin, which presumably was used by the WG5 to help identify where to make corrections to a draft. tritext will need to be updated to remove this junk. Update June 6 '24: I added an option to the tritext pdf reader to filter these out. Again, it would be better to do all this using an LLM, as "Feature Extraction" is exactly what LLMs excel at. I have not released the latest version of the Trash Toolkit (v0.23.1) because of regressions in trgen.
Older specs like the final draft spec for Fortran 1990, https://wg5-fortran.org/N001-N1100/N692.pdf, don't contain any text, so the PDF will need to be OCR'ed first in order to extract text.
Attached here is the rule extraction for Fortran 2018 from the last available draft at https://j3-fortran.org/doc/year/18/18-007r1.pdf using the tritext (latest not available yet) and the above program.
This text is exactly character for character all the "R..." rules in the PDF, along with HTML markup for bold and italics.
This is successfully parsed using a custom grammar for "WG5 EBNF", which I wrote.
I'm now in a position to refactor this to Antlr4, all in a repeatable, automated manner.
I have developed the Fortran 2023 Grammar based on the latest Fortran specification: ISO/IEC 1539-1 Fifth edition 2023-11. You can access the grammar in my Github: Fortran2023Grammar. Please let me know if you find any errors.
Thanks, I will look it over.
You should rename your grammar files with the ".g4" extension. The file extension is extensively assumed in scripts, Github, etc.
On the issue of statement labels, I would adjust any parser rules to recognize the label. This is because it will be easier to write an XPath expression for both the defining occurrence of a label (i.e., the label that occurs before any statement) and the applied occurrence of a label (e.g., in "go to 100"). In addition, you won't need to change parser rule label into a special token called LABEL.
I encountered some grammar issues while parsing:
- Is the `NAME` rule correctly placed in the lexer, or should it be written as a parser rule, like `name: LETTER (ALPHANUMERICCHARACTER)*`? It is causing problems with some rules, like `R865 letter-spec -> letter [- letter]`, as every letter is tokenised as `NAME`.
- Placing the `DIGITSTRING` rule before the `DIGIT` rule in the lexer causes all single-digit numbers to be tokenized as `DIGITSTRING` instead of `DIGIT`. This creates an issue for the label rule: `R611 label -> digit [digit [digit [digit [digit]]]]`
The grammar could not parse the following correctly:
- `101 format ( F9.2 )` as format-stmt, even after adding the label, because `F9` is tokenized as `NAME` instead of `F` and `DIGITSTRING`.
- `a-b`, as it gets tokenized as `LETTERSPEC` instead of `NAME`, `MINUS`, `LETTER`.
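The symptoms above follow from how an Antlr4 lexer chooses a token: the rule with the longest match wins, and ties go to the rule defined first. The Python sketch below imitates only that selection policy. The rule set `RULES` and its patterns are simplified assumptions for illustration and are not copied from Fortran2023Lexer.g4.

```python
import re

def tokenize(text, rules):
    """Longest match wins; on equal length the earlier rule wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the first-listed rule when lengths tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Hypothetical, simplified lexer rules in definition order.
RULES = [
    ("LETTERSPEC", r"[A-Za-z]-[A-Za-z]"),
    ("NAME", r"[A-Za-z][A-Za-z0-9_]*"),
    ("DIGITSTRING", r"[0-9]+"),
    ("DIGIT", r"[0-9]"),
    ("LETTER", r"[A-Za-z]"),
    ("MINUS", r"-"),
    ("DOT", r"\."),
]

print(tokenize("F9.2", RULES))  # F9 is swallowed by NAME
print(tokenize("a-b", RULES))   # the whole input becomes LETTERSPEC
print(tokenize("7", RULES))     # tie between DIGITSTRING and DIGIT
```

Under this policy `DIGIT` and `LETTER` can never be produced on their own, which is one reason context-dependent pieces such as labels or letter-specs are often easier to express as parser rules over more general tokens.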
@kaby76 Could you include the grammar in grammars-v4? There are minor issues, but they can be resolved with everyone's involvement.
Status: I haven't forgotten this; I have been working on scraping. There is quite a bit to do still. https://github.com/kaby76/fortran
I have developed the Fortran 2023 Grammar based on the latest Fortran specification: ISO/IEC 1539-1 Fifth edition 2023-11. You can access the grammar in my Github: Fortran2023Grammar. Please let me know if you find any errors.
Can you please add this grammar to grammars-v4?
Sorry, but I am not ready.
The script that I wrote for scraping is producing some rules that differ from what you translated from the spec. I can't give you details because I didn't write down these rules. But, they seem significant. For example, for typeAttrSpecList: typeParamAttrSpec (COMMA typeParamAttrSpec)*; (https://github.com/AkhilAkkapelli/Fortran2023Grammar/blob/553123a023f70e9a524e2a4036be128978834c42/Fortran2023Parser.g4#L811), you use typeParamAttrSpec instead of typeAttrSpec as in the spec, and dropped the defining rule R728 type-attr-spec altogether. And, clearly, type-attr-spec and type-param-attr-spec recognize very different strings.
The order of the parser rules in your grammar is not what was in the spec, which makes comparison more difficult. I have a grammar comparison script, but it is on an older version of Trash. Your grammar includes useless parentheses in many rules, which have to be removed in order to compare your grammar with the one I am generating.
The scraper adds in rules that you defined in your lexer grammar.
But, I'm not a fan of some naming conventions you chose (camel case vs. snake case vs. uppercase), and the formatting isn't according to the coding standard for this repo.
The plan is to attribute both you and me on this, but I really want this to be auto-generated so we can add in all the previous versions as well.
I noticed a couple of Antlr tool problems with the generated grammar (e.g., there is a rule O: 'O';, but it is not declared in the generated .tokens file, which likely means a bug in the Antlr tool).
So, I decided to go back to your grammar and check what the Antlr tool does with that. The grammar doesn't quite build.
- The file name and grammar decl name have to agree.
$ antlr4 -Dlanguage=CSharp Fortran2023Lexer.g4 Fortran2023Parser.g4
error(8): Fortran2023Lexer.g4:1:14: grammar name FortranF23Lexer and file name Fortran2023Lexer.g4 differ
Fixing that, I see that the Antlr tool still doesn't add the string literal 'O'=somenumber. I can work around this by doing a fold transformation on the string literal in the parser grammar. But, it should be reported in https://github.com/antlr/antlr4.
But, there are closures taken on rules that can derive empty.
$ antlr4 -Dlanguage=CSharp Fortran2023Parser.g4
warning(154): Fortran2023Parser.g4:1016:0: rule derivedTypeDef contains an optional block with at least one alternative that can match an empty string
warning(154): Fortran2023Parser.g4:1054:0: rule interfaceBody contains an optional block with at least one alternative that can match an empty string
warning(154): Fortran2023Parser.g4:1054:0: rule interfaceBody contains an optional block with at least one alternative that can match an empty string
warning(154): Fortran2023Parser.g4:2163:0: rule blockConstruct contains an optional block with at least one alternative that can match an empty string
warning(154): Fortran2023Parser.g4:2415:0: rule mainProgram contains an optional block with at least one alternative that can match an empty string
warning(154): Fortran2023Parser.g4:2447:0: rule functionSubprogram contains an optional block with at least one alternative that can match an empty string
warning(154): Fortran2023Parser.g4:2465:0: rule subroutineSubprogram contains an optional block with at least one alternative that can match an empty string
warning(154): Fortran2023Parser.g4:2485:0: rule separateModuleSubprogram contains an optional block with at least one alternative that can match an empty string
warning(154): Fortran2023Parser.g4:2502:0: rule module contains an optional block with at least one alternative that can match an empty string
warning(154): Fortran2023Parser.g4:2523:0: rule submodule contains an optional block with at least one alternative that can match an empty string
warning(154): Fortran2023Parser.g4:2535:0: rule blockData contains an optional block with at least one alternative that can match an empty string
These are "warnings" according to the Antlr tool, but they really should be considered errors as they impact performance.
I am getting closer. I now have to correct the warnings, and fix mutual left-recursion.
$ rm -f *.tokens *.interp; !an
rm -f *.tokens *.interp; antlr4 -Dlanguage=CSharp FortranLexer.g4 FortranParser.g4
warning(154): FortranParser.g4:30:1: rule main_program contains an optional block with at least one alternative that can match an empty string
warning(154): FortranParser.g4:37:1: rule function_subprogram contains an optional block with at least one alternative that can match an empty string
warning(154): FortranParser.g4:42:1: rule subroutine_subprogram contains an optional block with at least one alternative that can match an empty string
warning(154): FortranParser.g4:47:1: rule module contains an optional block with at least one alternative that can match an empty string
warning(154): FortranParser.g4:51:1: rule submodule contains an optional block with at least one alternative that can match an empty string
warning(154): FortranParser.g4:55:1: rule block_data contains an optional block with at least one alternative that can match an empty string
warning(154): FortranParser.g4:97:1: rule separate_module_subprogram contains an optional block with at least one alternative that can match an empty string
warning(154): FortranParser.g4:293:1: rule derived_type_def contains an optional block with at least one alternative that can match an empty string
warning(154): FortranParser.g4:717:1: rule block_construct contains an optional block with at least one alternative that can match an empty string
warning(154): FortranParser.g4:1090:1: rule interface_body contains an optional block with at least one alternative that can match an empty string
warning(154): FortranParser.g4:1090:1: rule interface_body contains an optional block with at least one alternative that can match an empty string
error(119): FortranParser.g4::: The following sets of rules are mutually left-recursive [designator, complex_part_designator, array_section] and [variable, scalar_variable, proc_component_ref, procedure_designator, function_reference] and [expr] and [level_5_expr] and [equiv_operand] and [or_operand] and [level_3_expr] and [level_2_expr] and [add_operand]
07/19-08:43:50 ~/fortran
I've fixed the script to replace the rules with indirect left recursion, added steps to remove useless parentheses and reformat. The generated grammar works surprisingly well, parsing all the fortran90 examples. I will need to scrounge up a good testsuite from the internet. And I will need to add in as comments the "R" number of the rule and original rule from the spec.
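The mutually left-recursive sets in the error message can also be found outside the Antlr tool with a reachability check over leftmost symbols. The sketch below is a minimal illustration under two assumptions: the grammar is given as a plain dictionary, and no symbol is nullable. The rule shapes in `toy` are simplified guesses for demonstration, not the scraped rules, and `left_recursive_rules` is a hypothetical helper, not part of Trash.

```python
def left_recursive_rules(grammar):
    """Return rules that can reach themselves via leftmost nonterminals.

    grammar maps a rule name to a list of alternatives, each a list of
    symbols. Nullable symbols are not handled in this sketch.
    """
    # Edge A -> B when some alternative of A begins with nonterminal B.
    edges = {
        rule: {alt[0] for alt in alts if alt and alt[0] in grammar}
        for rule, alts in grammar.items()
    }
    found = set()
    for start in grammar:
        seen, stack = set(), list(edges[start])
        while stack:
            rule = stack.pop()
            if rule == start:
                found.add(start)
                break
            if rule not in seen:
                seen.add(rule)
                stack.extend(edges[rule])
    return found

# Simplified, hypothetical rule shapes.
toy = {
    "designator": [["NAME"], ["array_section"], ["complex_part_designator"]],
    "array_section": [["complex_part_designator"]],
    "complex_part_designator": [["designator", "PERCENT", "RE"]],
    "expr": [["expr", "PLUS", "term"], ["term"]],
    "term": [["NAME"]],
}
print(sorted(left_recursive_rules(toy)))
```

The check does not distinguish direct left recursion (as in `expr`), which Antlr4 can rewrite itself, from indirect cycles (as in `designator`), which have to be refactored by hand or by script.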
Unfortunately, upon further examination, the auto-scraped grammar is not only very, very slow, but also fails to parse quite a few files in the examples/ directory. The situation is similar for Fortran2023Grammar, which I altered slightly to include a proper EOF-terminated start rule, and whose .g4 files I renamed so that the Antlr4 Tool would process them. The Antlr4 Tool requires the name of the file to be identical to the name of the grammar, which was not the case. Without an EOF-terminated start rule, which is missing in Fortran2023Grammar, Antlr will stop the parse at an error, back up to a point before the error, and report success. See https://groups.google.com/g/antlr-discussion/c/p6b4Sc70X5s/m/E4B7rD95AAAJ.
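Why a missing EOF hides errors can be shown with a toy parser. This is a deliberately tiny model of the effect described above, not of the Antlr runtime: the "grammar" is just a sequence of `stmt` tokens, and `parse_statements` is a hypothetical function written for this example.

```python
def parse_statements(tokens, require_eof):
    """Toy parser for 'stmt*' or, with require_eof, 'stmt* EOF'.

    Returns the number of tokens consumed.
    """
    pos = 0
    while pos < len(tokens) and tokens[pos] == "stmt":
        pos += 1
    if require_eof and pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r} at index {pos}")
    return pos

tokens = ["stmt", "stmt", "oops", "stmt"]

# Without EOF the parser matches a valid prefix and reports success.
consumed = parse_statements(tokens, require_eof=False)
print(f"consumed {consumed} of {len(tokens)} tokens, no error reported")

# With EOF the leftover input is an error.
try:
    parse_statements(tokens, require_eof=True)
except SyntaxError as err:
    print("with EOF:", err)
```

When counting passing and failing files, every grammar therefore needs a start rule that ends in EOF, otherwise a file with a syntax error midway can be counted as parsed.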
The fortran90 grammar parses all 303 .f90 files in about 7s. The grammar at https://github.com/AkhilAkkapelli/Fortran2023Grammar (corrected and with tests added at https://github.com/kaby76/Fortran2023Grammar) parses 243 files, and fails on 60 files. The auto-scraped grammar at https://github.com/kaby76/fortran parses 237, and fails on 66 files.
To compare the speed, all tests that failed to parse need to be removed from performance testing because error recovery in Antlr4 is terrible. Of the 237 files that all three grammars pass, the runtimes are:
| fortran90 | Fortran2023Grammar | auto-scraped |
|---|---|---|
| 3s | 62s | 523s |
(Raw data: f90.txt a.txt k.txt files.txt )
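To put the table in relative terms, the totals can be converted to per-file averages and slowdown factors against fortran90. The snippet below only does arithmetic on the three numbers and the 237-file count reported above; the variable names are made up for the example.

```python
# Total runtimes in seconds over the 237 files all three grammars parse.
runtimes_s = {"fortran90": 3, "Fortran2023Grammar": 62, "auto-scraped": 523}
files = 237

baseline = runtimes_s["fortran90"]
per_file = {name: t / files for name, t in runtimes_s.items()}
slowdown = {name: t / baseline for name, t in runtimes_s.items()}

for name in runtimes_s:
    print(f"{name}: {per_file[name]:.3f} s/file, {slowdown[name]:.1f}x fortran90")
```

Because the totals are rounded to whole seconds, the factors are rough: on the order of 20x for Fortran2023Grammar and 170x for the auto-scraped grammar.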
Neither the auto-scraped grammar nor Fortran2023Grammar is ready to be added to the repo.
(What is really interesting is the disparity in performance between the auto-scraped grammar and Fortran2023Grammar. I suspect this is caused by ambiguity or fallbacks to full stack context because of some deleted rules in Fortran2023Grammar. I am still working on porting a few of the old tools in Trash for ungroup grammar refactorings and grammar comparison.)
I wrote the grammar compare script. It wasn't hard to do.
The script uses Trash.
#!/bin/sh
# Sort parser rules and reformat. Assumes useless parentheses already
# removed.
cp ../FortranParser.g4 .
trparse -t ANTLRv4 FortranParser.g4 | trsort | trsponge -c
antlr-format -c ../../repo_coding_style.json FortranParser.g4
# Remove useless parentheses, rename to Snake case, delete comments,
# and reformat.
cp ../../Fortran2023Grammar/Fortran2023Parser.g4 .
trparse -t ANTLRv4 FortranParser.g4 | \
trquery grep ' //parserRuleSpec/RULE_REF' | trtext | sort -u > after.txt
sed 's/_\([a-z]\)/\U\1/g' after.txt > before.txt
paste -d "," before.txt after.txt > rename.txt
bash /c/Users/Kenne/Documents/Github/g4-scripts/delete-useless-parentheses.sh Fortran2023Parser.g4
trparse -t ANTLRv4 Fortran2023Parser.g4 | \
trrename -R rename.txt | \
trquery delete ' //(@DOC_COMMENT | @BLOCK_COMMENT | @LINE_COMMENT)' | \
trsponge -c
antlr-format -c ../../repo_coding_style.json Fortran2023Parser.g4
trparse -t ANTLRv4 Fortran2023Parser.g4 | trsort | trsponge -c
antlr-format -c ../../repo_coding_style.json Fortran2023Parser.g4
diff FortranParser.g4 Fortran2023Parser.g4
Basically, the script converts both the auto-scraped grammar and Fortran2023Grammar to a "normal form", then compares them with the command-line diff tool. For Fortran2023Grammar, the parser symbols are renamed to snake case. The script uses:
- delete-useless-parentheses.sh, which is a Trash script to delete useless parentheses.
- trsort, which I just re-added in Trash.
- antlr-format to put the grammar in the repo's coding standard format.
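The rename table in the script is built by taking the snake-case rule names, deriving a camel-case name for each with sed, and pasting the two columns together. A rough Python equivalent of that one step is sketched below; `snake_to_camel` and `rename_pairs` are hypothetical names, and the sample rule names are only examples.

```python
import re

def snake_to_camel(name):
    """Mirror sed 's/_\\([a-z]\\)/\\U\\1/g': drop '_' and uppercase the next letter."""
    return re.sub(r"_([a-z])", lambda m: m.group(1).upper(), name)

def rename_pairs(snake_names):
    """Build 'before,after' lines like rename.txt (camel case, snake case)."""
    return [f"{snake_to_camel(n)},{n}" for n in sorted(set(snake_names))]

names = ["main_program", "expr", "derived_type_def", "level_5_expr"]
for line in rename_pairs(names):
    print(line)
```

An underscore followed by a digit is left alone by this pattern, so `level_5_expr` maps to `level_5Expr`. Whether that matches the corresponding name in Fortran2023Parser.g4 is not checked here, and a mismatch would leave that rule unrenamed in the diff.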
It turns out I made a typo in one of the refactorings. And, it looks like the rule R933 was misread from the Spec in @AkhilAkkapelli 's Fortran2023Parser.g4. The rule should have been:
R933 allocation is allocate-object [ ( allocate-shape-spec-list ) ] [ lbracket allocate-coarray-spec rbracket ]
or ( [ lower-bounds-expr : ] upper-bounds-expr ) [ lbracket allocate-coarray-spec rbracket ]
Consequently, the parser rule in the Antlr grammar was missing an LPAREN/RPAREN in the second alt. But, this is exactly why grammars should be auto-scraped, not hand-scraped. procComponentRef does not look correct either. There are others.
Correcting the diffs fixes the really bad performance of auto-scraped-parser. It now runs in the same time as @AkhilAkkapelli 's grammar.
It looks like the main reason why the spec grammar is so slow is because of full context fallbacks for primary. Here is the performance trace for parsing intrinsic_pack.f90, which takes an outrageous 21s. cover.html.txt perf.txt. There is no ambiguity, which is great. But, solving full context fallbacks is harder to fix, usually requiring rule duplication and substitution to separate the conflicting rule transitions.
I added a tool to Trash to clone and rename certain instances of a parser rule symbol on the right-hand side of a rule. In some grammars, like the example out of the ALL(*) tech report paper, the tool can optimize the grammar to remove fallback to full parser context.
I then applied it to a half dozen occurrences of expr in the FortranParser.g4 grammar.
trparse -t ANTLRv4 FortranParser.g4 | trclonereplace " //parserRuleSpec[RULE_REF/text()='int_expr']//RULE_REF[text()='expr']" -s "_1" | trsponge -c
trparse -t ANTLRv4 FortranParser.g4 | trclonereplace " //parserRuleSpec[RULE_REF/text()='numeric_expr']//RULE_REF[text()='expr']" -s "_2" | trsponge -c
trparse -t ANTLRv4 FortranParser.g4 | trclonereplace " //parserRuleSpec[RULE_REF/text()='constant_expr']//RULE_REF[text()='expr']" -s "_3" | trsponge -c
trparse -t ANTLRv4 FortranParser.g4 | trclonereplace " //parserRuleSpec[RULE_REF/text()='logical_expr']//RULE_REF[text()='expr']" -s "_4" | trsponge -c
trparse -t ANTLRv4 FortranParser.g4 | trclonereplace " //parserRuleSpec[RULE_REF/text()='data_target']//RULE_REF[text()='expr']" -s "_5" | trsponge -c
Unfortunately, these clone/replace rules did not change the fallback counts.
I have another idea to try, based on https://github.com/antlr/grammars-v4/issues/4181, and "shift/reduce" conflicts in a grammar.
It looks like the best way to go about fixing the scraped grammar is to open the debugger for a particularly bad performing example (examples/flang/test/f90_correct/src/intrinsic_pack.f90), and stop at the first occurrence of a full context fallback, then read out the conflicting alts and rule. You can then solve this usually by eliminating some alts that overlap, rebuild, and continue debugging.
I now have a collection of rules that are problems, with a couple of fixes. https://github.com/kaby76/fortran/issues/1
Sounds laborious. I wonder what patterns you'll find across the long haul.
So far, for entity_decl, array_spec, and designator, the problem is AdaptivePredict() reporting an "ambiguity" when choosing an alt. Strangely, this gets reported as a full-context fallback, not ambiguity. The Antlr Runtime reports no ambiguity via an ErrorListener. It does not make sense.
That said, the three rules are easy to fix and can be automated.