bnfc icon indicating copy to clipboard operation
bnfc copied to clipboard

Support block comment delimiters > 2 characters in Haskell backend

Open andreasabel opened this issue 8 years ago • 5 comments

comment "--(" "--)";

reports

Warning: comment delimiters longer than 2 characters ignored in Haskell:
--( - --)

andreasabel avatar Feb 21 '17 10:02 andreasabel

The C/Java backends (?Lex) did not support more than one block comment form. This has been fixed now by introducing different start conditions COMMENT, COMMENT1, ... for different block comment forms so that they do not get mixed up.

Further, single line comments should not be expected to be terminated by a newline character, as it could also be the EOF character.

andreasabel avatar Dec 15 '19 21:12 andreasabel

For Haskell/Java, I implemented in 83681fa9893728e35277fe9987a6950fc3851573 a translation from block comment terminator strings to regular expressions that works correctly for C-style comments /* ... */ and HTML-style comments <!-- -->. (The previous solution 799ab5fb80d3b40b860ceab1f058916afefa845a used a translation that sat there in BNFC.Lexing that could not deal with block comment correctly.)

However, it fails when the terminator string has non-trivial repetitions, e.g., ananas. Here the generated regular expression skips over the terminator e.g. in case of anananas.

Prg. Program ::= [Integer];
terminator Integer "";

comment "banana" "ananas";

Input:

0
banana anananas
1
banana ananas
2

Output:

[Abstract Syntax]
Prg [0,2]

[Linearized tree]
0 2

The correct lexing of anananas would, after scanning anana and seeing nas, jump to have seen anan and looking at as.

This is a well-known problem in recognizing a search word of length m in a text of length n in O(n+m) time, and solutions are implemented in the lexer generators we target. Thus, maybe we should not try to reinvent the wheel.
It seems that going via regular expressions for this problem is anyway expensive, using lexer states or start conditions (such as used for lexing comments with flex and JLex have) might be better also when targeting Alex.

andreasabel avatar Dec 17 '19 17:12 andreasabel

In the end, I did reinvent the wheel and generate a (sometimes huge) regex for recognizing the end of a block comment: 4e57e8efa5190ae57337ed8f8a3a5cd4f4a423a6.
The reason is that in Alex, unlike flex, JLex etc., start conditions do not work out of the box, but need the monad wrapper. It seemed more work to change this than solving the puzzle how to generate the regex. (To my surprise, my solution worked on the first try, but I spent hours to formulate it elegantly using foldl/r rather than hacking it in with arrays and indices.)

Caveat: This regex can be too big for ocamllex to handle, which generates automata with max 2^15 transitions. It might be a problem with ocamllex as the resulting DFA should be small. Also, there might be a better way to do this with ocamllex.

Alex can a bit slow turning this regex into a DFA. I wonder whether there are faster algorithms that can deal with large regexes; the intermediate NFA Alex creates seems huge.

andreasabel avatar Dec 18 '19 16:12 andreasabel

The remaining problem (see #280) is in the OCaml backend: ocamllex cannot stomach the generated regexes, creating too big automata. Filed bug at https://github.com/ocaml/ocaml/issues/9964.

Alex could be faster, reported https://github.com/simonmar/alex/issues/163.

andreasabel avatar Oct 08 '20 13:10 andreasabel

Xavier Leroy recommends to use ocamllex states: https://github.com/ocaml/ocaml/issues/9964#issuecomment-705721268

andreasabel avatar Oct 12 '20 09:10 andreasabel