
logo/logo grammar: some example inputs are not valid utf-8

Open kaby76 opened this issue 2 years ago • 0 comments

While updating every template for testing (https://github.com/antlr/grammars-v4/pull/2989), I noticed that the Python3 target does not work on the logo/logo grammar. It crashes while opening one of the example files:

../examples/logo_feature_butfail_2.txt
Traceback (most recent call last):
  File "C:\Users\Kenne\test\grammars-v4\logo\logo\Generated\Program.py", line 105, in <module>
    main(sys.argv)
  File "C:\Users\Kenne\test\grammars-v4\logo\logo\Generated\Program.py", line 66, in main
    str = FileStream(file_name, encoding);
  File "C:\Users\Kenne\AppData\Local\Programs\Python\Python310\lib\site-packages\antlr4\FileStream.py", line 20, in __init__
    super().__init__(self.readDataFrom(fileName, encoding, errors))
  File "C:\Users\Kenne\AppData\Local\Programs\Python\Python310\lib\site-packages\antlr4\FileStream.py", line 27, in readDataFrom
    return codecs.decode(bytes, encoding, errors)
  File "C:\Users\Kenne\AppData\Local\Programs\Python\Python310\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 188: invalid continuation byte

This happens because the input file is not valid UTF-8. iconv confirms it, failing at line 3, column 12:

01/13-17:19:45 /c/Users/Kenne/test/grammars-v4/logo/logo
$ iconv -f UTF-8  examples/logo_feature_butfail_2.txt
;In logo, characters in [] after print command should be output as a string,include blank space and unicode and number
print[hello world]              ;should display "hello world"
print[hello
iconv: examples/logo_feature_butfail_2.txt:3:12: cannot convert

However, with C# and Java, opening the file succeeds and the parser continues. I think that is because the C# and Java code substitutes a replacement character for the invalid byte sequence instead of raising an error.
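To illustrate the difference, here is a minimal sketch using only the Python standard library. The byte string is a made-up stand-in for the bad line, not the real file contents. It shows that strict decoding raises the same kind of error as in the traceback, while `errors="replace"` yields U+FFFD, which is presumably close to what the C# and Java targets end up doing.

```python
# Hypothetical bytes standing in for the offending line of the example file.
# 0xCD starts a two-byte UTF-8 sequence, but the next byte (a space) is not
# a continuation byte, so the sequence is invalid.
data = b"print[hello \xcd world]"


def decode_strict(raw: bytes) -> str:
    """Decode as UTF-8, raising UnicodeDecodeError on any invalid byte."""
    return raw.decode("utf-8", errors="strict")


def decode_replace(raw: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for each invalid sequence."""
    return raw.decode("utf-8", errors="replace")


try:
    decode_strict(data)
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

replaced = decode_replace(data)

print("strict: ", strict_error)
print("replace:", replaced)
```

The traceback shows the Python runtime's `FileStream` forwarding an `errors` argument to `codecs.decode`, so passing `"replace"` there might make the Python3 target behave like the others. I have not verified that against the test templates.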

The question is whether example files that contain invalid Unicode should be allowed as test inputs at all. The problem is that the response to the decoding error depends on the target, so parse results cannot be compared across targets for consistency.
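If the decision is to disallow such files, one option would be to scan the example directories before running the tests. The following is only a sketch: `first_invalid_utf8_offset` and `find_invalid_examples` are hypothetical helper names, not part of the existing test scripts, and the `examples` directory passed at the bottom is just the layout used by this grammar.

```python
from pathlib import Path
from typing import Dict, Optional


def first_invalid_utf8_offset(path: Path) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        path.read_bytes().decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def find_invalid_examples(root: Path) -> Dict[Path, int]:
    """Map each file under root that is not valid UTF-8 to its first bad offset."""
    bad: Dict[Path, int] = {}
    for candidate in sorted(root.rglob("*")):
        if candidate.is_file():
            offset = first_invalid_utf8_offset(candidate)
            if offset is not None:
                bad[candidate] = offset
    return bad


if __name__ == "__main__":
    # Assumed layout: run from the grammar directory, e.g. logo/logo.
    for file_path, offset in find_invalid_examples(Path("examples")).items():
        print(f"{file_path}: invalid UTF-8 at byte {offset}")
```

This treats every file under the directory as text, so binary fixtures, if any exist, would need to be excluded.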

kaby76 · Jan 13 '23 22:01