grammars-v4
logo/logo grammar: some example inputs are not valid utf-8
While updating every template for testing (https://github.com/antlr/grammars-v4/pull/2989), I noticed that the Python3 target does not work on the logo/logo grammar. It crashes while opening one of the example files:
../examples/logo_feature_butfail_2.txt
Traceback (most recent call last):
  File "C:\Users\Kenne\test\grammars-v4\logo\logo\Generated\Program.py", line 105, in <module>
    main(sys.argv)
  File "C:\Users\Kenne\test\grammars-v4\logo\logo\Generated\Program.py", line 66, in main
    str = FileStream(file_name, encoding);
  File "C:\Users\Kenne\AppData\Local\Programs\Python\Python310\lib\site-packages\antlr4\FileStream.py", line 20, in __init__
    super().__init__(self.readDataFrom(fileName, encoding, errors))
  File "C:\Users\Kenne\AppData\Local\Programs\Python\Python310\lib\site-packages\antlr4\FileStream.py", line 27, in readDataFrom
    return codecs.decode(bytes, encoding, errors)
  File "C:\Users\Kenne\AppData\Local\Programs\Python\Python310\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 188: invalid continuation byte
This is because the input file is not valid UTF-8, which iconv confirms:
01/13-17:19:45 /c/Users/Kenne/test/grammars-v4/logo/logo
$ iconv -f UTF-8 examples/logo_feature_butfail_2.txt
;In logo, characters in [] after print command should be output as a string,include blank space and unicode and number
print[hello world] ;should display "hello world"
print[hello
iconv: examples/logo_feature_butfail_2.txt:3:12: cannot convert
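The same check can be done without iconv. Below is a minimal sketch in Python that reports where strict UTF-8 decoding first fails. The helper name `first_invalid_utf8` is hypothetical, and the line/column convention used here (1-based, counted in bytes) is my own choice and may not match what iconv prints.

```python
from typing import Optional, Tuple


def first_invalid_utf8(data: bytes) -> Optional[Tuple[int, int, int]]:
    """Return (byte_offset, line, column) of the first byte that fails
    strict UTF-8 decoding, or None if the data is valid UTF-8.

    Hypothetical helper: line and column are 1-based and counted in bytes.
    """
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as e:
        offset = e.start
        # Lines are delimited by b"\n"; count the newlines before the offset.
        line = data.count(b"\n", 0, offset) + 1
        # rfind returns -1 when there is no earlier newline, so +1 gives 0.
        line_start = data.rfind(b"\n", 0, offset) + 1
        column = offset - line_start + 1
        return offset, line, column
    return None
```

To check a file, read it in binary mode and pass the bytes, for example `first_invalid_utf8(open(path, "rb").read())`.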
However, with the C# and Java targets, the file open succeeds and the parse goes ahead. That's because, I think, the C# and Java code replaces each invalid byte sequence with the replacement character (U+FFFD) instead of raising an error.
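The difference can be reproduced in Python alone, since the traceback shows `FileStream` passing an `errors` argument through to `codecs.decode`. The sketch below decodes a made-up byte string (not the contents of the actual example file) under the `strict` and `replace` error handlers. Whether `replace` matches what the C# and Java targets do exactly is an assumption.

```python
import codecs

# Made-up sample: 0xcd starts a two-byte UTF-8 sequence, but the next byte
# is a space, which is not a valid continuation byte.
RAW = b"print[hello \xcd world]"


def decode_strict(data: bytes) -> str:
    # Raises UnicodeDecodeError on invalid input, as in the traceback.
    return codecs.decode(data, "utf-8", "strict")


def decode_replace(data: bytes) -> str:
    # Substitutes U+FFFD for each invalid sequence and keeps going.
    return codecs.decode(data, "utf-8", "replace")


if __name__ == "__main__":
    try:
        decode_strict(RAW)
    except UnicodeDecodeError as e:
        print("strict failed at byte", e.start, "-", e.reason)
    print(repr(decode_replace(RAW)))
```

Passing `errors="replace"` when constructing `FileStream` in the Python driver would presumably make this target continue as well, though the lexer would then see U+FFFD rather than the original bytes.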
The question is whether to allow test files that contain invalid Unicode. The problem is that the parser's response to such input depends on the target, so results cannot be compared across targets for consistency.