pyyaml
Parsing of trailing TAB works differently for Python and C
This works properly (note the trailing TAB):
>>> from yaml import load, CLoader as Loader, CDumper as Dumper
>>> data = load('"bar"\t', Loader=Loader)
This fails:
>>> from yaml import load, Loader, Dumper
>>> data = load('"bar"\t', Loader=Loader)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/yaml/__init__.py", line 114, in load
    return loader.get_single_data()
  File "/usr/lib/python3/dist-packages/yaml/constructor.py", line 49, in get_single_data
    node = self.get_single_node()
  File "/usr/lib/python3/dist-packages/yaml/composer.py", line 35, in get_single_node
    if not self.check_event(StreamEndEvent):
  File "/usr/lib/python3/dist-packages/yaml/parser.py", line 98, in check_event
    self.current_event = self.state()
  File "/usr/lib/python3/dist-packages/yaml/parser.py", line 142, in parse_implicit_document_start
    if not self.check_token(DirectiveToken, DocumentStartToken,
  File "/usr/lib/python3/dist-packages/yaml/scanner.py", line 116, in check_token
    self.fetch_more_tokens()
  File "/usr/lib/python3/dist-packages/yaml/scanner.py", line 258, in fetch_more_tokens
    raise ScannerError("while scanning for the next token", None,
yaml.scanner.ScannerError: while scanning for the next token
found character '\t' that cannot start any token
  in "<unicode string>", line 1, column 6:
    "bar"
         ^
The short answer to your query is that in this case libyaml is right and pyyaml is wrong.
https://play.yaml.io/main/parser?input=ImJhciIJ shows the results of 14 YAML parsers, and PyYAML, Ruamel (fork of PyYAML) and SnakeYAML get this one wrong. The New Reference Parser there is literally generated from the spec productions and therefore is almost always correct in its interpretation. That might be a useful resource for you.
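For anyone wondering what that link actually feeds the parsers: the `input` parameter appears to be the document in base64. That is an assumption based on what the value decodes to, not on any documented URL format. A quick sketch to check it:

```python
import base64

# Value of the `input` query parameter from the playground link above.
encoded = "ImJhciIJ"

# Assumption: the playground uses base64 here; the URL-safe decoder is used
# because the value sits in a query string.
decoded = base64.urlsafe_b64decode(encoded)

print(repr(decoded))  # b'"bar"\t'
```

So the playground is being given exactly the test case from this issue: the double-quoted scalar `"bar"` followed by one trailing TAB.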
The productions involved are:
- https://yaml.org/spec/1.2.2/#rule-c-ns-flow-map-json-key-entry
- https://yaml.org/spec/1.2.2/#rule-c-flow-json-node
- https://yaml.org/spec/1.2.2/#rule-c-double-quoted
- https://yaml.org/spec/1.2.2/#rule-c-flow-json-node
- https://yaml.org/spec/1.2.2/#rule-s-separate
- https://yaml.org/spec/1.2.2/#rule-s-separate-in-line
- https://yaml.org/spec/1.2.2/#rule-s-white
- https://yaml.org/spec/1.2.2/#rule-s-separate-in-line
These come down to s-white, which is spaces and tabs. Put another way, non-indentation whitespace can usually be either tabs or spaces.
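To make the whitespace rule concrete, here is a rough sketch of the `s-white` production as a regular expression. It only models `s-white ::= s-space | s-tab` and the `s-white+` branch of `s-separate-in-line`; the helper name `trailing_separation` is made up for illustration and is not part of any YAML library.

```python
import re

# s-white ::= s-space | s-tab
S_WHITE = r"[ \t]"


def trailing_separation(line: str) -> str:
    """Return the run of s-white characters at the end of `line` ('' if none)."""
    match = re.search(rf"{S_WHITE}+\Z", line)
    return match.group(0) if match else ""


# The document from this issue: a double-quoted scalar plus a trailing TAB.
print(repr(trailing_separation('"bar"\t')))
```

Under this reading, the TAB after `"bar"` is just separation whitespace, the same as a trailing space would be.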
Also, regarding https://sourceforge.net/p/yaml/mailman/yaml-core/thread/CAHJtQJ4YE19fZS%2B7fGJ11P17w6P%2BPi27GcLXtdSv6L5uxeAofA%40mail.gmail.com/#msg37404600
in which you show the libyaml test suite not working: I was able to run this:
★ ~ $ git clone git@github.com:yaml/libyaml && (cd libyaml && ./bootstrap && ./configure && make test-suite)
...
ok 214 ZWK4: Key with anchor after missing explicit mapping value
1..214
ok
All tests successful.
Files=3, Tests=452, 7 wallclock secs ( 0.14 usr 0.00 sys + 6.22 cusr 2.08 csys = 8.44 CPU)
Result: PASS
make[1]: Leaving directory '/home/ingy/libyaml/tests/run-test-suite'
Hope that helps.
Note: I'll still be looking into improving the state of libyaml's testing.
It is not about right or wrong. It is about the fact that the very same parser either succeeds or fails for the same YAML document. It means that an import in Python not only changes the performance but also significantly changes the functionality.
Ah, but they are not the very same parser. They are the two distinct parsers that PyYAML contains: a pure Python one and libyaml. Note that libyaml was originally a direct port from PyYAML, written by the same person.
There are several known places where PyYAML using pure Python and PyYAML using libyaml differ. These are of course bugs, either in the Python code or in libyaml. For the test case you posted, libyaml parses according to the spec, and PyYAML's Python parser has a bug. That's why I said libyaml is right and PyYAML (the Python code) is wrong.
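If you want to check whether the two parsers disagree on your own install, something like the following should do it. This is a minimal sketch: the `try_load` helper and the `results` dict are names made up for this example, and it uses the Safe loader variants rather than the plain `Loader`/`CLoader` from the report. The libyaml row only appears if your PyYAML was built with the C bindings.

```python
import yaml

# Double-quoted scalar followed by a trailing TAB, as in the report.
DOC = '"bar"\t'


def try_load(loader_cls):
    """Return ('ok', value) on success or ('error', exception class name)."""
    try:
        return ("ok", yaml.load(DOC, Loader=loader_cls))
    except yaml.YAMLError as exc:
        return ("error", type(exc).__name__)


results = {"pure Python": try_load(yaml.SafeLoader)}

# CSafeLoader is only exported when the libyaml bindings are available.
c_loader = getattr(yaml, "CSafeLoader", None)
if c_loader is not None:
    results["libyaml"] = try_load(c_loader)

for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

Any difference between the two rows is the discrepancy discussed here. What you see will depend on your PyYAML and libyaml versions.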
Note: It was my understanding that you were trying to find out how to interpret the spec, so that you could implement your SnakeYAML Java YAML parser correctly.
Other repos have this bug because of PyYAML: https://github.com/docker/compose/issues/5662
@ingydotnet I think this use case is not covered by the test suite. DE56 contains a lot of trailing TABs, but not this one.