
Parsing of trailing TAB works differently for Python and C

Open asomov opened this issue 3 years ago • 5 comments

This works properly (note the trailing TAB):

>>> from yaml import load, CLoader as Loader, CDumper as Dumper
>>> data = load('"bar"\t', Loader=Loader)

This fails:

>>> from yaml import load, Loader, Dumper
>>> data = load('"bar"\t', Loader=Loader)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/yaml/__init__.py", line 114, in load
    return loader.get_single_data()
  File "/usr/lib/python3/dist-packages/yaml/constructor.py", line 49, in get_single_data
    node = self.get_single_node()
  File "/usr/lib/python3/dist-packages/yaml/composer.py", line 35, in get_single_node
    if not self.check_event(StreamEndEvent):
  File "/usr/lib/python3/dist-packages/yaml/parser.py", line 98, in check_event
    self.current_event = self.state()
  File "/usr/lib/python3/dist-packages/yaml/parser.py", line 142, in parse_implicit_document_start
    if not self.check_token(DirectiveToken, DocumentStartToken,
  File "/usr/lib/python3/dist-packages/yaml/scanner.py", line 116, in check_token
    self.fetch_more_tokens()
  File "/usr/lib/python3/dist-packages/yaml/scanner.py", line 258, in fetch_more_tokens
    raise ScannerError("while scanning for the next token", None,
yaml.scanner.ScannerError: while scanning for the next token
found character '\t' that cannot start any token
  in "<unicode string>", line 1, column 6:
    "bar"	
         ^

asomov avatar Dec 19 '21 11:12 asomov

The short answer to your query is that in this case libyaml is right and pyyaml is wrong.

https://play.yaml.io/main/parser?input=ImJhciIJ shows the results of 14 YAML parsers, and PyYAML, Ruamel (fork of PyYAML) and SnakeYAML get this one wrong. The New Reference Parser there is literally generated from the spec productions and therefore is almost always correct in its interpretation. That might be a useful resource for you.

The productions involved are:

  • https://yaml.org/spec/1.2.2/#rule-c-ns-flow-map-json-key-entry
    • https://yaml.org/spec/1.2.2/#rule-c-flow-json-node
      • https://yaml.org/spec/1.2.2/#rule-c-double-quoted
  • https://yaml.org/spec/1.2.2/#rule-s-separate
    • https://yaml.org/spec/1.2.2/#rule-s-separate-in-line
      • https://yaml.org/spec/1.2.2/#rule-s-white

The last production, s-white, matches a space or a tab. Put another way, whitespace that is not indentation can generally be either tabs or spaces.
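As a rough illustration of the whitespace productions linked above, here is a toy Python sketch. The names S_WHITE, S_SEPARATE_IN_LINE, TOY_DOUBLE_QUOTED and toy_accepts are hypothetical, and TOY_DOUBLE_QUOTED is a heavily simplified stand-in for c-double-quoted, not the spec production. It only shows that a tab counts as s-white in the same way a space does.

```python
import re

# s-white ::= s-space | s-tab   (a space, x20, or a tab, x09)
S_WHITE = r"[ \t]"

# s-separate-in-line ::= s-white+ | <start of line>
# Only the s-white+ branch is modelled in this sketch.
S_SEPARATE_IN_LINE = S_WHITE + r"+"

# Assumption: a single-line quoted string with backslash escapes.
# The real c-double-quoted production is considerably richer.
TOY_DOUBLE_QUOTED = r'"(?:[^"\\\n]|\\.)*"'

# A quoted scalar optionally followed by trailing in-line whitespace.
TOY_LINE = re.compile(TOY_DOUBLE_QUOTED + r"(?:" + S_SEPARATE_IN_LINE + r")?\Z")


def toy_accepts(line: str) -> bool:
    """Return True if the toy grammar accepts the whole line."""
    return TOY_LINE.match(line) is not None


if __name__ == "__main__":
    for sample in ['"bar"', '"bar" ', '"bar"\t', '"bar"\tx']:
        print(repr(sample), toy_accepts(sample))
```

Under this toy grammar a trailing tab is treated exactly like a trailing space, which is the behaviour the comment attributes to libyaml.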

ingydotnet avatar Dec 20 '21 17:12 ingydotnet

Also re https://sourceforge.net/p/yaml/mailman/yaml-core/thread/CAHJtQJ4YE19fZS%2B7fGJ11P17w6P%2BPi27GcLXtdSv6L5uxeAofA%40mail.gmail.com/#msg37404600

In that thread you show the libyaml test suite not working. I was able to run it like this:

★ ~ $ git clone git@github.com:yaml/libyaml && (cd libyaml && ./bootstrap && ./configure && make test-suite)
...
ok 214 ZWK4: Key with anchor after missing explicit mapping value
1..214
ok
All tests successful.
Files=3, Tests=452,  7 wallclock secs ( 0.14 usr  0.00 sys +  6.22 cusr  2.08 csys =  8.44 CPU)
Result: PASS
make[1]: Leaving directory '/home/ingy/libyaml/tests/run-test-suite'

Hope that helps.

Note: I'll still be looking into improving the state of libyaml's testing.

ingydotnet avatar Dec 20 '21 18:12 ingydotnet

It is not about right or wrong. It is about the very same parser either succeeding or failing on the same YAML document. It means that an import in Python changes not only the performance but also, significantly, the functionality.

asomov avatar Dec 21 '21 05:12 asomov

Ah, but they are not the very same parser. They are the two distinct parsers that PyYAML contains: a pure Python one and libyaml. Note that libyaml was originally a direct port from PyYAML, written by the same person.

There are several known places where PyYAML using pure Python and PyYAML using libyaml differ. These are of course bugs, either in the Python code or in libyaml. For the test case you posted, libyaml parses according to the spec, and PyYAML's Python parser has a bug. That's why I said libyaml is right and PyYAML (the Python code) is wrong.
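One way to check such differences on a given installation is to feed the same document to both loaders and compare the outcomes. The sketch below assumes PyYAML is installed. The helpers try_load and compare_loaders are hypothetical names, not part of PyYAML. The result may vary between PyYAML and libyaml versions, so no expected output is shown.

```python
import yaml


def try_load(document, loader_cls):
    """Load a document with one loader class and report what happened."""
    try:
        return ("ok", yaml.load(document, Loader=loader_cls))
    except yaml.YAMLError as exc:
        # ScannerError and ParserError both derive from YAMLError.
        return ("error", type(exc).__name__)


def compare_loaders(document):
    """Run the document through the pure Python loader and, if built, libyaml."""
    loaders = {"Loader": yaml.Loader}
    # CLoader only exists when PyYAML was built with the libyaml bindings.
    if yaml.__with_libyaml__:
        loaders["CLoader"] = yaml.CLoader
    return {name: try_load(document, cls) for name, cls in loaders.items()}


if __name__ == "__main__":
    # The document from this issue: a quoted scalar followed by a trailing TAB.
    for name, outcome in compare_loaders('"bar"\t').items():
        print(name, outcome)
```

If the two entries disagree, the installation shows the discrepancy described in this issue.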

Note: It was my understanding that you were trying to find out how to interpret the spec, so that you could implement your SnakeYAML Java YAML parser correctly.

ingydotnet avatar Dec 21 '21 14:12 ingydotnet

Other repos have this bug because of PyYAML: https://github.com/docker/compose/issues/5662

earonesty avatar Feb 17 '22 21:02 earonesty

@ingydotnet I think this use case is not covered by the test suite. DE56 contains a lot of trailing TABs, but not this one.

asomov avatar Jan 17 '23 10:01 asomov