Dots in anchor names cause parsing errors

Open Infernio opened this issue 5 years ago • 8 comments

If I make a file test.yaml with these contents:

- &my.anchor
  - key: 'foo'
    value: 'bar'

And then parse the contents using PyYAML, I get a traceback:

>>> import yaml
>>> yaml.safe_load(open('test.yaml'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\yaml\__init__.py", line 162, in safe_load
    return load(stream, SafeLoader)
  File "C:\Python27\lib\site-packages\yaml\__init__.py", line 114, in load
    return loader.get_single_data()
  File "C:\Python27\lib\site-packages\yaml\constructor.py", line 66, in get_single_data
    node = self.get_single_node()
  File "C:\Python27\lib\site-packages\yaml\composer.py", line 36, in get_single_node
    document = self.compose_document()
  File "C:\Python27\lib\site-packages\yaml\composer.py", line 55, in compose_document
    node = self.compose_node(None, None)
  File "C:\Python27\lib\site-packages\yaml\composer.py", line 82, in compose_node
    node = self.compose_sequence_node(anchor)
  File "C:\Python27\lib\site-packages\yaml\composer.py", line 110, in compose_sequence_node
    while not self.check_event(SequenceEndEvent):
  File "C:\Python27\lib\site-packages\yaml\parser.py", line 98, in check_event
    self.current_event = self.state()
  File "C:\Python27\lib\site-packages\yaml\parser.py", line 379, in parse_block_sequence_first_entry
    return self.parse_block_sequence_entry()
  File "C:\Python27\lib\site-packages\yaml\parser.py", line 384, in parse_block_sequence_entry
    if not self.check_token(BlockEntryToken, BlockEndToken):
  File "C:\Python27\lib\site-packages\yaml\scanner.py", line 116, in check_token
    self.fetch_more_tokens()
  File "C:\Python27\lib\site-packages\yaml\scanner.py", line 231, in fetch_more_tokens
    return self.fetch_anchor()
  File "C:\Python27\lib\site-packages\yaml\scanner.py", line 621, in fetch_anchor
    self.tokens.append(self.scan_anchor(AnchorToken))
  File "C:\Python27\lib\site-packages\yaml\scanner.py", line 936, in scan_anchor
    % ch.encode('utf-8'), self.get_mark())
yaml.scanner.ScannerError: while scanning an anchor
  in "test.yaml", line 1, column 3
expected alphabetic or numeric character, but found '.'
  in "test.yaml", line 1, column 6

The same thing happens with the C version:

>>> import yaml
>>> yaml.load(open('test.yaml'), Loader=yaml.CSafeLoader)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\yaml\__init__.py", line 114, in load
    return loader.get_single_data()
  File "C:\Python27\lib\site-packages\yaml\constructor.py", line 66, in get_single_data
    node = self.get_single_node()
  File "ext\_yaml.pyx", line 707, in _yaml.CParser.get_single_node
  File "ext\_yaml.pyx", line 725, in _yaml.CParser._compose_document
  File "ext\_yaml.pyx", line 774, in _yaml.CParser._compose_node
  File "ext\_yaml.pyx", line 849, in _yaml.CParser._compose_sequence_node
  File "ext\_yaml.pyx", line 905, in _yaml.CParser._parse_next_event
yaml.scanner.ScannerError: while scanning an anchor
  in "test.yaml", line 1, column 3
did not find expected alphabetic or numeric character
  in "test.yaml", line 1, column 6

Looking at PyYAML's source code, scanner.py contains the following note:

# The specification does not restrict characters for anchors and
# aliases. This may lead to problems, for instance, the document:
#   [ *alias, value ]
# can be interpreted in two ways, as
#   [ "value" ]
# and
#   [ *alias , "value" ]
# Therefore we restrict aliases to numbers and ASCII letters.

But an exception is made for underscores and minuses:

while u'0' <= ch <= u'9' or u'A' <= ch <= u'Z' or u'a' <= ch <= u'z'    \
        or ch in u'-_':

Would it be possible to add dots to the exceptions as well, or could that cause parsing ambiguities?
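
To make the question concrete, here is a minimal standalone sketch of the rule (Python 3, no PyYAML needed). The name-character test mirrors the `while` condition quoted above. The `TERMINATORS` set is an assumption based on my reading of `scan_anchor`, and the helper names `split_name` and `check_name` are made up for illustration.

```python
# Characters allowed besides ASCII letters and digits (from the loop above).
DEFAULT_EXTRAS = "-_"

# Assumption: the characters scan_anchor accepts right after a name.
# Check this against the scanner.py you actually have installed.
TERMINATORS = "\0 \t\r\n\x85\u2028\u2029?:,]}%@`"


def is_name_char(ch, extras=DEFAULT_EXTRAS):
    """True if ch may appear inside an anchor or alias name."""
    return ("0" <= ch <= "9" or "A" <= ch <= "Z" or "a" <= ch <= "z"
            or ch in extras)


def split_name(text, extras=DEFAULT_EXTRAS):
    """Split the text after '&' or '*' into (accepted name, remainder)."""
    length = 0
    while length < len(text) and is_name_char(text[length], extras):
        length += 1
    return text[:length], text[length:]


def check_name(text, extras=DEFAULT_EXTRAS):
    """True if the scanner rule would accept the name at the start of text."""
    name, rest = split_name(text, extras)
    next_char = rest[:1] or "\0"  # end of input behaves like '\0'
    return bool(name) and next_char in TERMINATORS


print(split_name("my.anchor"))                  # ('my', '.anchor')
print(check_name("my.anchor"))                  # False: '.' follows 'my'
print(check_name("my.anchor", extras="-_."))    # True once '.' is allowed
print(check_name("a.b, value ]", extras="-_.")) # True: ',' still ends the name
```

Under these assumptions, allowing `.` does not touch the `[ *alias, value ]` case from the comment, because `,` and `]` still end the name. That suggests no new ambiguity, but it is not a proof.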

Infernio avatar Jan 11 '20 17:01 Infernio

I agree that the dot should be allowed, and it should not cause any ambiguity. Also the forward slash should be added IMHO. Also libyaml should be changed accordingly.

perlpunk avatar Jan 13 '20 09:01 perlpunk

Need this please.

vaibhavparnalia avatar Aug 11 '20 20:08 vaibhavparnalia

Please fix this issue. Other YAML interpreters handle this without errors.

MartinDevillers avatar Jul 05 '23 18:07 MartinDevillers

PyYAML (and possibly LibYAML?) is not compliant with the spec. I reviewed both YAML 1.0 and YAML 1.2.2 as published on yaml.org, and dots, along with most other non-breaking, non-space Unicode characters, are allowed in anchor and alias names. These should be allowed and are used in the wild.

penguin359 avatar Jan 08 '24 10:01 penguin359

I believe PyYAML claims to be YAML 1.1 compliant so here is the relevant portion of the spec for their syntax:

https://yaml.org/spec/1.1/#ns-anchor-name

c-ns-anchor-property ::= “&” ns-anchor-name
ns-anchor-name ::= ns-char+
ns-char ::= nb-char - s-white
nb-char ::= c-printable - b-char
c-printable ::= #x9 | #xA | #xD | [#x20-#x7E] /* 8 bit */
    | #x85 | [#xA0-#xD7FF] | [#xE000-#xFFFD]  /* 16 bit */
    | [#x10000-#x10FFFF]                      /* 32 bit */
b-char ::= b-line-feed | b-carriage-return | b-next-line | b-line-separator | b-paragraph-separator
b-line-feed ::= #xA /*LF*/
b-carriage-return ::= #xD /*CR*/
b-next-line ::= #x85 /*NEL*/
b-line-separator ::= #x2028 /*LS*/
b-paragraph-separator ::= #x2029 /*PS*/
s-white ::= #x9 /*TAB*/ | #x20 /*SP*/
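
As a rough illustration, the productions above can be transcribed into a small Python 3 check. This is only a sketch of the quoted grammar, not anything PyYAML ships, and the function names are made up here.

```python
# b-char: the line-break characters excluded from nb-char.
B_CHAR = {0xA, 0xD, 0x85, 0x2028, 0x2029}
# s-white: tab and space, excluded from ns-char.
S_WHITE = {0x9, 0x20}


def is_c_printable(cp):
    """c-printable, transcribed from the production quoted above."""
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0x7E
            or cp == 0x85 or 0xA0 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def is_ns_char(ch):
    """ns-char ::= nb-char - s-white, with nb-char ::= c-printable - b-char."""
    cp = ord(ch)
    return is_c_printable(cp) and cp not in B_CHAR and cp not in S_WHITE


def is_anchor_name(name):
    """ns-anchor-name ::= ns-char+ (one or more ns-char)."""
    return len(name) > 0 and all(is_ns_char(ch) for ch in name)


print(is_anchor_name("my.anchor"))  # True: '.' is an ns-char
print(is_anchor_name("my anchor"))  # False: space is s-white
print(is_anchor_name("alias,"))     # True: even ',' is allowed by this grammar
```

The last line shows how permissive the grammar is: read literally, a trailing `,` belongs to the name, which is the kind of case the comment in scanner.py is worried about.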

penguin359 avatar Jan 08 '24 10:01 penguin359

Ran into this issue today, we also were using '.' as a namespacing separator and would've liked to use it in our anchors.

The most relevant PR in this thread seems to be https://github.com/yaml/libyaml/pull/170, which fixed the issue in libyaml, but then was later reverted. Is this issue indefinitely in limbo until something in the yaml spec clarifies allowed characters?

Is there any way in pyyaml you'd recommend locally overriding this behavior? E.g. extending the Scanner class and overriding scan_anchor (https://github.com/yaml/pyyaml/blob/main/lib/yaml/scanner.py#L899)?
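
For reference, a local override along those lines might look like the sketch below (Python 3). The body is modelled on my reading of upstream `scan_anchor`, with `.` added to the accepted characters, so compare it against the scanner.py in your installed version before relying on it. The class name `DottedAnchorLoader` is hypothetical.

```python
import yaml
from yaml.scanner import ScannerError


class DottedAnchorLoader(yaml.SafeLoader):
    """SafeLoader variant (hypothetical) that also accepts '.' in anchors."""

    def scan_anchor(self, TokenClass):
        start_mark = self.get_mark()
        name = "alias" if self.peek() == "*" else "anchor"
        self.forward()  # skip the '&' or '*' indicator
        length = 0
        ch = self.peek(length)
        # Same test as upstream, plus '.' (the only intended change).
        while ("0" <= ch <= "9" or "A" <= ch <= "Z" or "a" <= ch <= "z"
               or ch in "-_."):
            length += 1
            ch = self.peek(length)
        if not length:
            raise ScannerError(
                "while scanning an %s" % name, start_mark,
                "expected alphabetic or numeric character, but found %r" % ch,
                self.get_mark())
        value = self.prefix(length)
        self.forward(length)
        ch = self.peek()
        # Assumption: this terminator set matches upstream PyYAML.
        if ch not in "\0 \t\r\n\x85\u2028\u2029?:,]}%@`":
            raise ScannerError(
                "while scanning an %s" % name, start_mark,
                "expected alphabetic or numeric character, but found %r" % ch,
                self.get_mark())
        return TokenClass(value, start_mark, self.get_mark())


doc = "- &my.anchor\n  - key: 'foo'\n    value: 'bar'\n- *my.anchor\n"
print(yaml.load(doc, Loader=DottedAnchorLoader))
```

This only affects the pure-Python loader. `CSafeLoader` scans inside libyaml, so the override does nothing there.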

matthewgrossman avatar Mar 12 '24 04:03 matthewgrossman

@matthewgrossman As I mentioned in my earlier comment, the YAML spec is very clear that dot is an allowed character in multiple versions of the spec. This is clearly a bug in PyYAML and/or LibYAML. I've since switched to ruamel.yaml as my Python YAML library since it is being regularly updated, supports YAML 1.2.x unlike this package, and has the dot character issue fixed. It also features an API similar to PyYAML's and needed only a few tweaks in my experience.

penguin359 avatar Mar 12 '24 05:03 penguin359

Thanks for the response. Agreed that YAML 1.2 does allow many more characters, and that's related to why the change was reverted in the first place: https://github.com/yaml/libyaml/pull/170#issuecomment-605720525. It seems a variety of contributors are convinced that the over-permissiveness of allowed anchor characters in 1.2 could cause problems for libraries like pyyaml/libyaml, so they opted not to add more allowed characters until something was more formalized in future specs (other discussion thread). The proposed RFC now 404s for me (https://github.com/yaml/yaml-spec/blob/main/rfc/RFC-0003.md), and the discussion was around formalizing this for YAML 1.3 anyway 😕

I'll look into using ruamel.yaml instead, thanks for the pointer @penguin359

matthewgrossman avatar Mar 12 '24 05:03 matthewgrossman