pyyaml
Dots in anchor names cause parsing errors
If I make a file test.yaml with these contents:
- &my.anchor
  - key: 'foo'
    value: 'bar'
And then parse the contents using PyYAML, I get a traceback:
>>> import yaml
>>> yaml.safe_load(open('test.yaml'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\yaml\__init__.py", line 162, in safe_load
return load(stream, SafeLoader)
File "C:\Python27\lib\site-packages\yaml\__init__.py", line 114, in load
return loader.get_single_data()
File "C:\Python27\lib\site-packages\yaml\constructor.py", line 66, in get_single_data
node = self.get_single_node()
File "C:\Python27\lib\site-packages\yaml\composer.py", line 36, in get_single_node
document = self.compose_document()
File "C:\Python27\lib\site-packages\yaml\composer.py", line 55, in compose_document
node = self.compose_node(None, None)
File "C:\Python27\lib\site-packages\yaml\composer.py", line 82, in compose_node
node = self.compose_sequence_node(anchor)
File "C:\Python27\lib\site-packages\yaml\composer.py", line 110, in compose_sequence_node
while not self.check_event(SequenceEndEvent):
File "C:\Python27\lib\site-packages\yaml\parser.py", line 98, in check_event
self.current_event = self.state()
File "C:\Python27\lib\site-packages\yaml\parser.py", line 379, in parse_block_sequence_first_entry
return self.parse_block_sequence_entry()
File "C:\Python27\lib\site-packages\yaml\parser.py", line 384, in parse_block_sequence_entry
if not self.check_token(BlockEntryToken, BlockEndToken):
File "C:\Python27\lib\site-packages\yaml\scanner.py", line 116, in check_token
self.fetch_more_tokens()
File "C:\Python27\lib\site-packages\yaml\scanner.py", line 231, in fetch_more_tokens
return self.fetch_anchor()
File "C:\Python27\lib\site-packages\yaml\scanner.py", line 621, in fetch_anchor
self.tokens.append(self.scan_anchor(AnchorToken))
File "C:\Python27\lib\site-packages\yaml\scanner.py", line 936, in scan_anchor
% ch.encode('utf-8'), self.get_mark())
yaml.scanner.ScannerError: while scanning an anchor
in "test.yaml", line 1, column 3
expected alphabetic or numeric character, but found '.'
in "test.yaml", line 1, column 6
The same thing happens with the C version:
>>> import yaml
>>> yaml.load(open('test.yaml'), Loader=yaml.CSafeLoader)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\yaml\__init__.py", line 114, in load
return loader.get_single_data()
File "C:\Python27\lib\site-packages\yaml\constructor.py", line 66, in get_single_data
node = self.get_single_node()
File "ext\_yaml.pyx", line 707, in _yaml.CParser.get_single_node
File "ext\_yaml.pyx", line 725, in _yaml.CParser._compose_document
File "ext\_yaml.pyx", line 774, in _yaml.CParser._compose_node
File "ext\_yaml.pyx", line 849, in _yaml.CParser._compose_sequence_node
File "ext\_yaml.pyx", line 905, in _yaml.CParser._parse_next_event
yaml.scanner.ScannerError: while scanning an anchor
in "test.yaml", line 1, column 3
did not find expected alphabetic or numeric character
in "test.yaml", line 1, column 6
Looking at PyYAML's source code, scanner.py contains the following note:
# The specification does not restrict characters for anchors and
# aliases. This may lead to problems, for instance, the document:
# [ *alias, value ]
# can be interpreted in two ways, as
# [ "value" ]
# and
# [ *alias , "value" ]
# Therefore we restrict aliases to numbers and ASCII letters.
But an exception is made for underscores and minuses:
while u'0' <= ch <= u'9' or u'A' <= ch <= u'Z' or u'a' <= ch <= u'z' \
or ch in u'-_':
Would it be possible to add dots to the exceptions as well, or could that cause parsing ambiguities?
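To make the failure concrete, here is a minimal standalone sketch of the loop condition quoted above. It is not PyYAML's actual code path, and `anchor_prefix_length` and its `extra` parameter are hypothetical names used only for illustration. It shows the scan stopping right after "my", and what would change if '.' were added to the exceptions:

```python
def anchor_prefix_length(text, extra=""):
    """Count the leading characters accepted by the quoted condition.

    `extra` is a hypothetical knob for trying out additional characters.
    """
    length = 0
    for ch in text:
        if ('0' <= ch <= '9' or 'A' <= ch <= 'Z' or 'a' <= ch <= 'z'
                or ch in ('-_' + extra)):
            length += 1
        else:
            break
    return length


name = "my.anchor"
stopped_at = anchor_prefix_length(name)           # scan stops at the '.'
with_dot = anchor_prefix_length(name, extra=".")  # scan consumes the whole name
```

With the current character set only "my" is consumed, and the scanner then rejects the '.' that follows; with '.' added, the full name is accepted.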
I agree that the dot should be allowed, and it should not cause any ambiguity. Also the forward slash should be added IMHO. Also libyaml should be changed accordingly.
Need this please.
Please fix this issue. Other YAML interpreters handle this without errors.
PyYAML (and possibly LibYAML?) is not compliant with the spec. I reviewed both YAML 1.0 and YAML 1.2.2 as published on yaml.org, and dots, along with most other non-breaking, non-space Unicode characters, are allowed in anchor and alias names. These should be allowed and are used in the wild.
I believe PyYAML claims to be YAML 1.1 compliant so here is the relevant portion of the spec for their syntax:
https://yaml.org/spec/1.1/#ns-anchor-name
c-ns-anchor-property ::= “&” ns-anchor-name
ns-anchor-name ::= ns-char+
ns-char ::= nb-char - s-white
nb-char ::= c-printable - b-char
c-printable ::= #x9 | #xA | #xD | [#x20-#x7E] /* 8 bit */
| #x85 | [#xA0-#xD7FF] | [#xE000-#xFFFD] /* 16 bit */
| [#x10000-#x10FFFF] /* 32 bit */
b-char ::= b-line-feed | b-carriage-return | b-next-line | b-line-separator | b-paragraph-separator
b-line-feed ::= #xA /*LF*/
b-carriage-return ::= #xD /*CR*/
b-next-line ::= #x85 /*NEL*/
b-line-separator ::= #x2028 /*LS*/
b-paragraph-separator ::= #x2029 /*PS*/
s-white ::= #x9 /*TAB*/ | #x20 /*SP*/
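As a rough illustration of what these productions accept, here is a small sketch that transcribes them into Python. The function names (`is_c_printable`, `is_ns_char`, `is_valid_anchor_name`) are my own and not part of PyYAML or any other library; the code only mirrors the grammar quoted above.

```python
# b-char: line feed, carriage return, next line, line/paragraph separator
B_CHAR = {0xA, 0xD, 0x85, 0x2028, 0x2029}
# s-white: tab and space
S_WHITE = {0x9, 0x20}


def is_c_printable(ch):
    """c-printable, transcribed from the YAML 1.1 production above."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD, 0x85)
        or 0x20 <= cp <= 0x7E
        or 0xA0 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_ns_char(ch):
    """ns-char ::= (c-printable - b-char) - s-white"""
    cp = ord(ch)
    return is_c_printable(ch) and cp not in B_CHAR and cp not in S_WHITE


def is_valid_anchor_name(name):
    """ns-anchor-name ::= ns-char+ (one or more ns-char)."""
    return len(name) > 0 and all(is_ns_char(ch) for ch in name)
```

Under this reading `is_valid_anchor_name("my.anchor")` holds, but so does `is_valid_anchor_name("alias,")`, which is exactly the `[ *alias, value ]` ambiguity that the comment in scanner.py worries about.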
Ran into this issue today; we were also using '.' as a namespacing separator and would've liked to use it in our anchors.
The most relevant PR in this thread seems to be https://github.com/yaml/libyaml/pull/170, which fixed the issue in libyaml, but then was later reverted. Is this issue indefinitely in limbo until something in the yaml spec clarifies allowed characters?
Is there any way in pyyaml you'd recommend locally overriding this behavior, e.g. extending the Scanner class and overriding scan_anchor (https://github.com/yaml/pyyaml/blob/main/lib/yaml/scanner.py#L899)?
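For what it's worth, a local override along those lines might look like the sketch below. It assumes Python 3 and the pure-Python loader (it has no effect on CSafeLoader, which scans in libyaml). `DottedAnchorLoader` and `EXTRA_ANCHOR_CHARS` are hypothetical names, and the method body is my reconstruction of the upstream scan_anchor logic with '.' added. It relies on scanner internals (`peek`, `forward`, `prefix`, `get_mark`) that are not a documented public API, so it could break between PyYAML versions.

```python
import yaml
from yaml.scanner import ScannerError

# Assumption: '.' is the only extra character we want to accept.
EXTRA_ANCHOR_CHARS = "."


class DottedAnchorLoader(yaml.SafeLoader):
    """Hypothetical SafeLoader subclass accepting '.' in anchors and aliases."""

    def scan_anchor(self, TokenClass):
        start_mark = self.get_mark()
        name = "alias" if self.peek() == "*" else "anchor"
        self.forward()  # skip the '&' or '*' indicator
        length = 0
        ch = self.peek(length)
        # Same condition as upstream, plus the extra characters.
        while ("0" <= ch <= "9" or "A" <= ch <= "Z" or "a" <= ch <= "z"
               or ch in ("-_" + EXTRA_ANCHOR_CHARS)):
            length += 1
            ch = self.peek(length)
        if not length:
            raise ScannerError(
                "while scanning an %s" % name, start_mark,
                "expected an anchor name character, but found %r" % ch,
                self.get_mark())
        value = self.prefix(length)
        self.forward(length)
        ch = self.peek()
        # The name must be followed by whitespace or an indicator.
        if ch not in "\0 \t\r\n\x85\u2028\u2029?:,]}%@`":
            raise ScannerError(
                "while scanning an %s" % name, start_mark,
                "expected an anchor name character, but found %r" % ch,
                self.get_mark())
        return TokenClass(value, start_mark, self.get_mark())


DOC = "- &my.anchor\n  - key: 'foo'\n    value: 'bar'\n- *my.anchor\n"
data = yaml.load(DOC, Loader=DottedAnchorLoader)
```

Since only the scanner method changes, the rest of SafeLoader's behavior (safe construction, tag resolution) should be unaffected, but treat this as a stopgap rather than a supported extension point.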
@matthewgrossman As I mentioned in my earlier comment, multiple versions of the YAML spec are very clear that dot is an allowed character. This is clearly a bug in PyYAML and/or LibYAML. I've since switched to ruamel.yaml as my Python YAML library since it is being regularly updated, supports YAML 1.2.x unlike this package, and has the dot character issue fixed. It also features an API similar to PyYAML's and needed only a few tweaks in my experience.
Thanks for the response. Agreed that YAML 1.2 does allow many more characters, and that's related to why it was reverted in the first place: https://github.com/yaml/libyaml/pull/170#issuecomment-605720525. It seems a variety of contributors are convinced that the over-permissiveness of allowed anchor characters in 1.2 could cause problems for libraries like pyyaml/libyaml, so they opted not to add more allowed characters until something was more formalized in future specs (other discussion thread). The proposed RFC now 404s for me (https://github.com/yaml/yaml-spec/blob/main/rfc/RFC-0003.md), and the discussion was around formalizing this for YAML 1.3 anyway 😕
I'll look into using ruamel.yaml instead, thanks for the pointer @penguin359