pyyaml

ENH: Keep information about line numbers

Open quazgar opened this issue 3 years ago • 5 comments

It would be great for further processing, and for giving user-friendly feedback about semantic errors, if there were a way to store metadata about loaded objects. This metadata could include:

  • line number or origin
  • original text representation

Actually there seems to be some stubby implementation over at StackOverflow already: https://stackoverflow.com/questions/13319067/parsing-yaml-return-with-line-number

quazgar avatar Nov 12 '20 11:11 quazgar

This would be handy! Having spent a day trying to monkey patch a solution (and failing)...

james-powis avatar Jan 08 '21 22:01 james-powis

I would love this, too!

nickleroy avatar May 10 '21 21:05 nickleroy

If you don't mind working with PyYAML's internal data structures, here's a way:

import yaml

content = """- a
- { b: 0, c: 10 }
- d
"""

loader = yaml.Loader(content)
node = loader.get_single_node()

print(node)

Running the above piece of code will produce the following output (save for the formatting):

SequenceNode(tag='tag:yaml.org,2002:seq', value=[
    ScalarNode(tag='tag:yaml.org,2002:str', value='a'),
    MappingNode(tag='tag:yaml.org,2002:map', value=[
        (
            ScalarNode(tag='tag:yaml.org,2002:str', value='b'),
            ScalarNode(tag='tag:yaml.org,2002:int', value='0')
        ), (
            ScalarNode(tag='tag:yaml.org,2002:str', value='c'),
            ScalarNode(tag='tag:yaml.org,2002:int', value='10')
        )
    ]),
    ScalarNode(tag='tag:yaml.org,2002:str', value='d')
])
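
The printed representation hides it, but each node in this tree also carries start_mark and end_mark attributes, and that is where the line information lives. Here is a minimal sketch that lists every node with its position. It assumes you want one-based numbers for display (the marks themselves are zero-based), and iter_nodes is a helper name made up for this example:

```python
import yaml

content = """- a
- { b: 0, c: 10 }
- d
"""


def iter_nodes(node):
    """Yield a node and all of its descendants, depth first."""
    yield node
    if isinstance(node, yaml.SequenceNode):
        for child in node.value:
            yield from iter_nodes(child)
    elif isinstance(node, yaml.MappingNode):
        # Mapping values are stored as (key_node, value_node) pairs.
        for key_node, value_node in node.value:
            yield from iter_nodes(key_node)
            yield from iter_nodes(value_node)


root = yaml.compose(content, Loader=yaml.SafeLoader)

# Convert the zero-based marks to one-based (line, column) for display.
positions = [
    (type(n).__name__, n.start_mark.line + 1, n.start_mark.column + 1)
    for n in iter_nodes(root)
]

for name, line, column in positions:
    print(f"{name} at line {line}, column {column}")
```

The downside is that you work with nodes rather than the constructed Python objects, so you have to map back from a value to its node yourself.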

Also see:

  • SafeLoader: https://github.com/yaml/pyyaml/blob/master/lib/yaml/loader.py#L31
  • load(): https://github.com/yaml/pyyaml/blob/master/lib/yaml/__init__.py#L74
  • get_single_data(): https://github.com/yaml/pyyaml/blob/master/lib/yaml/constructor.py#L47
  • get_single_node(): https://github.com/yaml/pyyaml/blob/master/lib/yaml/composer.py#L29

(permanent links, like: github.com/yaml/pyyaml/blob/8cdff2c80573b8be8e8ad28929264a913a63aa33/lib/yaml/loader.py#L31)

mathieucaroff avatar Jan 26 '22 19:01 mathieucaroff

@mathieucaroff

I think that's exactly the same as:

import yaml

content = """- a
- { b: 0, c: 10 }
- d
"""

node = yaml.compose(content)

print(node)

ingydotnet avatar Jan 26 '22 20:01 ingydotnet

@ingydotnet

I think that's exactly the same as (...)

I agree, thanks! Your solution, yaml.compose(x), is a better way to obtain the node tree of the document. It is shorter, and it is cleaner too, because it calls .dispose() on the loader in a finally block.
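
For anyone wondering what the difference amounts to, this is roughly the pattern as I understand it from lib/yaml/__init__.py. Treat it as a sketch rather than a copy of the library code; compose_single is a name made up for this example:

```python
import yaml


def compose_single(stream, loader_cls=yaml.SafeLoader):
    """Return the node tree of a single-document stream."""
    loader = loader_cls(stream)
    try:
        return loader.get_single_node()
    finally:
        # Release the loader's internal state even if parsing raised.
        loader.dispose()


node = compose_single("- a\n- b\n")
print([child.value for child in node.value])
```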

mathieucaroff avatar Jan 27 '22 08:01 mathieucaroff

Yet another approach.

import yaml

class MyLoader(yaml.SafeLoader):

    def __init__(self, stream):
        super().__init__(stream)
        self.locations = {}

    def compose_node(self, parent, index):
        node = super().compose_node(parent, index)
        node._myloader_location = (self.line, self.column)
        return node

    def construct_object(self, node, deep=False):
        obj = super().construct_object(node, deep=deep)
        key = id(obj)
        if key in self.locations:
            self.locations[key] = None
        else:
            self.locations[key] = node._myloader_location
        return obj

    @classmethod
    def load(cls, stream):
        loader = cls(stream)
        try:
            return loader.get_single_data(), loader.locations
        finally:
            loader.dispose()

MyLoader.load(stream) returns a 2-tuple. The first element is the object that would be returned by a call to yaml.safe_load() (list, dictionary, scalar, None, etc.), and the second element is a dictionary containing the location information ((line, column) tuples) for most of the objects within the "normal" object, indexed by their ID.

For example:

>>> yml = "- foo\n- bar\n- baz\n"
>>> print(yml)
- foo
- bar
- baz

>>> doc, locations = MyLoader.load(yml)
>>> doc
['foo', 'bar', 'baz']
>>> locations[id(doc[0])]
(1, 0)
>>> locations[id(doc[1])]
(2, 0)
>>> locations[id(doc[2])]
(3, 0)

This approach does have two limitations.

First, if more than one occurrence of a singleton (such as None) is encountered, no location is returned for that object ID. (The location dictionary will map the singleton's ID to None.)

>>> yml = "-\n-\n-\n"
>>> print(yml)
-
-
-

>>> doc, locations = MyLoader.load(yml)
>>> doc
[None, None, None]
>>> print(locations[id(doc[0])])
None
>>> locations
{140414278062784: (3, 0), 140414513597984: None}

The second limitation of this approach is that the location information returned for sequences and mappings (Python lists and dictionaries) reflects the end of those objects rather than the beginning, because compose_node() reads the loader's position only after the whole node has been composed.

>>> yml = "- foo\n- bar\n- baz\n"
>>> print(yml)
- foo
- bar
- baz

>>> doc, locations = MyLoader.load(yml)
>>> locations[id(doc)]
(3, 0)
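
A possible way around this second limitation, offered as a sketch rather than a tested replacement, is to skip the compose_node() override and read node.start_mark inside construct_object() instead. StartMarkLoader is a name made up for this example. Note that marks are zero-based, so the numbers will not match the ones shown above:

```python
import yaml


class StartMarkLoader(yaml.SafeLoader):
    """Variant of the loader above that records where each node starts."""

    def __init__(self, stream):
        super().__init__(stream)
        self.locations = {}

    def construct_object(self, node, deep=False):
        obj = super().construct_object(node, deep=deep)
        key = id(obj)
        if key in self.locations:
            # Same object seen again (e.g. None): the location is ambiguous.
            self.locations[key] = None
        else:
            # start_mark holds the zero-based position where the node begins.
            self.locations[key] = (node.start_mark.line, node.start_mark.column)
        return obj

    @classmethod
    def load(cls, stream):
        loader = cls(stream)
        try:
            return loader.get_single_data(), loader.locations
        finally:
            loader.dispose()


doc, locations = StartMarkLoader.load("- foo\n- bar\n- baz\n")
print(locations[id(doc)])     # position of the list itself
print(locations[id(doc[1])])  # position of 'bar'
```

The first limitation still applies, since the dictionary is keyed by object ID.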

Despite these limitations, I've still found that this approach can help improve error reporting when working with parsed YAML.
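
To illustrate the kind of error reporting meant here, this is a small sketch of a semantic check that puts the position into the message. The rule itself (a top-level "port" key must hold an integer) and the check_ports name are invented for this example, and it works on the node tree from yaml.compose() rather than on the loader above:

```python
import yaml

INT_TAG = "tag:yaml.org,2002:int"


def check_ports(text):
    """Return error messages for non-integer top-level `port` values."""
    root = yaml.compose(text, Loader=yaml.SafeLoader)
    errors = []
    if isinstance(root, yaml.MappingNode):
        for key_node, value_node in root.value:
            if key_node.value == "port" and value_node.tag != INT_TAG:
                mark = value_node.start_mark
                # Marks are zero-based; add 1 for human-readable positions.
                errors.append(
                    f"line {mark.line + 1}, column {mark.column + 1}: "
                    f"port must be an integer, got {value_node.value!r}"
                )
    return errors


for message in check_ports("host: example\nport: eighty\n"):
    print(message)
```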

ipilcher avatar Nov 04 '22 23:11 ipilcher