emit_yaml in Python returns unexpected object

Open matham opened this issue 6 months ago • 1 comments

I'm testing ryml for parsing very large YAML files in Python. I'm following the approach from here, which uses JSON as an intermediary to handle Python objects: https://github.com/4C-multiphysics/fourcipp/blob/8d9b5b76320643b54e797224d2dffc3984a3e961/src/fourcipp/utils/yaml_io.py

My code is this:

import json

import ryml


def dict_to_yaml_string(data: dict) -> str:
    # Convert dictionary into a ryml tree
    tree = ryml.parse_in_arena(json.dumps(data).encode("utf8"))

    # remove all style bits to enable a YAML style output
    # see https://github.com/biojppm/rapidyaml/issues/520
    for node_id, _ in ryml.walk(tree):
        if tree.is_map(node_id) or tree.is_seq(node_id):
            tree.set_container_style(node_id, ryml.NOTYPE)

        if tree.has_key(node_id):
            tree.set_key_style(node_id, ryml.NOTYPE)

        if tree.has_val(node_id):
            tree.set_val_style(node_id, ryml.NOTYPE)

    return ryml.emit_yaml(tree)

I've tried it with a small amount of data and it works, returning the YAML data. When I tried it with a very large amount of data, I instead got this object: <Swig Object of type 'char *' at 0x000001C9E51DB210>. I'm not sure what to do with this.

I assume some error occurred but I don't know how to get any error info out of this. I made sure the problem is not with json because I was able to save the json output to a 5GB file.

I see some references to similar output here and here, which point to this, and it is perhaps this problem: https://github.com/swig/swig/issues/3134. If it's the same issue, it'll be fixed in the next SWIG release.

compute_emit_yaml_length returns 7727850851 for the data.
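For scale, the reported length can be compared against the usual 32-bit integer limits. This is only a sketch of the arithmetic: the thread does not confirm that a 32-bit size limit in the SWIG wrapper is the actual cause, so that is an assumption here, and the `fits` helper is a hypothetical name used for illustration.

```python
# Length reported by compute_emit_yaml_length in this issue.
emit_length = 7_727_850_851

INT32_MAX = 2**31 - 1   # largest signed 32-bit value
UINT32_MAX = 2**32 - 1  # largest unsigned 32-bit value


def fits(n: int, limit: int) -> bool:
    """Hypothetical helper: True if n can be represented within limit."""
    return n <= limit


# Size in GiB (about 7.2 GiB for the value above).
print(f"{emit_length / 2**30:.2f} GiB")

# Assumption: if the wrapper converts the length through a 32-bit
# integer, neither the signed nor the unsigned range can hold it.
print("fits in int32: ", fits(emit_length, INT32_MAX))
print("fits in uint32:", fits(emit_length, UINT32_MAX))
```

Both checks come out false, so the output is larger than anything a 32-bit length can describe, which would be consistent with the small input working and the large one failing.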

Is there some kind of workaround in the meantime? Based on their tag history, I think it may take a while until their next release. Is there maybe some way to emit the output in smaller chunks, assuming the size is the issue?

Jul 04 '25 01:07 matham

I found a workaround: create the buffer manually. Instead of

    return ryml.emit_yaml(tree)

I now have

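    # First pass: compute how many bytes the emitted YAML needs
    n = ryml.compute_emit_yaml_length(tree)
    # Allocate the output buffer on the Python side
    buffer = bytearray(n)
    # Second pass: emit the YAML directly into that buffer
    res = ryml.emit_yaml_in_place(tree, buffer)
    assert res.nbytes == n
    # Note: this returns a bytearray, not a str
    return buffer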

which seems to work. I assume it's slower than the first version since we're counting and then writing, unless it's somehow cached internally? But that's fine for me.

Jul 04 '25 07:07 matham