json-stream
Break up large JSON list into smaller batches for processing
Let's say I have some third-party library that deals with JSON but isn't capable of streaming. I have a large amount of JSON, more data than I want to keep in memory, and I want to break it up into smaller chunks to pass to this library, so that the library doesn't run out of memory when it tries to load everything at once.
The JSON consists of lists of objects with semi-arbitrary contents, but I don't want this code to care about the details of what's in the objects; it should just break the list up into chunks of, say, 100 at a time.
I have this code, and it works, but it's not very clean.
data = json_stream.load(fd)
batch = []
for row in data:
    d = {}
    for k, v in row.items():
        d[k] = v
    batch.append(d)
    if len(batch) >= IMPORT_BATCH_SIZE:
        yield json.dumps(batch)
        batch.clear()
yield json.dumps(batch)
As you can see, I'm taking the TransientSomethingOrOther objects handed back by json_stream and manually casting them back to dicts, adding them to a list, and then writing JSON from that list.
Is there a way to avoid some of this mess that I'm missing? A way to convert the objects to dicts more easily (I tried casting with dict(), but it didn't work), or to use streamable_list in some way to help facilitate this (I read the docs a few times and spent 45 minutes trying it, but it didn't seem applicable)?
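For reference, streamable_list seems to be aimed at the opposite direction: it wraps an iterable or generator so that the standard library's json.dump() can serialize it incrementally, rather than helping with reading. A minimal sketch of that pattern, based on my reading of the project README (so treat the exact import and usage as an assumption):

import sys
import json
from json_stream import streamable_list  # assumed import path, per the project README

def rows():
    # a generator yielding plain dicts one at a time
    for i in range(5):
        yield {"id": i}

# streamable_list lets json.dump() consume the generator lazily,
# writing one element at a time instead of building the whole list in memory
json.dump(streamable_list(rows()), sys.stdout)

So it helps when producing one large JSON document from a generator, which is roughly the inverse of the batching problem described above.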
Hey, so, originally I had tried json_stream.to_standard_types() on data (in the above example), and obviously that didn't work because it converted the whole JSON document back to a list, which isn't what I wanted. But I just realized that it works on row too, so this is fine.
data = json_stream.load(fd)
batch = []
for row in data:
    batch.append(json_stream.to_standard_types(row))
    if len(batch) >= IMPORT_BATCH_SIZE:
        yield json.dumps(batch)
        batch.clear()
yield json.dumps(batch)
That's better! Am I missing anything else?
FYI, it might have helped if the documentation included an example of iterating over the top-level document this way. Initially we thought this library wouldn't work, because the documentation almost always uses it in a way that implies you already know the structure of the data. The same is true of to_standard_types.
Definitely worth updating the docs with this use case in mind.
Contributions are always welcome :)
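One small edge case worth noting in the snippets above: if the total number of rows is an exact multiple of IMPORT_BATCH_SIZE, the final yield json.dumps(batch) emits an empty list ("[]"). If the downstream library doesn't handle an empty batch gracefully, a minimal guard (a sketch reusing the names from the snippets above) would be:

    # skip the final yield when there are no leftover rows,
    # so the consumer never receives an empty "[]" batch
    if batch:
        yield json.dumps(batch)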