json-stream
Break up large JSON list into smaller batches for processing
Let's say I have some third-party library that deals with JSON but isn't capable of streaming. I have a large amount of JSON, more data than I want to keep in memory, and I want to break it up into smaller chunks to pass to this library, so that the library doesn't run out of memory when it tries to load everything at once.
The JSON consists of lists of objects with semi-arbitrary contents, but I don't want this code to care about the details of what's in the objects; it should just break the list up into chunks of, say, 100 at a time.
I have this code, and it works, but it's not very clean.
data = json_stream.load(fd)
batch = []
for row in data:
    d = {}
    for k, v in row.items():
        d[k] = v
    batch.append(d)
    if len(batch) >= IMPORT_BATCH_SIZE:
        yield json.dumps(batch)
        batch.clear()
yield json.dumps(batch)
As you can see, I'm taking the TransientSomethingOrOther objects handed back by json_stream and manually casting them back to dicts, adding them to a list, and then writing JSON from that list.
Is there a way to avoid some of this mess that I'm missing? A way to convert the objects to dicts more easily (I tried casting with dict(), but it didn't work), or to use streamable_list in some way to help facilitate this (I read the docs a few times and spent 45 minutes trying it, but it didn't seem applicable)?
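For reference, streamable_list seems to be aimed at the opposite direction: it wraps an iterable or generator so that the standard library's json.dump() can serialize it incrementally, rather than helping with reading. A minimal sketch of that pattern, based on my reading of the project README (so treat the exact import and usage as an assumption):

import sys
import json
from json_stream import streamable_list  # assumed import path, per the project README

def rows():
    # a generator yielding plain dicts one at a time
    for i in range(5):
        yield {"id": i}

# streamable_list lets json.dump() consume the generator lazily,
# writing one element at a time instead of building the whole list in memory
json.dump(streamable_list(rows()), sys.stdout)

So it helps when producing one large JSON document from a generator, which is roughly the inverse of the batching problem described above.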
Hey, so, originally I had tried json_stream.to_standard_types() on data (in the above example), and obviously that didn't work because it converted the whole JSON document back to a list, which isn't what I wanted. But I just realized that it works on row too, so this is fine.
data = json_stream.load(fd)
batch = []
for row in data:
    batch.append(json_stream.to_standard_types(row))
    if len(batch) >= IMPORT_BATCH_SIZE:
        yield json.dumps(batch)
        batch.clear()
yield json.dumps(batch)
That's better! Am I missing anything else?
FYI, it might have helped if the documentation included an example of iterating over the top-level document this way. Initially we thought this library wouldn't work, because the documentation almost always uses it in a way that implies you already know the structure of the data. The same is true of to_standard_types.
Definitely worth updating the docs with this use case in mind.
Contributions are always welcome :)
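One small edge case worth noting in the snippets above: if the total number of rows is an exact multiple of IMPORT_BATCH_SIZE, the final yield json.dumps(batch) emits an empty list ("[]"). If the downstream library doesn't handle an empty batch gracefully, a minimal guard (a sketch reusing the names from the snippets above) would be:

    # skip the final yield when there are no leftover rows,
    # so the consumer never receives an empty "[]" batch
    if batch:
        yield json.dumps(batch)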