
Storing/querying huge values

Open · philrz opened this issue 2 years ago · 1 comment

At the time of the filing of this issue, Zed is at commit 9abaa77.

A community user asked the following in a Slack thread:

I am working with some health care plans generating the "Transparency Machine Readable Files" (MRF) and am having issues with certain files (example file: https://app0004702110a5prdnc685.blob.core.windows.net/output/2023-09-15_Blue-Cross-and-Blue-Shield-of-Texas_Blue-Essentials_in-network-rates.json.gz). I am able to load the data into a pool, but when I try to query it, I get zngio: frame length 2962410246 exceeds maximum allowed 1073741824. I have a few others where the raw JSON is over 100 GB in size, and I needed to do some creative work with swap files on my machine to get them to load.

This is a production dataset of the machine-readable file format mandated by the US CMS (Centers for Medicare & Medicaid Services). All health care insurers are required to follow the schema, though there is a lot of variance in how the files are produced. Somehow this is supposed to make shopping for healthcare easier, as long as you can process this amount of data :slightly_smiling_face:. Some insurers do one file per provider (thousands of files, terabytes of data); others consolidate in various ways. Right now there is no good way to process these files. The other (much larger) file I have tried, and was able to load into a lake after increasing swap space to 512 GB, is https://mrf.healthsparq.com/aetnacvs-egress.nophi.kyruushsq.com/prd/mrf/AETNACVS_I/ALI[…]42e9-88bb-7fcd5db538aa_Aetna-Life-Insurance-Company.json.gz

For the first file I shared, I was able to update reader.go in the zngio module to change the MaxSize constant:

const (
	ReadSize  = 512 * 1024
	MaxSize   = 1024 * 1024 * 1024 * 8
	TypeLimit = 10000
)

And that works.

We looked closer at the data and confirmed that it's one gigantic JSON object. Most of Zed's JSON users with 10-100+ GB of data have it either as modestly-sized, single-line JSON objects separated by newlines, or as one giant JSON array of similarly modest objects, which is easy to unravel by reading the array elements as a stream. This user, by contrast, was hitting up against (and ultimately raising) the maximum size of a single ZNG value, e.g., a record, which is what a JSON object becomes when read in.
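To illustrate the contrast, here's a minimal stdlib-only Python sketch of the newline-delimited shape that streams easily. The `provider`/`rate` fields and the sample data are invented for illustration and are not part of the MRF schema:

```python
import io
import json

# Hypothetical sample: newline-delimited JSON (NDJSON), the shape that
# streams easily because each value is small and self-contained.
ndjson = io.StringIO(
    '{"provider": "a", "rate": 1.5}\n'
    '{"provider": "b", "rate": 2.0}\n'
)

# Each line parses independently, so memory use is bounded by the largest
# single line rather than the whole file.
rates = [json.loads(line)["rate"] for line in ndjson]
print(rates)  # [1.5, 2.0]
```

A single giant object offers no such line boundaries, which is why the whole thing must become one in-memory value unless it's picked apart first.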

When this was discussed with the wider Zed development team, the "no good way to process these files" sentiment was echoed, and it was noted that other data warehouses make the same assumption: a single value must fit into memory to be processed. But we brainstormed some other ways to get at the data without taking the kinds of steps this user did. For example, tools like https://pypi.org/project/ijson/ and jq --stream can preprocess the JSON so it's picked apart into smaller pieces that fit into memory. The disadvantage is that you'd have to know what the picked-apart form looks like in order to query it meaningfully, and you might then need a way to reassemble the picked-apart format into something more readable if the query results must look like the original data for a downstream consumer.
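As a rough illustration of what such a picked-apart form looks like, here's a small stdlib-only Python sketch that yields (path, leaf) pairs in the spirit of jq --stream output. The `stream_events` helper and the sample document are hypothetical, not part of any real tool:

```python
def stream_events(value, path=()):
    """Yield (path, leaf) pairs, roughly like `jq --stream` events."""
    if isinstance(value, dict):
        for k, v in value.items():
            yield from stream_events(v, path + (k,))
    elif isinstance(value, list):
        for i, v in enumerate(value):
            yield from stream_events(v, path + (i,))
    else:
        yield path, value

# Tiny invented document echoing the MRF's general shape.
doc = {"version": "1.3.3", "in_network": [{"billing_code": "0001"}]}
for path, leaf in stream_events(doc):
    print(list(path), leaf)
# ['version'] 1.3.3
# ['in_network', 0, 'billing_code'] 0001
```

Querying that stream means matching on paths rather than field names, which is the "having to know how it looks picked-apart" cost mentioned above.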

Along those lines, it was noted that we could perhaps write a preprocessor that generates output resembling that of Zed's own flatten function. That would still mean knowing how to work with the flattened representation, but once you zero in on a manageable subset of the data that you want to turn back into the original format, the unflatten function could easily be applied to that subset so it can be extracted, queried further, etc. It was noted that @nwt has a JSON parser that might be adapted to cover this.
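To make the flatten/unflatten round trip concrete, here's a simplified Python sketch using dotted-path keys. Zed's actual flatten and unflatten operators work on Zed values and handle more cases (arrays among them); the `flatten`/`unflatten` helpers below are illustrative only:

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into dotted-path keys (arrays left as leaves)."""
    out = {}
    for k, v in record.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten(v, key + "."))
        else:
            out[key] = v
    return out

def unflatten(flat):
    """Rebuild the nested structure from dotted-path keys."""
    out = {}
    for key, v in flat.items():
        parts = key.split(".")
        node = out
        for p in parts[:-1]:
            node = node.setdefault(p, {})
        node[parts[-1]] = v
    return out

doc = {"meta": {"version": "1.3.3"}, "name": "example"}
flat = flatten(doc)
print(flat)                    # {'meta.version': '1.3.3', 'name': 'example'}
print(unflatten(flat) == doc)  # True
```

The appeal of this shape is that each flattened record is small and self-describing, so a stream of them can be filtered down to a manageable subset before anything is reassembled.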

philrz avatar Nov 01 '23 17:11 philrz

I didn't get anywhere close to implementing the equivalent of Zed's flatten with it, but I hacked a bit with ijson just to see what this particular data looks like "one level down". Running this Python code:

#!/usr/bin/env python3

import ijson
import json
import sys

FILENAME = '/Users/phil/Downloads/2023-09-15_Blue-Cross-and-Blue-Shield-of-Texas_Blue-Essentials_in-network-rates.json'

with open(FILENAME, 'rb') as f:
    # Stream the top-level object's key/value pairs without loading the
    # whole file into memory.
    members = ijson.kvitems(f, '', use_float=True)
    for k, v in members:
        # Re-emit each pair as its own single-key JSON object, one per line.
        obj = {k: v}
        print(json.dumps(obj))
        sys.stdout.flush()

The resulting output is 5.3 GB in size, with the first four lines looking like:

{"reporting_entity_name": "Blue Cross and Blue Shield of Texas"}
{"reporting_entity_type": "health insurance issuer"}
{"last_updated_on": "2023-09-15"}
{"version": "1.3.3"}

...and all the rest of the bulk on the remaining two lines. In other words, taming this data would indeed seem to need the full flatten (or similar) treatment, with multiple levels of unwrapping until the remaining values (even if they're large arrays, etc.) are hopefully small enough to work with in memory.
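To sketch what one more level of unwrapping might look like, here's a stdlib-only Python example that splits a large single-key line into one small record per array element. The line content is invented; this sketch still parses the whole line, whereas in practice you'd want to stream the array elements instead, e.g. with ijson at a deeper prefix such as 'in_network.item':

```python
import json

# Hypothetical single-key record like the large lines in the output above,
# where the value is a big array.
line = '{"in_network": [{"billing_code": "0001"}, {"billing_code": "0002"}]}'

record = json.loads(line)
(key, items), = record.items()

# Emit one small record per array element so each fits in memory.
for item in items:
    print(json.dumps({key: item}))
# {"in_network": {"billing_code": "0001"}}
# {"in_network": {"billing_code": "0002"}}
```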

philrz avatar Nov 01 '23 20:11 philrz