Add explode and / or dynamic model / schema
Follow-up to https://github.com/iterative/dvcx/pull/1368. Based also on this discussion / feedback by @tibor-mach: https://iterativeai.slack.com/archives/C04A9RWEZBN/p1727194987119179, and on the iteration on DCLM - https://github.com/iterative/studio/issues/10596
Summary
When we have a single file (JSONL, or CSV/Parquet with a column containing JSONs), we need a way to "explode" those JSONs/dicts into a Pythonic model and store them in DataChain not as a single column, but as multiple columns - one per path in the JSON/dict.
E.g. this is how a JSONL file looks after a naive parse:
Or from the CSV file (mind the `meta` column):
There is an obvious way to mitigate this - create a Model class and populate it in the UDF. But that seems very annoying and redundant - the model description becomes 2-3x the code of the parser.
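For illustration, the manual approach looks roughly like this (the `Meta` fields below are made up for the example): every field is declared once in a Pydantic model, and then the same structure is effectively repeated in the UDF that populates it.

```python
import json

from pydantic import BaseModel


# Hypothetical model mirroring the JSON metadata - every field has to be
# spelled out by hand, which is the redundancy described above.
class Meta(BaseModel):
    url: str
    language: str
    token_count: int


def parse_line(line: str) -> Meta:
    # The parser then repeats the same structure a second time.
    return Meta(**json.loads(line))


meta = parse_line('{"url": "http://a.com", "language": "en", "token_count": 42}')
```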
Suggestions
- [ ] `DataChain.explode(C("meta"))`. This one is more or less obvious and requires creating an extra table.
- [ ] Make functions like `map`, `gen` dynamically figure out the schema and create a Pydantic model as they parse files. This requires a more complicated implementation, but can be faster since it can work in streaming mode:
Imagine something like this:
```python
import io
import json
from typing import Iterator

import zstandard as zstd
from datachain import DataChain, File


def extract(file: File) -> Iterator[tuple[File, dict]]:
    with file.open() as f:
        dctx = zstd.ZstdDecompressor()
        stream_reader = dctx.stream_reader(f)
        text_stream = io.TextIOWrapper(stream_reader, encoding="utf-8")
        for line in text_stream:
            yield file, json.loads(line)


DataChain.from_dataset("index").settings(cache=True).limit(1).gen(extract).save("raw_text")
```
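One way the dynamic-schema part could be sketched (an illustration, not the actual implementation): peek at the first parsed row and generate a Pydantic model from it with `pydantic.create_model`, recursing into nested dicts so that every JSON path ends up as its own field (and eventually its own column). The `model_from_sample` helper below is hypothetical:

```python
import json
from typing import Any

from pydantic import BaseModel, create_model


def model_from_sample(name: str, sample: dict[str, Any]) -> type[BaseModel]:
    """Hypothetical helper: build a Pydantic model mirroring one sample row.

    Nested dicts become nested models, so each JSON path gets its own field.
    """
    fields: dict[str, Any] = {}
    for key, value in sample.items():
        if isinstance(value, dict):
            fields[key] = (model_from_sample(f"{name}_{key}", value), ...)
        else:
            fields[key] = (type(value), ...)
    return create_model(name, **fields)


line = '{"url": "http://a.com", "stats": {"tokens": 42, "lang": "en"}}'
Meta = model_from_sample("Meta", json.loads(line))
# The generated model can then validate every subsequent row in the stream.
row = Meta(**json.loads(line))
```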
This may already be possible with a combination of `read_meta` and `map`, depending on how we want to solve this (`read_meta` requires a `storage_path` with example data to create a schema, or a Pydantic model passed in).

How are we going to determine the schema? Is it based on a sample of rows (which is what `read_meta` does - it reads a single row) or on reading all the rows?
> How are we going to determine the schema? Is it based on a sample of rows (which is what `read_meta` does - it reads a single row) or on reading all the rows?

Yes, based on a sample (like we already do in `from_parquet` and friends).
> This may already be possible with a combination of `read_meta` and `map`, depending on how we want to solve this.

Yes, the idea is the same, but we need to wrap it into a user-friendly function and maybe generalize it a bit?
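A rough sketch of what such a user-friendly wrapper might look like (all names here are hypothetical, not existing DataChain API): infer the model from the first row - the same sample-based approach `from_parquet` uses - and then stream the remaining rows through it without buffering them.

```python
import json
from typing import Iterator

from pydantic import BaseModel, create_model


def dynamic_gen(lines: Iterator[str]) -> Iterator[BaseModel]:
    """Hypothetical wrapper: infer a schema from the first row,
    then validate the rest of the stream against the generated model."""
    it = iter(lines)
    first = json.loads(next(it))
    # Sample-based inference: field types come from the first row only.
    Model = create_model("Row", **{k: (type(v), ...) for k, v in first.items()})
    yield Model(**first)
    for line in it:
        yield Model(**json.loads(line))


rows = list(dynamic_gen(iter(['{"a": 1, "b": "x"}', '{"a": 2, "b": "y"}'])))
```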