Add explode and / or dynamic model / schema
Follow-up to https://github.com/iterative/dvcx/pull/1368. Based also on this discussion / feedback by @tibor-mach: https://iterativeai.slack.com/archives/C04A9RWEZBN/p1727194987119179, and on the iteration on DCLM - https://github.com/iterative/studio/issues/10596
Summary
When we have a single file (JSONL, or CSV/Parquet with a column containing JSONs), we need a way to "explode" those JSONs/dicts into a Pythonic model and store them in DataChain not as a single column, but as multiple columns - one per path in the JSON/dict.
E.g. this is how a JSONL file looks after a naive parse:
Or from the CSV file (mind the `meta` column):
There is an obvious way to mitigate this - create a Model class and populate it in the UDF. But that seems very annoying and redundant - the model description becomes 2-3x the code of the parser.
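For illustration, the manual approach looks roughly like this (the `Meta` fields below are made up for the example): every field is declared once in a Pydantic model, and then the same structure is effectively repeated in the UDF that populates it.

```python
import json

from pydantic import BaseModel


# Hypothetical model mirroring the JSON metadata - every field has to be
# spelled out by hand, which is the redundancy described above.
class Meta(BaseModel):
    url: str
    language: str
    token_count: int


def parse_line(line: str) -> Meta:
    # The parser then repeats the same structure a second time.
    return Meta(**json.loads(line))


meta = parse_line('{"url": "http://a.com", "language": "en", "token_count": 42}')
```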
Suggestions
- [ ] `DataChain.explode(C("meta"))`. This one is more or less obvious and requires creating an extra table.
- [ ] Make functions like `map`, `gen` dynamically figure out the schema and create a Pydantic model as they parse files. This requires a more complicated implementation, but can be faster since it can work in streaming mode:
Imagine something like this:
```python
import io
import json
from typing import Iterator

import zstandard as zstd
from datachain import DataChain, File


def extract(file: File) -> Iterator[tuple[File, dict]]:
    with file.open() as f:
        dctx = zstd.ZstdDecompressor()
        stream_reader = dctx.stream_reader(f)
        text_stream = io.TextIOWrapper(stream_reader, encoding="utf-8")
        for line in text_stream:
            yield file, json.loads(line)


DataChain.from_dataset("index").settings(cache=True).limit(1).gen(extract).save("raw_text")
```
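One way the dynamic-schema part could be sketched (an illustration, not the actual implementation): peek at the first parsed row and generate a Pydantic model from it with `pydantic.create_model`, recursing into nested dicts so that every JSON path ends up as its own field (and eventually its own column). The `model_from_sample` helper below is hypothetical:

```python
import json
from typing import Any

from pydantic import BaseModel, create_model


def model_from_sample(name: str, sample: dict[str, Any]) -> type[BaseModel]:
    """Hypothetical helper: build a Pydantic model mirroring one sample row.

    Nested dicts become nested models, so each JSON path gets its own field.
    """
    fields: dict[str, Any] = {}
    for key, value in sample.items():
        if isinstance(value, dict):
            fields[key] = (model_from_sample(f"{name}_{key}", value), ...)
        else:
            fields[key] = (type(value), ...)
    return create_model(name, **fields)


line = '{"url": "http://a.com", "stats": {"tokens": 42, "lang": "en"}}'
Meta = model_from_sample("Meta", json.loads(line))
# The generated model can then validate every subsequent row in the stream.
row = Meta(**json.loads(line))
```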
This may already be possible with a combination of `read_meta` and `map`, depending on how we want to solve this (`read_meta` requires a `storage_path` with example data to create a schema, or a Pydantic model passed in).

How are we going to determine the schema? Is it based on a sample of rows (which is what `read_meta` does - it reads a single row) or on reading all the rows?
> How are we going to determine the schema? Is it based on a sample of rows (which is what `read_meta` does - it reads a single row) or on reading all the rows?

Yes, based on a sample (like we already do in `from_parquet` and friends).
> This may already be possible with a combination of `read_meta` and `map`, depending on how we want to solve this.

Yes, the idea is the same, but we need to wrap it into a user-friendly function and maybe generalize it a bit?
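A rough sketch of what such a user-friendly wrapper might look like (all names here are hypothetical, not existing DataChain API): infer the model from the first row - the same sample-based approach `from_parquet` uses - and then stream the remaining rows through it without buffering them.

```python
import json
from typing import Iterator

from pydantic import BaseModel, create_model


def dynamic_gen(lines: Iterator[str]) -> Iterator[BaseModel]:
    """Hypothetical wrapper: infer a schema from the first row,
    then validate the rest of the stream against the generated model."""
    it = iter(lines)
    first = json.loads(next(it))
    # Sample-based inference: field types come from the first row only.
    Model = create_model("Row", **{k: (type(v), ...) for k, v in first.items()})
    yield Model(**first)
    for line in it:
        yield Model(**json.loads(line))


rows = list(dynamic_gen(iter(['{"a": 1, "b": "x"}', '{"a": 2, "b": "y"}'])))
```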