dsq --schema missing array in 11GB file
Describe the bug and expected behavior
In my testing with large datasets, at least one array of objects is not reported by --schema; the array begins on line 1,326,612,715 of the 1,495,055,188-line, 11GB file.
Is it possible that --schema only reviews the first X lines or bytes of a file? If so, is there any way I can override that?
Reproduction steps
With an 11GB (or larger) file:
dsq --schema --pretty LARGE_FILE.json
Versions
- OS: Ubuntu 22.04 LTS, AMD EPYC 7R32
- Shell: bash
- dsq version: 0.20.2 (installed from apt)
Hey! Thanks for the report. Yes, datastation/dsq does sampling to keep performance reasonable. It might make sense to sample more of a larger file, but then performance gets much worse. Overall I don't yet have a great strategy for dealing with very large files.
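To make the trade-off concrete, here is a minimal, illustrative sketch (not dsq's actual implementation) of why inferring a schema from only the first N rows misses fields that first appear later in the file:

```python
def infer_schema_first_n(rows, n=1000):
    """Toy schema inference: union of keys seen in the first n rows only.
    Illustrative sketch; not dsq's actual sampling logic."""
    schema = {}
    for i, row in enumerate(rows):
        if i >= n:
            break  # everything past the sample window is never inspected
        for key, value in row.items():
            schema.setdefault(key, type(value).__name__)
    return schema

# A field that first appears after the sample window is invisible:
rows = [{"id": i} for i in range(2000)]
rows.append({"id": 2000, "late_array": [1, 2, 3]})
print(infer_schema_first_n(rows, n=1000))  # {'id': 'int'} -- 'late_array' missed
```

Raising n makes the miss less likely but means scanning (and parsing) more of an 11GB file, which is where the performance cost comes from.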
Before I discovered Datastation, the way I had imagined building my own tool was to stream-read the file and, on encountering an array, read only the first 3 of the array's children into memory, counting but discarding every other object in the array until I capture the last 3.
The flaw in my plan was that if an array child didn't conform to the structure of the first and last 3 in the array, my report would not include it in the schema -- but this approach would still have found the schema element that datastation/dsq is missing.
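The head/tail strategy described above can be sketched in a few lines; this is a hypothetical illustration (the function name and parameters are mine, not part of dsq):

```python
from collections import deque

def sample_array(items, k=3):
    """Keep the first k and last k elements of an arbitrarily long array,
    counting (but discarding) everything in between. Hypothetical sketch of
    the head/tail sampling strategy; not part of dsq."""
    head, tail = [], deque(maxlen=k)
    skipped = 0
    for item in items:
        if len(head) < k:
            head.append(item)
        else:
            if len(tail) == k:
                skipped += 1  # the oldest tail element falls out of the window
            tail.append(item)
    return head, list(tail), skipped

head, tail, skipped = sample_array(range(10), k=3)
print(head, tail, skipped)  # [0, 1, 2] [7, 8, 9] 4
```

Because only 2k elements are ever held in memory, this streams through arrays of any length, at the cost of never inspecting the middle.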
Perhaps a hybrid of your approach and mine, activated by an --array_depth=3 argument?