dsq --schema missing array in 11GB file
Describe the bug and expected behavior
In my testing with large datasets, at least one array of objects is not reported by --schema; the array begins on line 1,326,612,715 of the 1,495,055,188-line, 11GB file.
Is it possible that --schema only reviews the first X lines or bytes of a file? If so, is there any way I can override that?
Reproduction steps
With an 11GB (or larger) file:
dsq --schema --pretty LARGE_FILE.json
Versions
- OS: Ubuntu 22.04 LTS, AMD EPYC 7R32
- Shell: bash
- dsq version: 0.20.2 (installed from apt)
Hey! Thanks for the report. Yes, datastation/dsq does sampling to keep performance reasonable. It might make sense to sample more of a larger file, but then performance gets much worse. Overall I don't yet have a great strategy for dealing with very large files.
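To make the trade-off concrete, here is a minimal, illustrative sketch (not dsq's actual implementation) of why inferring a schema from only the first N rows misses fields that first appear later in the file:

```python
def infer_schema_first_n(rows, n=1000):
    """Toy schema inference: union of keys seen in the first n rows only.
    Illustrative sketch; not dsq's actual sampling logic."""
    schema = {}
    for i, row in enumerate(rows):
        if i >= n:
            break  # everything past the sample window is never inspected
        for key, value in row.items():
            schema.setdefault(key, type(value).__name__)
    return schema

# A field that first appears after the sample window is invisible:
rows = [{"id": i} for i in range(2000)]
rows.append({"id": 2000, "late_array": [1, 2, 3]})
print(infer_schema_first_n(rows, n=1000))  # {'id': 'int'} -- 'late_array' missed
```

Raising n makes the miss less likely but means scanning (and parsing) more of an 11GB file, which is where the performance cost comes from.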
Before I discovered Datastation, the way I had imagined building my own tool was to stream-read the file and, on encountering an array, read only the first 3 of the array's children into memory, counting but discarding every other object in the array until I capture the last 3.
The flaw in my plan was that if an array child didn't conform to the structure of the first and last 3 in the array, my report would not include it in the schema -- but this approach would still have found the schema element that datastation/dsq is missing.
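The head/tail strategy described above can be sketched in a few lines; this is a hypothetical illustration (the function name and parameters are mine, not part of dsq):

```python
from collections import deque

def sample_array(items, k=3):
    """Keep the first k and last k elements of an arbitrarily long array,
    counting (but discarding) everything in between. Hypothetical sketch of
    the head/tail sampling strategy; not part of dsq."""
    head, tail = [], deque(maxlen=k)
    skipped = 0
    for item in items:
        if len(head) < k:
            head.append(item)
        else:
            if len(tail) == k:
                skipped += 1  # the oldest tail element falls out of the window
            tail.append(item)
    return head, list(tail), skipped

head, tail, skipped = sample_array(range(10), k=3)
print(head, tail, skipped)  # [0, 1, 2] [7, 8, 9] 4
```

Because only 2k elements are ever held in memory, this streams through arrays of any length, at the cost of never inspecting the middle.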
Perhaps a hybrid of your approach and mine, activated by an --array_depth=3 argument?