csv-validator
validate incoming stdin
Hi,
it would be great to be able to do zcat myfile.csv.gz | validate myschema.csvs. It should be easy given that you already handle streams. I have tried validate $(zcat myfile.csv.gz) myschema.csvs, but it runs out of memory during the decompression.
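To spell that out (validate here stands for the csv-validator CLI as we invoke it, and the file names are just examples):

    # Desired: stream the decompressed rows straight into the validator,
    # so only a buffer's worth of data is ever held in memory at once.
    zcat myfile.csv.gz | validate myschema.csvs

    # What I tried: command substitution expands the entire decompressed
    # file into the argument list before validate even starts, which is
    # what exhausts memory (and would hit the shell's argument-length
    # limit long before 300 GB anyway).
    validate $(zcat myfile.csv.gz) myschema.csvs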
At the moment we don't support reading from stdin, but it is an interesting idea. May I ask how large your CSV file is compressed, and how large it is uncompressed?
The largest file is 300 GB compressed, and roughly 4x that size uncompressed.
@c3iot-santilh Wow, that is quite something! Is this a file that you might be able to share with me for testing purposes?
Unfortunately, I cannot. But I can share a sufficiently large fake CSV file if required. Would you use an approach like this (named pipes)?
@c3iot-santilh No. Named pipes are for IPC between two processes that know the name of the pipe. In this case we would simply read from stdin, which in Java can be wrapped in a Reader.
Do you have a script for generating such a fake CSV file that you could share?
Would this work? http://ask.metafilter.com/227734/Giant-CSV-Files-Needed
@c3iot-santilh I can generate something. I just wanted something representative of what you were using if you had a script for such a thing. No worries otherwise.
We will add this to our prioritisation backlog, and we would be happy to receive any pull requests around this feature.
Implementing it will likely have knock-on effects on progress reporting and on memory use (as you have mentioned above).
In addition to the decompression use case, it's worth noting that validation against a schema may be only one part of a more complex pipeline. I can't share a sample either, but here we have processing happening both before and after our call to validate, and all the other tools can be placed on either side of a pipe. We are forced to write temporary files just so that we can call validate on them.
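To sketch the shape of the pipeline (preprocess, postprocess and the file names are placeholders rather than our real tools):

    # Today: a temporary file exists purely so that validate has a path to open.
    preprocess < input.dat > /tmp/intermediate.csv
    validate /tmp/intermediate.csv myschema.csvs
    postprocess < /tmp/intermediate.csv > output.dat
    rm /tmp/intermediate.csv

    # With stdin support, the validation stage could at least be fed directly
    # from the previous stage (whether validate should also echo the rows on
    # to the next stage is a separate question):
    preprocess < input.dat | validate myschema.csvs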