csv-validator
validate incoming stdin
Hi,
it would be great to be able to do zcat myfile.csv.gz | validate myschema.csvs. It should be easy given that you already handle streams. I have tried validate $(zcat myfile.csv.gz) myschema.csvs, but it runs out of memory during the decompression.
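To spell that out (validate here stands for the csv-validator CLI as we invoke it, and the file names are just examples):

    # Desired: stream the decompressed rows straight into the validator,
    # so only a buffer's worth of data is ever held in memory at once.
    zcat myfile.csv.gz | validate myschema.csvs

    # What I tried: command substitution expands the entire decompressed
    # file into the argument list before validate even starts, which is
    # what exhausts memory (and would hit the shell's argument-length
    # limit long before 300 GB anyway).
    validate $(zcat myfile.csv.gz) myschema.csvs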
At the moment we don't support reading from stdin, but it is an interesting idea. May I ask how large your CSV file is compressed, and how large it is uncompressed?
The largest file is 300 GB compressed, and roughly 4x that size uncompressed.
@c3iot-santilh Wow, that is quite something! Is this a file that you might be able to share with me for testing purposes?
Unfortunately, I cannot. But I can share a sufficiently large fake CSV file if required. Would you use an approach like this (named pipes)?
@c3iot-santilh No. Named pipes are for IPC between two processes that know the name of the pipe. In this case we would simply read from stdin, which in Java can be wrapped in a Reader.
Do you have a script for generating such a fake CSV file that you could share?
Would this work? http://ask.metafilter.com/227734/Giant-CSV-Files-Needed
@c3iot-santilh I can generate something. I just wanted something representative of what you were using if you had a script for such a thing. No worries otherwise.
We will add this to our prioritisation backlog, and we would be happy to receive any pull requests around this feature.
Implementing it will likely have knock-on effects on progress reporting and on memory use (as you have mentioned above).
In addition to the decompression use case, it's worth noting that validation against a schema may be only one part of a more complex pipeline. I can't share a sample either, but here we have processing happening both before and after our call to validate, and all the other tools can be placed on either side of a pipe. We are forced to write temporary files just so that we can call validate on them.
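To sketch the shape of the pipeline (preprocess, postprocess and the file names are placeholders rather than our real tools):

    # Today: a temporary file exists purely so that validate has a path to open.
    preprocess < input.dat > /tmp/intermediate.csv
    validate /tmp/intermediate.csv myschema.csvs
    postprocess < /tmp/intermediate.csv > output.dat
    rm /tmp/intermediate.csv

    # With stdin support, the validation stage could at least be fed directly
    # from the previous stage (whether validate should also echo the rows on
    # to the next stage is a separate question):
    preprocess < input.dat | validate myschema.csvs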