
validate incoming stdin

Open c3iot-santilh opened this issue 8 years ago • 9 comments

Hi,

it would be great to be able to do zcat myfile.csv.gz | validate myschema.csvs. It should be easy to do given that you already handle streams. I have tried validate $(zcat myfile.csv.gz) myschema.csvs but it runs out of memory on the decompression.

c3iot-santilh avatar Sep 08 '17 12:09 c3iot-santilh

At the moment we don't support reading from stdin, but it is an interesting idea. May I ask how large your CSV file is compressed, and how large it is uncompressed?

adamretter avatar Sep 08 '17 13:09 adamretter

The largest file is 300 GB compressed, and about 4x that uncompressed.

c3iot-santilh avatar Sep 08 '17 13:09 c3iot-santilh

@c3iot-santilh Wow that is quite something! Is this a file that you might be able to share with me for testing purposes?

adamretter avatar Sep 08 '17 13:09 adamretter

Unfortunately, I cannot. But I can share a sufficiently-large fake csv file if required. Would you use an approach like this?

c3iot-santilh avatar Sep 08 '17 13:09 c3iot-santilh

@c3iot-santilh No. Named pipes are for IPC between two processes which both know the name of the pipe. In this case we would simply read from stdin, which in Java is available as System.in, an InputStream that can be wrapped in a Reader.
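As a rough illustration of reading piped input in Java, here is a minimal sketch that wraps System.in in a Reader and streams through it line by line. It is not csv-validator code: the class name StdinLineCounter, the countLines helper, and the UTF-8 encoding are all assumptions made for the example, and it only counts lines rather than validating anything.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; not part of csv-validator.
final class StdinLineCounter {

    // Streams through the Reader and counts lines.
    // Only the current line is held in memory, so the input size is not a limit.
    static long countLines(Reader source) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        long lines = 0;
        while (reader.readLine() != null) {
            lines++;
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // System.in is a byte-oriented InputStream; InputStreamReader decodes it to chars.
        // UTF-8 is an assumption here; real input may use another encoding.
        Reader stdin = new InputStreamReader(System.in, StandardCharsets.UTF_8);
        System.out.println(countLines(stdin));
    }
}
```

With a recent JDK this could be tried as zcat myfile.csv.gz | java StdinLineCounter.java, which decompresses in zcat and never materialises the whole file in the Java process.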

Do you have a script for generating such a fake CSV file that you could share?

adamretter avatar Sep 08 '17 13:09 adamretter

Would this work? http://ask.metafilter.com/227734/Giant-CSV-Files-Needed

c3iot-santilh avatar Sep 08 '17 13:09 c3iot-santilh

@c3iot-santilh I can generate something. I just wanted something representative of what you were using if you had a script for such a thing. No worries otherwise.

adamretter avatar Sep 08 '17 14:09 adamretter

We will add this to our prioritisation backlog. We would be happy to receive any pull requests around this feature.

Implementing this will likely affect how we provide progress updates and how we manage memory (as you mentioned above).

paulyoung84 avatar Nov 13 '17 11:11 paulyoung84

In addition to the decompression use case, it’s worth noting that validation against a schema may be only one part of a more complex pipeline. I can’t share any sample either, but here we have processing happening before and after our call to validate, and all the other tools can be put on either side of a pipe. We are forced to write temporary files just so that we can call validate on them.
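For reference, the temporary-file workaround described above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: StdinSpooler and spoolToTempFile are hypothetical names, the "validate-input-" prefix is arbitrary, and the sketch does not call the validator itself. It just copies stdin to a temporary file and prints the path so that a later step can pass it to validate.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper; not part of csv-validator.
final class StdinSpooler {

    // Copies the whole stream to a new temporary file and returns its path.
    // The copy is streamed, so memory use stays small, but the full
    // uncompressed input must fit on disk. The caller should delete the file.
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("validate-input-", ".csv");
        // createTempFile already created an empty file, so it must be replaced.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = spoolToTempFile(System.in);
        // Print the path so the next pipeline step can hand it to validate.
        System.out.println(tmp);
    }
}
```

The cost of this approach is the extra disk space and I/O for the intermediate file, which is what reading stdin directly would avoid.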

afranke avatar Jan 28 '20 14:01 afranke