data-validation
Newline in CSV quoted string breaks reader
Hi, it looks like the current CSV reader does not support a quoted string value that spans several lines (that is, one containing line breaks). A logical CSV row may therefore span several physical lines, which is valid CSV.
It looks like some of this is indicated here? https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/coders/csv_decoder.py#L150
And some background here:
https://stackoverflow.com/questions/18724903/csvs-in-python-with-newline-in-quotes
A question: there's probably a reason for it, but why not use an actual CSV reader? Edit: I'm assuming it's because streaming, Beam, etc. want unit = line, which makes parallelism possible.
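For anyone landing here, a minimal sketch of the difference, using only Python's standard `csv` module rather than TFDV's decoder. The sample data and variable names are made up for illustration. Parsing the stream as a whole keeps the quoted field intact, while parsing each physical line on its own cuts the record in two.

```python
import csv
import io

# Hypothetical sample: the second record has a line break inside a quoted field.
data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Whole-stream parsing: the csv module tracks quote state across line breaks.
logical_rows = list(csv.reader(io.StringIO(data, newline="")))

# Line-by-line parsing: each physical line is treated as a complete record,
# so the quoted field is split into two fragments.
physical_rows = [next(csv.reader([line])) for line in data.splitlines()]

print(len(logical_rows))   # 3 records: header + 2 data rows
print(len(physical_rows))  # 4 "records": one per physical line
```

The second approach is only an approximation of what a line-based reader does. It is not a copy of the TFDV code path.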
@jondot The issue is that Beam doesn't natively support reading from CSV data. So we currently get around this by reading line-by-line and parsing each line as a CSV record.
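As a possible workaround outside of Beam, here is a rough sketch that stitches physical lines back into logical records before parsing. It assumes the lines arrive in order from a single sequential stream (which a parallel line-based read does not guarantee), and that quotes are escaped by doubling (`""`), not with a backslash. The helper name `merge_physical_lines` is hypothetical.

```python
import csv
import io


def merge_physical_lines(lines, quotechar='"'):
    """Yield logical CSV records from an ordered iterable of physical lines."""
    pending = []
    open_quote = False
    for line in lines:
        pending.append(line)
        # An odd number of quote characters flips the in-quote state.
        # Doubled quotes ("") come in pairs, so they leave parity unchanged.
        if line.count(quotechar) % 2 == 1:
            open_quote = not open_quote
        if not open_quote:
            yield "\n".join(pending)
            pending = []
    if pending:
        # Unterminated quote at end of input: emit whatever is left.
        yield "\n".join(pending)


data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'
rows = [
    next(csv.reader(io.StringIO(record, newline="")))
    for record in merge_physical_lines(data.splitlines())
]
print(rows)
```

This normalizes embedded line breaks to `\n` and does nothing for inputs split across workers, so it is a stopgap for preprocessing a file, not a design proposal for the decoder.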
@aaltay @chamikaramj @katsiapis
@jondot, since the issue you raised depends on functionality that Beam currently doesn't support, please confirm whether we can close this issue, or whether you want it implemented as a feature once Beam supports it. Thanks.
Yup, understood. I believe that since compliant CSV can include multiline quoted fields, TFDV should somehow support that. But of course it's up to you.
Let's keep it open so that users are aware that this issue exists and don't end up creating new issues.
Thanks guys. If you know how you want it implemented in terms of design and standards, I'll be happy to put in the time to implement it. That said, I stay away from CLAs and that kind of thing, so I'll bow out if you've got CLAs.