
Newline in CSV quoted string breaks reader

Open jondot opened this issue 6 years ago • 5 comments

Hi, it looks like the current CSV reader does not support quoted string values that span several lines (that is, values containing line breaks). A logical CSV row may therefore span several physical lines, which is valid CSV.
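To make the case concrete, here is a minimal sketch using Python's standard `csv` module (not TFDV code; the sample data is made up for illustration). It shows one logical row occupying two physical lines:

```python
import csv
import io

# Hypothetical sample: a header plus two data rows. The "comment" field of
# the first data row is quoted and contains a line break.
data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# csv.reader pulls in as many physical lines as it needs per logical row.
rows = list(csv.reader(io.StringIO(data, newline="")))

# Splitting on line breaks sees one more "line" than there are rows.
physical_lines = data.splitlines()

print(len(rows), len(physical_lines))
```

The reader returns three rows, while the same text splits into four physical lines.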

Looks like some of this is indicated here? https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/coders/csv_decoder.py#L150

And some background here:

https://stackoverflow.com/questions/18724903/csvs-in-python-with-newline-in-quotes

A question: there's probably a reason for it, but why not use an actual CSV reader? Edit: I'm assuming it's because streaming, Beam, etc. want unit = line, which makes parallelism possible.

jondot avatar Jul 08 '19 11:07 jondot

@jondot The issue is that Beam doesn't natively support reading from CSV data. So we currently get around this by reading line-by-line and parsing each line as a CSV record.
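As a rough illustration of why the line-by-line approach breaks on such input, the sketch below parses each physical line as its own CSV record using the standard `csv` module. This is an assumption about the general shape of the approach, not TFDV's actual decoder, and the sample lines are made up:

```python
import csv

# Hypothetical input as a line-based reader would deliver it: the quoted
# field of the first data row has been split across two physical lines.
lines = ['id,comment', '1,"first line', 'second line"', '2,plain']

# Parse every physical line independently, one record per line.
naive_rows = [next(csv.reader([line])) for line in lines]

for row in naive_rows:
    print(row)
```

Each physical line becomes its own record, so the multi-line field is cut in two and the output has four records instead of three.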

@aaltay @chamikaramj @katsiapis

paulgc avatar Jul 08 '19 18:07 paulgc

@jondot, since the issue you raised depends on functionality that Beam currently doesn't support, please confirm whether we can close this issue, or whether you would like it implemented as a feature once Beam supports it. Thanks.

rmothukuru avatar Jul 16 '19 10:07 rmothukuru

Yup, understood. I believe that since compliant CSV can include multiline quoted fields, TFDV should somehow support that. But of course it's up to you.
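One possible direction, sketched here purely as an illustration and not as a proposed TFDV design: regroup physical lines into logical records by tracking quote parity before handing them to a CSV parser. The helper `merge_physical_lines` is hypothetical, and it assumes `"` is the quote character, that embedded quotes are escaped by doubling, and that lines arrive in order (which gives up per-line parallelism):

```python
import csv

def merge_physical_lines(lines, quotechar='"'):
    """Yield logical CSV records by joining lines until quotes balance."""
    buffer = []
    inside_quotes = False
    for line in lines:
        buffer.append(line)
        # An odd number of quote characters flips the inside-quotes state;
        # doubled (escaped) quotes come in pairs and leave it unchanged.
        if line.count(quotechar) % 2 == 1:
            inside_quotes = not inside_quotes
        if not inside_quotes:
            yield "\n".join(buffer)
            buffer = []
    if buffer:
        # Unterminated quoted field at end of input: emit what was collected.
        yield "\n".join(buffer)

lines = ['id,comment', '1,"first line', 'second line"', '2,plain']
records = list(merge_physical_lines(lines))
parsed = [next(csv.reader([record])) for record in records]
```

With the sample above, the two middle lines are rejoined into one record before parsing, so the multi-line field survives intact.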

jondot avatar Jul 16 '19 15:07 jondot

Let's keep it open so that users are aware that this issue exists and don't end up creating new issues.

paulgc avatar Jul 16 '19 19:07 paulgc

Thanks guys. If you know how you want it implemented in terms of design and standards, I'll be happy to put in the time to implement it. That said, I stay away from CLAs and that kind of thing, so I'll bow out if you require a CLA.

jondot avatar Jul 16 '19 19:07 jondot