PapaParse

Space in front of quotation doesn't recognise a field with multiple entries separated by common delimiter

Open Roberto-Circit opened this issue 2 years ago • 5 comments

Using papaparse 5.3.0 version.

If a quoted field that contains the delimiter (for example an address with several comma-separated parts) has a space in front of its opening quotation mark, it is not parsed correctly. With

random name, 12345, "address line 1, address line 2, code"

results.data has an item with 5 entries, i.e. the row is picked up as 5 separate fields, while with

random name, 12345,"address line 1, address line 2, code"

results.data has an item with 3 entries, as expected.

This issue only occurs with double-quoted fields that have multiple entries in them. Let me know if this isn't enough to go on or if this has been fixed in the new 6.0 version.

Roberto-Circit, Jul 14 '22 08:07

I'm not sure I understand. Which separator and quote char are you using? What are the results for such fields?

pokoli, Jul 14 '22 11:07

Can confirm this, ran into the exact same problem.

Example (first row has no problems, second row will be parsed as 4 columns):

"Cobra MK II","some wear, docking computer installed", 80 megacredits
"Milennium Falcon", "surface rust, light beam damage", 150 megacredits

The quotation character used is the default (double quotation mark ") and the separator is comma, but the problem does occur with other delimiters as well.

If the field starts with a space (before the quotation character), the quotation character is not recognized, and any delimiter characters inside the quoted value will be interpreted as delimiters. (The parsing code probably assumes that fields separated by delimiters will not have any preceding (or trailing) whitespace, and assumes that if the quote character doesn't immediately follow the delimiter, the field is not quoted).
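The guess above can be made concrete with a small model. This is a minimal sketch, not PapaParse's actual parsing code: `splitRowStrict` is a hypothetical function that treats a field as quoted only when the quote character is the very first character after the delimiter, which is the assumption described in this comment.

```javascript
// Simplified model of a CSV row splitter (hypothetical, NOT PapaParse's code).
// A field counts as quoted only if the quote character is its first character.
function splitRowStrict(row, delimiter = ",", quoteChar = '"') {
  const fields = [];
  let i = 0;
  while (i <= row.length) {
    if (row[i] === quoteChar) {
      // Quoted field: read up to the closing quote; "" is an escaped quote.
      let value = "";
      i += 1;
      while (i < row.length) {
        if (row[i] === quoteChar) {
          if (row[i + 1] === quoteChar) {
            value += quoteChar;
            i += 2;
            continue;
          }
          break;
        }
        value += row[i];
        i += 1;
      }
      i += 1; // skip the closing quote
      fields.push(value);
      // Simplification: ignore anything between the closing quote and the next delimiter.
      const next = row.indexOf(delimiter, i);
      if (next === -1) break;
      i = next + delimiter.length;
    } else {
      // Unquoted field: everything up to the next delimiter, quotes included.
      const next = row.indexOf(delimiter, i);
      if (next === -1) {
        fields.push(row.slice(i));
        break;
      }
      fields.push(row.slice(i, next));
      i = next + delimiter.length;
    }
  }
  return fields;
}

const withoutSpace = splitRowStrict('random name, 12345,"address line 1, address line 2, code"');
const withSpace = splitRowStrict('random name, 12345, "address line 1, address line 2, code"');
console.log(withoutSpace.length, withSpace.length);
```

Under this model the row without the space yields 3 fields and the row with the space yields 5, which matches the counts reported in the original post.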

CSV files with optional whitespace around delimiter characters and/or data values do exist. This is especially common in hand-edited CSV files and in CSV files where columns are aligned with spaces for readability (in addition to using a delimiter character).

Using the transform configuration option to remove starting and trailing whitespace doesn't help, as the transform is run after the quotes are processed.
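To illustrate why trimming in `transform` comes too late, here is a small sketch. The `config` object is what would be passed as the second argument to `Papa.parse`; the `reportedFields` array is an assumption about what the four columns of the second example row look like (the thread only states the column count), and the snippet applies the transform by hand instead of calling the library.

```javascript
// Options that would be passed as Papa.parse(csvText, config).
const config = {
  // `transform` is applied to each field value after the row has been split.
  transform: (value) => value.trim(),
};

// ASSUMED field values for the second example row (4 columns, split inside the quotes).
const reportedFields = [
  "Milennium Falcon",
  ' "surface rust',
  ' light beam damage"',
  " 150 megacredits",
];

// Trimming each value cleans up the whitespace but cannot re-join the split field.
const trimmed = reportedFields.map((value) => config.transform(value));
console.log(trimmed.length, trimmed);
```

The whitespace is gone afterwards, but there are still four values, because the split at the comma inside the quotes has already happened.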

Setting the field delimiter to ", " (comma followed by a space) breaks in cases where there is no space after the comma (or multiple spaces, if someone tried to manually align columns in addition to using a delimiter character).

Maybe add an option to automatically remove whitespace around unquoted values and outside quoted string values. Using that option would fix this problem, and trimming is in any case something that needs to be done for CSV files where optional whitespace is present (using a transform function or when processing the data later). For example, the third column in the first row of my example is " 80 megacredits" when parsed, but the user probably wanted "80 megacredits" (with whitespace trimmed). Some CSV files might rely on storing whitespace around unquoted values, which is why this should probably be an option, but it could be on by default, as that seems to be the most common use case. (Whitespace inside quoted strings should of course always be preserved.)
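Until such an option exists, the same effect could be approximated by cleaning each line before parsing. The following is a rough sketch under stated assumptions: `trimOutsideQuotes` is a hypothetical helper, not part of PapaParse, and it works on one line at a time, so quoted fields that contain line breaks are not handled. A stray quote character in the middle of an unquoted value would also confuse it.

```javascript
// Hypothetical pre-processing helper (not part of PapaParse).
// Trims whitespace outside quoted values on a single CSV line and
// leaves everything inside the quotes untouched.
function trimOutsideQuotes(line, delimiter = ",", quoteChar = '"') {
  const fields = [];
  let field = ""; // raw text of the current field, quotes included
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (ch === quoteChar) {
      // An escaped quote ("") toggles twice, so the state stays correct.
      inQuotes = !inQuotes;
      field += ch;
    } else if (!inQuotes && line.startsWith(delimiter, i)) {
      // Delimiter outside quotes: close the field and trim its outer whitespace.
      fields.push(field.trim());
      field = "";
      i += delimiter.length - 1;
    } else {
      field += ch;
    }
  }
  fields.push(field.trim());
  return fields.join(delimiter);
}

const cleaned = trimOutsideQuotes(
  '"Milennium Falcon", "surface rust, light beam damage", 150 megacredits'
);
console.log(cleaned);
```

The cleaned lines could then be joined with newlines and handed to `Papa.parse`, so that every opening quote directly follows a delimiter.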

fractalpixel, Aug 08 '22 12:08

I'm not sure we should implement something to fix hand-edited files. If someone edits a file and breaks the format, it is normal that it is not parsed correctly.

pokoli, Aug 08 '22 13:08

Maybe add an option to automatically remove whitespace around unquoted values and outside quoted string values. Using that option would fix this problem, and trimming is in any case something that needs to be done for CSV files where optional whitespace is present (using a transform function or when processing the data later). For example, the third column in the first row of my example is " 80 megacredits" when parsed, but the user probably wanted "80 megacredits" (with whitespace trimmed). Some CSV files might rely on storing whitespace around unquoted values, which is why this should probably be an option, but it could be on by default, as that seems to be the most common use case. (Whitespace inside quoted strings should of course always be preserved.)

Agreed, having this option would be nice

Roberto-Circit, Aug 15 '22 11:08

Also running into this issue. It would be good if the library could successfully parse CSV files with spaces after commas, as alternatives do.

HarryPeach, Mar 24 '23 15:03