CSV parsing does not take into account non raw_strings (it fails if ',', '[' or ']' are inside quotes)

Open sjanel opened this issue 1 year ago • 10 comments

Hi !

I wanted to use glaze to parse the currencies from this CSV with this structure:

#include <string>
#include <vector>

struct CurrencyCSV {
  std::vector<std::string> Entity;
  std::vector<std::string> Currency;
  std::vector<std::string> AlphabeticCode;
  std::vector<std::string> NumericCode;
  std::vector<std::string> MinorUnit;
  std::vector<std::string> WithdrawalDate;
};

However, the parser still interprets ',' (and '[' and ']') when they appear inside a quoted value. For instance, these lines fail:

"MOLDOVA, REPUBLIC OF",Russian Ruble,RUR,810,,1993-12
"FALKLAND ISLANDS (THE) [MALVINAS]",Falkland Islands Pound,FKP,238,2,

I think that this behavior could be expected if .raw_string is true, but it should work if we set .raw_string to false.
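
To make the expected behavior concrete, here is a minimal sketch of quote-aware field splitting in the spirit of RFC 4180. It is not Glaze code: split_csv_record is a hypothetical helper written only for illustration, and it assumes a record never spans multiple lines.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Glaze): split one CSV record into fields,
// treating everything between double quotes as plain data.
std::vector<std::string> split_csv_record(std::string_view line) {
  std::vector<std::string> fields;
  std::string current;
  bool in_quotes = false;
  for (std::size_t i = 0; i < line.size(); ++i) {
    const char c = line[i];
    if (in_quotes) {
      if (c == '"') {
        if (i + 1 < line.size() && line[i + 1] == '"') {
          current.push_back('"');  // "" inside quotes is an escaped quote
          ++i;
        } else {
          in_quotes = false;  // closing quote
        }
      } else {
        current.push_back(c);  // ',', '[' and ']' are ordinary characters here
      }
    } else if (c == '"') {
      in_quotes = true;  // opening quote
    } else if (c == ',') {
      fields.push_back(std::move(current));  // separator outside quotes
      current.clear();
    } else {
      current.push_back(c);
    }
  }
  fields.push_back(std::move(current));  // last field, possibly empty
  return fields;
}
```

With this logic, both lines above split into six fields, the first being MOLDOVA, REPUBLIC OF and FALKLAND ISLANDS (THE) [MALVINAS] respectively, with the empty MinorUnit or WithdrawalDate preserved as an empty string.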

WDYT ?

If I have the time, I will try to make a PR.

sjanel avatar Nov 20 '24 22:11 sjanel

Actively working on this issue here: #1446

stephenberry avatar Nov 21 '24 15:11 stephenberry

I think the string parsing has been fixed. It also supports escaped quotes per the CSV specification, which were needed to parse this document. However, there is a bug where it parses the WithdrawalDate into the MinorUnit. I have to work on some other stuff at the moment. If you want to try to fix this from the csv_currency branch, that would be great! Otherwise, I'll get back to this when I'm able.

stephenberry avatar Nov 21 '24 15:11 stephenberry

I think the string parsing has been fixed. It also supports escaped quotes per the CSV specification, which were needed to parse this document. However, there is a bug where it parses the WithdrawalDate into the MinorUnit. I have to work on some other stuff at the moment. If you want to try to fix this from the csv_currency branch, that would be great! Otherwise, I'll get back to this when I'm able.

I think I found the bug and fixed it in this commit. Feel free to take it into your branch. The cause is that we skip the trailing ',' twice: once in the from<CSV code for string_t and once in the main loop at line 616. Parsing for the other types does not skip the comma, so I removed the skip from the string_t parser, and it seems to work.
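
For anyone following along, the effect of the double skip can be reproduced with a small standalone sketch. This is not the Glaze source: parse_string_field, parse_row and the skip_trailing_comma flag are hypothetical names, and quoting is ignored to keep the example short.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical field parser: reads up to the next ',' and, when the flag is
// set, also consumes that ',' (mimicking the extra skip described above).
std::string parse_string_field(std::string_view line, std::size_t& pos,
                               bool skip_trailing_comma) {
  const std::size_t start = pos;
  while (pos < line.size() && line[pos] != ',') ++pos;
  std::string field(line.substr(start, pos - start));
  if (skip_trailing_comma && pos < line.size()) ++pos;  // first skip
  return field;
}

// Hypothetical main loop: parses a field, then skips one separator itself.
std::vector<std::string> parse_row(std::string_view line, bool skip_trailing_comma) {
  std::vector<std::string> fields;
  std::size_t pos = 0;
  while (true) {
    fields.push_back(parse_string_field(line, pos, skip_trailing_comma));
    if (pos >= line.size()) break;
    if (line[pos] == ',') ++pos;  // second skip: swallows an empty field
  }
  return fields;
}
```

On the input RUR,810,,1993-12, calling parse_row with skip_trailing_comma set to true yields three fields, so 1993-12 lands in the third slot where the empty MinorUnit should be. With the flag set to false, the row yields four fields and the empty one is kept.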

However, I did not see any test for rowwise parsing, and I cannot make my commented-out test pass (see csv_test.cpp:606-607). Am I calling it incorrectly?

sjanel avatar Nov 21 '24 20:11 sjanel

Thanks, I got your fix on the branch so that it parses column wise correctly. I'll look at your commented test now.

stephenberry avatar Nov 21 '24 21:11 stephenberry

I updated the rowwise test on the csv_currency branch. The first issue was that it was still reading in a column wise CSV file. Now it writes out the data in rowwise format and then tries to read that back in. The outstanding issue is that our string writing for CSV does not escape quotes.
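
To illustrate the difference between the two layouts, here is a rough standalone sketch. It reflects my understanding of the layouts (column wise: a header line, then one line per record; rowwise: one line per member, starting with its name) and should be read as an assumption rather than a specification. Column, write_colwise and write_rowwise are hypothetical names, all columns are assumed to have the same length, and no quoting is applied.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical representation: a member name plus its values.
using Column = std::pair<std::string, std::vector<std::string>>;

// Column wise: a header line with the names, then one line per record.
// Assumes every column holds the same number of values.
std::string write_colwise(const std::vector<Column>& cols) {
  std::string out;
  for (std::size_t c = 0; c < cols.size(); ++c) {
    if (c != 0) out += ',';
    out += cols[c].first;
  }
  out += '\n';
  const std::size_t rows = cols.empty() ? 0 : cols.front().second.size();
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols.size(); ++c) {
      if (c != 0) out += ',';
      out += cols[c].second[r];
    }
    out += '\n';
  }
  return out;
}

// Rowwise: one line per member, the name first and then all of its values.
std::string write_rowwise(const std::vector<Column>& cols) {
  std::string out;
  for (const auto& [name, values] : cols) {
    out += name;
    for (const auto& v : values) {
      out += ',';
      out += v;
    }
    out += '\n';
  }
  return out;
}
```

The same data therefore produces two files that are transposes of each other, which is why a reader configured for one layout cannot parse a file written in the other.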

stephenberry avatar Nov 21 '24 21:11 stephenberry

I'm going to merge the current fixes for column wise support. But I'm going to keep this issue open until rowwise support and escaped writing have been added. I just want to get these changes merged as a first step.

stephenberry avatar Nov 21 '24 21:11 stephenberry

I could add you as a contributor to Glaze if you'd like to make branches directly in Glaze for these CSV fixes.

stephenberry avatar Nov 21 '24 21:11 stephenberry

I updated the rowwise test on the csv_currency branch. The first issue was that it was still reading in a column wise CSV file. Now it writes out the data in rowwise format and then tries to read that back in. The outstanding issue is that our string writing for CSV does not escape quotes.

Oh, my bad! I thought it only changed the way the CSV file was parsed internally; I did not know that rowwise CSV existed. Anyway, it would be great to have an example in the README and in the unit tests.

sjanel avatar Nov 22 '24 19:11 sjanel

I could add you as a contributor to Glaze if you'd like to make branches directly in Glaze for these CSV fixes.

Thanks for the offer, why not! I really enjoy Glaze, so if you need help with some items I would be glad to help when I have the time.

sjanel avatar Nov 22 '24 19:11 sjanel

Cool, I sent you an invite. I find it easier to make pull requests and work from within a repository. No pressure to contribute, but any help is appreciated.

The CSV code needs some work, particularly with writing strings with escaped quotes and handling CSV files without headers.
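
As a starting point for the writing side, a minimal sketch of field escaping along the lines of RFC 4180 might look like the following. escape_csv_field is a hypothetical helper, not an existing Glaze function, and whether '[' and ']' should also force quoting for Glaze is left open here.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper (not part of Glaze): quote a field only when it
// contains a separator, a quote, or a line break, and double inner quotes.
std::string escape_csv_field(std::string_view field) {
  const bool needs_quotes =
      field.find_first_of(",\"\r\n") != std::string_view::npos;
  if (!needs_quotes) return std::string(field);

  std::string out;
  out.reserve(field.size() + 2);
  out.push_back('"');
  for (const char c : field) {
    if (c == '"') out.push_back('"');  // escape a quote by doubling it
    out.push_back(c);
  }
  out.push_back('"');
  return out;
}
```

A plain value such as RUR is written unchanged, while MOLDOVA, REPUBLIC OF is wrapped in quotes so that it can be read back as a single field.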

stephenberry avatar Nov 22 '24 19:11 stephenberry

Support was added in: #1786

stephenberry avatar Jun 04 '25 20:06 stephenberry