
CSV import, limit number of fields

Open PSLLSP opened this issue 3 years ago • 2 comments

I would like to have a directive to limit the number of fields that hledger validates.

The quality of CSV files varies. Users typically open them in a spreadsheet, like MS Excel. Spreadsheets are not strict about the CSV specification, and that leads to low-quality CSV files... :-(

I have an example of a CSV file where some records have an extra field, such as an extra note. Such a file breaks the CSV specification, but it is not a problem for Excel. It is a showstopper for hledger, though, because hledger expects every line to have the same number of fields, as the CSV specification requires.

So, I think hledger should have a directive that instructs it to validate only the first n fields and ignore the rest. The directive could define the mode in which hledger processes lines that don't match that number of fields. I see 4 possible modes:

  • Only lines with exactly n fields are processed; the rest are ignored
  • Every line is processed, but only its first n fields are used
  • Missing fields are created with an empty value; I understand this is dangerous... ;-)
  • Strict mode, the current hledger behavior: a broken CSV file is rejected with an error. This is the default mode.
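The four proposed modes can be sketched as a standalone preprocessing step. This is a hypothetical illustration only (hledger has no such directive; the function name `limit_fields` and the mode names are my own), using Python's `csv` module so that quoted commas are handled correctly:

```python
import csv
import io

def limit_fields(text, n, mode="strict"):
    """Normalize CSV text so every row has exactly n fields.

    mode: "skip"     - drop rows that do not have exactly n fields
          "truncate" - keep every row, but only its first n fields
          "pad"      - fill missing fields with empty values (risky!)
          "strict"   - reject the file on any mismatched row (current
                       hledger behavior)
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for lineno, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) == n:
            writer.writerow(row)
        elif mode == "skip":
            continue
        elif mode == "truncate":
            writer.writerow(row[:n])
        elif mode == "pad":
            writer.writerow((row + [""] * n)[:n])
        else:  # strict
            raise ValueError(f"line {lineno}: expected {n} fields, got {len(row)}")
    return out.getvalue()
```

For example, `limit_fields("a,b,c\nd,e,f,extra note\n", 3, "truncate")` keeps the second row but drops its fourth field, while `"skip"` would drop that row entirely.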

A warning switch should also be added, to help troubleshoot 'invalid' CSV files... The current situation is that when hledger finds a line with an extra field, it stops processing with an error.

PSLLSP avatar Dec 12 '21 21:12 PSLLSP

Is this something that could be handled by appropriate preprocessing? For example, with the cut command? cut -d, -f1-10 will take only the first 10 fields and ignore the others (though it will admittedly have problems if there are any quoted commas within fields).

Xitian9 avatar Dec 12 '21 22:12 Xitian9

I am a heavy user of AWK; I love awk. The problem with CSV files is that they are difficult to process even with powerful awk, let alone with the simpler cut. Well, you can process some basic CSV files with cut, but handling all the escaping rules and double-quote rules is difficult, and cut just cannot do that reliably... This is why I prefer UNL files in my tasks... ;-)
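To make the quoting problem concrete (my own example line, not from the issue): naive comma splitting, which is all cut can do, cuts a quoted field in half, while a real CSV parser keeps it intact:

```python
import csv
import io

# A valid CSV line whose second field contains a quoted comma,
# plus an unwanted trailing fourth field.
line = 'acct,"note, with comma",100,extra'

# Naive splitting on commas (what cut -d, does) breaks the quoted field:
naive = line.split(",")[:3]
print(naive)   # ['acct', '"note', ' with comma"']

# A CSV parser applies the quoting rules, so taking the first
# three fields works as intended:
parsed = next(csv.reader(io.StringIO(line)))[:3]
print(parsed)  # ['acct', 'note, with comma', '100']
```

This is why a field limit implemented inside hledger's own CSV reader would be more reliable than any cut- or awk-based preprocessing.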

I know how to handle this situation without support in hledger... I just recognized this common pattern and wanted to highlight it; I think support in hledger could be useful. It is like the skip 1 directive: you can handle that easily without hledger (tail -n +2 FILE.csv), but skip in hledger makes life easier... Another point is that hledger is multiplatform, so you can write a rules file that works for UNIX users as well as for Windows users. Windows users don't have cut on their system; they have to preprocess files in a different way...

UPDATE: here are links documenting some "naive" ways to parse CSV files in modern AWK (gawk). Those could work, sometimes... ;-) The real solution is a full parser, a library like "csvmode":

  • https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html
  • https://www.gnu.org/software/gawk/manual/html_node/More-CSV.html
  • http://mcollado.z15.es/xgawk/ (the "csvmode" library)

PSLLSP avatar Dec 12 '21 23:12 PSLLSP