Defining delimiter inside CSV files to import

Open nick-rv opened this issue 1 year ago • 4 comments

Following https://github.com/pelias/csv-importer/issues/110, I was able to import my dataset successfully using the default comma delimiter.

However, several thousand records could not be indexed because some of their fields contain commas. To get a correct result, the solution seems to be letting the user choose an arbitrary character as the delimiter.
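To illustrate the failure mode, here is a minimal sketch with a made-up row (the column names and values are hypothetical, not taken from the actual dataset). It uses plain string splitting rather than the importer's parser, and assumes the fields are not quoted:

```javascript
// Hypothetical header and row; the values are invented for illustration only.
const header = 'id;street;city';
const row = '42;12, rue de la Paix;Paris';

const expectedColumns = header.split(';').length;

// Splitting on the original ";" delimiter keeps the street field intact.
const withSemicolon = row.split(';');

// Swapping ";" for "," without quoting makes the comma inside the
// street field indistinguishable from a delimiter.
const converted = row.replace(/;/g, ',');
const withComma = converted.split(',');

console.log(expectedColumns, withSemicolon.length, withComma.length);
```

The converted row ends up with one more field than the header declares, which is the kind of record that cannot be mapped to columns reliably.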

Attempted Solutions

I changed the delimiter of my original CSV file from ";" to ",", and the import job ran to completion.

Proposal

One idea would be to allow a delimiter to be defined in the pelias.json configuration file:

{
  ...
  "imports": {
    "adminLookup": {
      "enabled": false
    },
    "csv": {
      "delimiter": "§",
      "datapath": "/data",
      "files": ["adresses-france.csv"]
    }
  }
}

This character would be used as the value of the delimiter option of the csv-parse parser instance: https://csv.js.org/parse/options/delimiter/

Applying this configuration to all the CSV files being imported seems fine from my point of view.
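A minimal sketch of how the proposed setting could be turned into parser options. The helper name `parserOptionsFromConfig` is hypothetical, and the `delimiter` key under `imports.csv` is the proposal from this issue, not an existing pelias.json setting. Only the `delimiter` option of csv-parse is assumed here:

```javascript
// Hypothetical helper: derive parser options from a pelias.json-like object.
function parserOptionsFromConfig(config) {
  const csvConfig = (config.imports && config.imports.csv) || {};
  return {
    // Fall back to a comma when no delimiter is configured,
    // so existing configurations keep their current behaviour.
    delimiter: csvConfig.delimiter || ','
  };
}

const exampleConfig = {
  imports: {
    adminLookup: { enabled: false },
    csv: { delimiter: '§', datapath: '/data', files: ['adresses-france.csv'] }
  }
};

const options = parserOptionsFromConfig(exampleConfig);
const defaults = parserOptionsFromConfig({});

// `options` could then be passed to csv-parse when creating the parser.
console.log(options, defaults);
```

Defaulting to a comma would keep the change backwards compatible for anyone who does not set the new key.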

References

https://github.com/pelias/csv-importer/issues/110

Thanks!

nick-rv, Nov 07 '24 15:11

If we add this option, it should be per file rather than global, IMO.

The solution suggested in this issue description would mean that the provided delimiter is used for all files listed in the array. I suspect this would become an issue when there is a mix of comma-delimited and other-delimited files in the list.

As the files field is of type Array<string>, we could consider prefixing each entry with a parsing hint, such as:

"files": ["tsv://adresses-france.csv"]

Also worth considering are the other parser options. I could see someone asking to modify some of the other rules at a later date, and this string-prefix method doesn't scale well in that regard.

Additionally, I don't recall whether we support compressed files such as .csv.gz. If so, we'd need to consider the impact of these hints on that, and remove the prefix in the right places before attempting to download or decompress the file.
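The prefix idea could be sketched roughly as follows. This is an illustration under assumptions, not existing importer code: the `parseFileEntry` helper and the `HINT_DELIMITERS` table are hypothetical, and only the `tsv://` hint appears in this thread.

```javascript
// Hypothetical mapping from hint prefix to delimiter.
const HINT_DELIMITERS = { csv: ',', tsv: '\t' };

// Split an entry such as "tsv://adresses-france.csv" into the bare
// filename and the delimiter implied by the hint.
function parseFileEntry(entry) {
  const match = /^([a-z]+):\/\/(.+)$/.exec(entry);
  if (match && Object.prototype.hasOwnProperty.call(HINT_DELIMITERS, match[1])) {
    return { filename: match[2], delimiter: HINT_DELIMITERS[match[1]] };
  }
  // No recognised hint: leave the entry untouched and default to a comma.
  return { filename: entry, delimiter: ',' };
}

const hinted = parseFileEntry('tsv://adresses-france.csv');
const plain = parseFileEntry('adresses-france.csv');
const compressed = parseFileEntry('tsv://adresses-france.csv.gz');

console.log(hinted, plain, compressed);
```

Stripping the hint in one place, before any download or decompression step sees the name, would keep the rest of the pipeline working with plain filenames. Only known hints are stripped, so an entry with some other scheme-like prefix passes through unchanged.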

An alternative would be to change the type of the files field to Array<string|object>, which is a little messier but more extensible.
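A rough sketch of how mixed entries could be normalised, assuming an object shape with `filename` and `delimiter` keys. Both key names and the `normalizeFileEntry` helper are hypothetical; nothing in pelias/config defines them today.

```javascript
// Hypothetical normaliser: accept either a plain filename or an object
// carrying per-file parser settings, and always return an object.
function normalizeFileEntry(entry) {
  if (typeof entry === 'string') {
    return { filename: entry, delimiter: ',' };
  }
  // Object entries override the comma default only where they set a value.
  return { delimiter: ',', ...entry };
}

const files = [
  'plain.csv',
  { filename: 'adresses-france.csv', delimiter: ';' },
  { filename: 'no-delimiter-set.csv' }
].map(normalizeFileEntry);

console.log(files);
```

Because extra keys on the object pass through untouched, other parser options could be added later without changing the type of the files field again.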

Finally, one option is to simply state that this library only supports commas, document that, and expect users to convert their data to meet that requirement.

missinglink, Nov 08 '24 11:11

Also worth mentioning: the csv-parse library we use has an open issue for automatically discovering delimiters.

We could simply wait for that to land and avoid introducing any changes to pelias/config which would later become obsolete.

https://github.com/adaltas/node-csv/issues/400

missinglink, Nov 08 '24 11:11

My preference would be to wait for the linked feature to land and then enable the auto-discover option.

missinglink, Nov 08 '24 11:11

This looks like a good pragmatic approach.

nick-rv, Dec 05 '24 23:12