
Failing to detect separator for semicolon separated file

Open avdata99 opened this issue 2 months ago • 2 comments

Describe the bug

Using the same semicolon example from test code

date;temperature;place
2011-01-01;1;Galway
2011-01-02;-1;Galway
2011-01-03;0;Galway
2011-01-01;6;Berkeley
2011-01-02;8;Berkeley
2011-01-03;5;Berkeley

as a .csv file fails to detect the separator

Expected behavior

To detect columns

avdata99 avatar Oct 29 '25 16:10 avdata99

Thanks @avdata99 for the report.

For mimetype detection, DP+ uses Python's mimetypes module and there is no auto-delimiter inferencing. And since the extension is .csv, it expects a comma.
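As a minimal illustration of why the file contents do not matter here: Python's `mimetypes.guess_type` looks only at the name it is given and never opens the file. The helper name `guess_format` and the file names below are hypothetical, and this is not DP+'s actual code.

```python
import mimetypes


def guess_format(filename: str):
    """Return the MIME type guessed from the file name alone (hypothetical helper)."""
    # guess_type() inspects only the extension; it does not read the file.
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


# Both names end in .csv, so both get the same guess (typically text/csv),
# regardless of which delimiter the files actually use.
print(guess_format("commas.csv"))
print(guess_format("semicolons.csv"))
```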

However, qsv, which powers DP+ analysis, can handle this scenario in several ways:

  1. It sets the delimiter automatically for different CSV dialects, based on the file extension (.csv for comma; .ssv for semicolon; .tsv and .tab for tab).

  2. qsv also has a sniff command that does not depend on the file extension and infers not only the delimiter but also other attributes:

$ qsv sniff testdelim.csv
Path: /Users/jdoe/testdelim.csv
Sniff Timestamp: 2025-11-05T13:00:12.096995+00:00
Last Modified: 2025-11-05T13:00:05+00:00
Delimiter: ;
Header Row: true
Preamble Rows: 0
Quote Char: none
Flexible: false
Is UTF8: true
Detected Mime Type: text/plain
Detected Kind: Other
Retrieved Size (bytes): 150
File Size (bytes): 150
Sampled Records: 6
Estimated: false
Num Records: 6
Avg Record Len (bytes): 18
Num Fields: 3
Stats Types: false
Fields:
    0:  Date    date
    1:  Signed  temperature
    2:  Text    place
$ qsv sniff testdelim.csv --pretty-json
{
  "path": "/Users/jdoe/testdelim.csv",
  "sniff_timestamp": "2025-11-05T13:02:46.714278+00:00",
  "last_modified": "2025-11-05T13:00:05+00:00",
  "delimiter_char": ";",
  "header_row": true,
  "preamble_rows": 0,
  "quote_char": "none",
  "flexible": false,
  "is_utf8": true,
  "detected_mime": "text/plain",
  "detected_kind": "Other",
  "retrieved_size": 150,
  "file_size": 150,
  "sampled_records": 6,
  "estimated": false,
  "num_records": 6,
  "avg_record_len": 18,
  "num_fields": 3,
  "stats_types": false,
  "fields": [
    "date",
    "temperature",
    "place"
  ],
  "types": [
    "Date",
    "Signed",
    "Text"
  ]
}
  3. It also supports the QSV_SNIFF_DELIMITER environment variable:
$ QSV_SNIFF_DELIMITER=1 qsv table testdelim.csv
date        temperature  place
2011-01-01  1            Galway
2011-01-02  -1           Galway
2011-01-03  0            Galway
2011-01-01  6            Berkeley
2011-01-02  8            Berkeley
2011-01-03  5            Berkeley

HOWEVER, we do not currently leverage these qsv capabilities in DP+.

Also, the current CSV sniffer I'm using in qsv to auto-detect delimiters is not bullet-proof (https://github.com/jqnatividad/qsv-sniffer). It will be replaced with a new library (https://github.com/jqnatividad/csv-qsniffer) based on this paper:

Garcia, W. (2024). "Detecting CSV file dialects by table uniformity measurement and data type inference". Data Science, 7(2), 55-72. DOI: 10.3233/DS-240062

It's still WIP however (see https://github.com/dathere/qsv/issues/2247). ☹️ (Perhaps, you can help me with it? 😉 )

For now, I'll investigate how to leverage qsv's EXISTING auto-delimiter detection in DP+.
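One possible shape for this, as a rough sketch only: call `qsv sniff --pretty-json` from Python and read the `delimiter_char` key shown in the output above. The function names, the comma fallback, and the assumption that `qsv` is on the PATH are all hypothetical; this is not how DP+ currently works.

```python
import json
import subprocess


def parse_sniff_delimiter(sniff_json: str, default: str = ",") -> str:
    """Extract the delimiter from `qsv sniff` JSON output (hypothetical helper).

    Falls back to `default` when the key is missing or empty.
    """
    metadata = json.loads(sniff_json)
    return metadata.get("delimiter_char") or default


def sniff_delimiter(path: str, qsv_bin: str = "qsv") -> str:
    """Run `qsv sniff --pretty-json` on a file and return its delimiter.

    Assumes the qsv binary is installed; raises CalledProcessError if it fails.
    """
    result = subprocess.run(
        [qsv_bin, "sniff", "--pretty-json", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sniff_delimiter(result.stdout)
```

Keeping the JSON parsing separate from the subprocess call means the parsing can be tested without qsv installed, for example `parse_sniff_delimiter('{"delimiter_char": ";"}')` returns `";"`.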

jqnatividad avatar Nov 05 '25 13:11 jqnatividad

@avdata99 I wanted to mention that the datapusher-plus_testing framework could be really helpful for testing files as you work on this issue. For anyone not familiar with it, datapusher-plus_testing is an automated testing suite that validates DataPusher+ functionality across multiple file formats. It's particularly useful for scenarios like this because:

  - You can drop test files (like the semicolon-delimited CSV from this issue) into the tests/custom directory
  - Run the GitHub Actions workflow to test against different DP+ branches
  - Get detailed logs showing exactly what happened during ingestion:
      - ckan_worker.log contains the full DataPusher+ trace
      - test_results.csv shows pass/fail status for each file
      - worker_analysis.csv breaks down performance metrics

a5dur avatar Nov 06 '25 14:11 a5dur