Failing to detect separator for semicolon separated file
Describe the bug
Saving the same semicolon-separated example from the test code
date;temperature;place
2011-01-01;1;Galway
2011-01-02;-1;Galway
2011-01-03;0;Galway
2011-01-01;6;Berkeley
2011-01-02;8;Berkeley
2011-01-03;5;Berkeley
as a .csv file and submitting it results in the separator not being detected.
Expected behavior
To detect columns
Thanks @avdata99 for the report.
For mimetype detection, DP+ uses Python's mimetypes module, and there is no automatic delimiter inference. Since the extension is .csv, it expects a comma.
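To illustrate the point, here is a minimal sketch using the standard-library mimetypes module. It is not DP+'s actual code, and guess_mime is a hypothetical helper name. The guess is derived from the file name alone, so the contents of the file (and its real delimiter) never come into play.

```python
import mimetypes
from typing import Optional


def guess_mime(filename: str) -> Optional[str]:
    """Return the MIME type guessed from the file name alone."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


# The file does not even need to exist: only the extension is inspected,
# so a semicolon-delimited file named .csv is still treated as plain CSV.
print(guess_mime("testdelim.csv"))
```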
However, qsv, which powers DP+ analysis, can handle this scenario using several approaches:
- It sets the delimiter automatically for different CSV dialects, based on the file extension (.csv for comma; .ssv for semicolon; .tsv and .tab for tab).
- qsv also has a sniff command that doesn't depend on the file extension and infers not only the delimiter but also other attributes:
$ qsv sniff testdelim.csv
Path: /Users/jdoe/testdelim.csv
Sniff Timestamp: 2025-11-05T13:00:12.096995+00:00
Last Modified: 2025-11-05T13:00:05+00:00
Delimiter: ;
Header Row: true
Preamble Rows: 0
Quote Char: none
Flexible: false
Is UTF8: true
Detected Mime Type: text/plain
Detected Kind: Other
Retrieved Size (bytes): 150
File Size (bytes): 150
Sampled Records: 6
Estimated: false
Num Records: 6
Avg Record Len (bytes): 18
Num Fields: 3
Stats Types: false
Fields:
0: Date date
1: Signed temperature
2: Text place
$ qsv sniff testdelim.csv --pretty-json
{
  "path": "/Users/jdoe/testdelim.csv",
  "sniff_timestamp": "2025-11-05T13:02:46.714278+00:00",
  "last_modified": "2025-11-05T13:00:05+00:00",
  "delimiter_char": ";",
  "header_row": true,
  "preamble_rows": 0,
  "quote_char": "none",
  "flexible": false,
  "is_utf8": true,
  "detected_mime": "text/plain",
  "detected_kind": "Other",
  "retrieved_size": 150,
  "file_size": 150,
  "sampled_records": 6,
  "estimated": false,
  "num_records": 6,
  "avg_record_len": 18,
  "num_fields": 3,
  "stats_types": false,
  "fields": [
    "date",
    "temperature",
    "place"
  ],
  "types": [
    "Date",
    "Signed",
    "Text"
  ]
}
- It also supports the QSV_SNIFF_DELIMITER environment variable:
$ QSV_SNIFF_DELIMITER=1 qsv table testdelim.csv
date temperature place
2011-01-01 1 Galway
2011-01-02 -1 Galway
2011-01-03 0 Galway
2011-01-01 6 Berkeley
2011-01-02 8 Berkeley
2011-01-03 5 Berkeley
HOWEVER, we do not currently leverage these qsv capabilities in DP+.
Also, the current CSV sniffer I'm using in qsv to auto-detect delimiters is not bullet-proof (https://github.com/jqnatividad/qsv-sniffer). It will be replaced with a new library (https://github.com/jqnatividad/csv-qsniffer) based on this paper:
Garcia, W. (2024). "Detecting CSV file dialects by table uniformity measurement and data type inference". Data Science, 7(2), 55-72. DOI: 10.3233/DS-240062
It's still WIP however (see https://github.com/dathere/qsv/issues/2247). ☹️ (Perhaps, you can help me with it? 😉 )
For now, I'll investigate how to leverage qsv's EXISTING auto-delimiter detection in DP+.
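As a starting point, here is a rough sketch of what that could look like on the Python side. It is only an illustration under stated assumptions, not the DP+ implementation: the helper names parse_sniff_delimiter and sniff_delimiter are hypothetical, and qsv is assumed to be on the PATH. The only things taken from qsv itself are the sniff --pretty-json invocation and the delimiter_char key, both as they appear in the output earlier in this comment.

```python
import json
import subprocess


def parse_sniff_delimiter(sniff_json: str, default: str = ",") -> str:
    """Extract the delimiter from `qsv sniff --pretty-json` output.

    Falls back to `default` when the key is missing or empty.
    """
    metadata = json.loads(sniff_json)
    delimiter = metadata.get("delimiter_char")
    return delimiter if delimiter else default


def sniff_delimiter(path: str, qsv_bin: str = "qsv") -> str:
    """Run `qsv sniff` on a file and return the inferred delimiter.

    Assumes a qsv binary is installed; raises CalledProcessError if it fails.
    """
    result = subprocess.run(
        [qsv_bin, "sniff", path, "--pretty-json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sniff_delimiter(result.stdout)


# Abridged from the sniff output for testdelim.csv shown earlier.
sample = '{"delimiter_char": ";", "header_row": true, "num_fields": 3}'
print(parse_sniff_delimiter(sample))  # ;
```

Keeping the JSON parsing separate from the subprocess call means the parsing can be unit-tested without a qsv binary. Alternatively, setting QSV_SNIFF_DELIMITER=1 in the environment of the existing qsv calls, as in the table example, might avoid the extra sniff step altogether.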
@avdata99 I wanted to mention that the datapusher-plus_testing framework could be really helpful for testing files as you work on this issue. For anyone not familiar with it, datapusher-plus_testing is an automated testing suite that validates DataPusher+ functionality across multiple file formats. It's particularly useful for scenarios like this because:
- You can drop test files (like the semicolon-delimited CSV from this issue) into the tests/custom directory
- Run the GitHub Actions workflow to test against different DP+ branches
- Get detailed logs showing exactly what happened during ingestion:
  - ckan_worker.log contains the full DataPusher+ trace
  - test_results.csv shows pass/fail status for each file
  - worker_analysis.csv breaks down performance metrics