Improve error messages to include the name of the file causing ingestion issues
If a malformed CSV file gets added to a directory, data ingestion can fail. In such cases the error currently doesn't include the name of the file causing the issue. Add the corrupt file's name to the returned error. Optionally, also add the ability to skip corrupted files.
If this is caused by DuckDB not showing the file name, we should consider just raising the issue in their issue tracker instead.
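To make the request concrete, here is a minimal sketch of the desired behavior, written with only the Python standard library. It is not Rill's or DuckDB's code: `IngestionError`, `read_csv_strict`, `ingest_directory`, and the `skip_corrupt` flag are all hypothetical names, and "malformed" is assumed here to mean a row whose column count differs from the header's.

```python
import csv
from pathlib import Path


class IngestionError(Exception):
    """Hypothetical error type that always names the offending file."""

    def __init__(self, path, line, reason):
        super().__init__(f"{path}: line {line}: {reason}")
        self.path = path
        self.line = line
        self.reason = reason


def read_csv_strict(path):
    """Read one CSV file; every non-blank row must match the header width."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:
            raise IngestionError(path, 1, "file is empty")
        rows = []
        try:
            for row in reader:
                if not row:
                    continue  # tolerate blank lines
                if len(row) != len(header):
                    raise IngestionError(
                        path,
                        reader.line_num,
                        f"expected {len(header)} columns, found {len(row)}",
                    )
                rows.append(row)
        except csv.Error as err:
            # Re-raise low-level parser errors with the file name attached.
            raise IngestionError(path, reader.line_num, str(err)) from err
        return rows


def ingest_directory(directory, skip_corrupt=False):
    """Ingest every *.csv file in a directory.

    Returns (rows, skipped), where skipped holds the errors for files that
    were skipped. With skip_corrupt=False the first bad file aborts ingestion.
    """
    rows, skipped = [], []
    for path in sorted(Path(directory).glob("*.csv")):
        try:
            rows.extend(read_csv_strict(path))
        except IngestionError as err:
            if not skip_corrupt:
                raise
            skipped.append(err)
    return rows, skipped
```

With `skip_corrupt=False` the caller gets an error whose message starts with the path of the bad file; with `skip_corrupt=True` the good files are still ingested and the skipped files are reported alongside the rows.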
I see that the error message from DuckDB is very helpful on main, but not on 1.0.0.
Error message from main:
Conversion Error: CSV Error on Line: 1670846
Original Line: B00310,HELLO,2022-07-20 07:02:30,,242.0,,B03404
Error when converting column "pickup_datetime". Could not convert string "HELLO" to 'TIMESTAMP'
Column pickup_datetime is being converted as type TIMESTAMP
This type was auto-detected from the CSV file.
Possible solutions:
* Override the type for this column manually by setting the type explicitly, e.g. types={'pickup_datetime': 'VARCHAR'}
* Set the sample size to a larger value to enable the auto-detection to scan more values, e.g. sample_size=-1
* Use a COPY statement to automatically derive types from an existing table.
file=data_22.csv
delimiter = , (Auto-Detected)
quote = " (Auto-Detected)
escape = " (Auto-Detected)
new_line = \n (Auto-Detected)
header = true (Auto-Detected)
skip_rows = 0 (Auto-Detected)
comment = \0 (Auto-Detected)
date_format = (Auto-Detected)
timestamp_format = (Auto-Detected)
null_padding=0
sample_size=20480
ignore_errors=false
all_varchar=0
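The options named in this message can be combined in a single `read_csv` call. The sketch below only builds the query string, so it runs without DuckDB installed; `build_read_csv_query` is a hypothetical helper, not something Rill does today. The `types`, `sample_size`, and `ignore_errors` options are taken from the message above, and `filename=true` is the DuckDB `read_csv` option that adds a column naming each row's source file. Values are interpolated without escaping, so this assumes trusted input.

```python
def build_read_csv_query(glob, types=None, sample_size=None, ignore_errors=False):
    """Build a DuckDB read_csv query string (hypothetical helper).

    No SQL escaping is done; inputs are assumed to be trusted.
    """
    # filename=true adds a column with the source file of each row.
    options = ["filename=true"]
    if types:
        pairs = ", ".join(f"'{col}': '{typ}'" for col, typ in types.items())
        options.append("types={" + pairs + "}")
    if sample_size is not None:
        options.append(f"sample_size={sample_size}")
    if ignore_errors:
        # Rows that fail to parse are dropped instead of failing the query.
        options.append("ignore_errors=true")
    joined = ", ".join(options)
    return f"SELECT * FROM read_csv('{glob}', {joined})"


query = build_read_csv_query(
    "data/*.csv",
    types={"pickup_datetime": "VARCHAR"},
    sample_size=-1,
    ignore_errors=True,
)
print(query)
```

Note that `ignore_errors=true` skips individual bad rows rather than whole files, which is a different granularity from skipping a corrupted file entirely.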
I will pick this up once DuckDB 1.1.0 is released, which is scheduled for 2024-09-02.
Nothing needs to be done on our side. The file name will already be part of the error message. Sample below:
However the file name is trimmed.
The trimmed file name will be handled in a separate issue: https://github.com/rilldata/rill/issues/5604