datafusion
datafusion copied to clipboard
Files without `.parquet`, `.csv` extension inferred as having no schema
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I wrote a test approximating
let file = tempfile::tempfile();
// ... write parquet data ...
let mut context = ExecutionContext::new();
context.register_parquet("t", file.path().as_str())
context.sql("select column from t");
This would result in "Invalid identifier" errors, effectively claiming the column didn't exist. I verified the file existed, had the correct columns, etc... I was very confused :laughing:
Eventually I tracked this down to the schema being inferred as empty if the extension is not ".parquet", this feels unexpected
Describe the solution you'd like
Either register_parquet should return an error if the extension is missing, or FileFormat::infer_schema should be more agnostic to file extensions.
Change this description to be a bug, which I think better reflects what is going on
@tustvold may i ask you how to write a tmp parquet file, IMHP i think all parquet data file should end .parquet.
@Ted-Jiang Remove this line in the SQL benchmark https://github.com/apache/arrow-datafusion/pull/1738/files#diff-d1dbff8af63c3a3fe4d918432f982181b40fa4b7e1641522a6a48904f521fc89R143 and that was what caused issues.
I don't feel strongly whether the .parquet file should be mandatory or not, but I do think silently inferring an empty schema makes for a pretty poor UX :sweat_smile:
but I do think silently inferring an empty schema makes for a pretty poor UX 😅
I agree -- an error about "can not infer schema" seems much better than silently ignoring
I plan on fixing this for 14.0.0
Some notes from investigating this.
This is where the EmptyExec gets introduced. I ran into this issue in Ballista as well in https://github.com/apache/arrow-ballista/pull/414
// if no files need to be read, return an `EmptyExec`
if partitioned_file_lists.is_empty() {
let schema = self.schema();
let projected_schema = project_schema(&schema, projection.as_ref())?;
return Ok(Arc::new(EmptyExec::new(false, projected_schema)));
}
The problem is I am registering a CSV data source and the filename does not end with .csv and we have hard-coded assumptions that CSV files must end with ".csv", so we cannot support .tsv or .tbl (without the user explicitly specifying this, and it is not possible to specify this in SQL as far as I know).
I think we just need to add an error check for this case and prompt the user to specify the file extension they want rather than use the default.
I propose that the error check we add is that at least one file exists with the specified extension.
I propose that the error check we add is that at least one file exists with the specified extension.
Makes sense to me
I also ran into this issue. The problem seems to be even if only a single file is provided, we still try to match by extension.
https://github.com/apache/arrow-datafusion/blob/0e5f6df2c4fb2d647874102a19c74eaaf7f34d98/datafusion/core/src/datasource/listing/url.rs#L143-L175
If my use case always uses a single csv file for each table, would a reasonable workaround be setting file extension to an empty string in CsvReadOptions? Are there any problems?
ctx.register_csv(
name,
path,
CsvReadOptions::new().file_extension("")
).await?
Maybe file_extension could be made an Option<String>, defaulting to None?
I tried this usecase with the fix in https://github.com/apache/arrow-datafusion/issues/6147 from @aprimadi and it works great :
(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-datafusion/datafusion-cli$ head /tmp/test/customer
1|Customer#000000001|IVhzIApeRb ot,c,E|15|25-989-741-2988|711.56|BUILDING|to the even, regular platelets. regular, ironic epitaphs nag e|
2|Customer#000000002|XSTf4,NCwDVaWNe6tEgvwfmRchLXak|13|23-768-687-3665|121.65|AUTOMOBILE|l accounts. blithely ironic theodolites integrate boldly: caref|
3|Customer#000000003|MG9kdTD2WBHm|1|11-719-748-3364|7498.12|AUTOMOBILE| deposits eat slyly ironic, even instructions. express foxes detect slyly. blithely even accounts abov|
4|Customer#000000004|XxVSJsLAGtn|4|14-128-190-5944|2866.83|MACHINERY| requests. final, regular ideas sleep final accou|
5|Customer#000000005|KvpyuHCplrB84WgAiGV6sYpZq7Tj|3|13-750-942-6364|794.47|HOUSEHOLD|n accounts will have to unwind. foxes cajole accor|
6|Customer#000000006|sKZz0CsnMD7mp4Xd0YrBvx,LREYKUWAh yVn|20|30-114-968-4951|7638.57|AUTOMOBILE|tions. even deposits boost according to the slyly bold packages. final accounts cajole requests. furious|
7|Customer#000000007|TcGe5gaZNgVePxU5kRrvXBfkasDTea|18|28-190-982-9759|9561.95|AUTOMOBILE|ainst the ironic, express theodolites. express, even pinto beans among the exp|
8|Customer#000000008|I0B10bB0AymmC, 0PrRYBCP1yGJ8xcBPmWhl5|17|27-147-574-9335|6819.74|BUILDING|among the slyly regular theodolites kindle blithely courts. carefully even theodolites haggle slyly along the ide|
9|Customer#000000009|xKiAFTjUsCuxfeleNqefumTrjS|8|18-338-906-3675|8324.07|FURNITURE|r theodolites according to the requests wake thinly excuses: pending requests haggle furiousl|
10|Customer#000000010|6LrEaV6KR6PLVcgl2ArL Q3rqzLzcT1 v2|5|15-741-346-9870|2753.54|HOUSEHOLD|es regular deposits haggle. fur|
(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-datafusion/datafusion-cli$ CARGO_TARGET_DIR=/Users/alamb/Software/target-df cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.29s
Running `/Users/alamb/Software/target-df/debug/datafusion-cli`
DataFusion CLI v23.0.0
❯ CREATE EXTERNAL TABLE customer STORED AS CSV DELIMITER '|' LOCATION '/tmp/test/customer';
0 rows in set. Query took 0.290 seconds.
❯ select * from customer limit 1;
+----------+--------------------+-------------------+----------+-----------------+----------+----------+----------------------------------------------------------------+----------+
| column_1 | column_2 | column_3 | column_4 | column_5 | column_6 | column_7 | column_8 | column_9 |
+----------+--------------------+-------------------+----------+-----------------+----------+----------+----------------------------------------------------------------+----------+
| 1 | Customer#000000001 | IVhzIApeRb ot,c,E | 15 | 25-989-741-2988 | 711.56 | BUILDING | to the even, regular platelets. regular, ironic epitaphs nag e | |
+----------+--------------------+-------------------+----------+-----------------+----------+----------+----------------------------------------------------------------+----------+
1 row in set. Query took 0.065 seconds.