Improved experience when remote object store URL does not end in `/`
Is your feature request related to a problem or challenge?
- part of https://github.com/apache/datafusion/issues/13456
- related to https://github.com/apache/datafusion/issues/16299
I would like querying files from remote stores to be easy and a great experience in DataFusion, and in datafusion-cli in particular.
While testing https://github.com/apache/datafusion/pull/16300, I tried this command:
datafusion-cli
> CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet';
Object Store error: Object at location nyc_taxi_rides/data/tripdata_parquet not found: Error performing HEAD https://s3.us-east-1.amazonaws.com/altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet in 142.679833ms - Server returned non-2xx status code: 404 Not Found:
This confused me for quite a while, as that is a valid URL (prefix).
The issue is that the URL 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet' does not end in a `/`. If you add a `/`, it works great:
> CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/';
0 row(s) fetched.
Elapsed 1.624 seconds.
BTW this is inconsistent with the local file system, where selecting from a directory whose path doesn't end in `/` works just fine:
-- Write data to `foo` directory:
> copy (values(1)) to 'foo/1.parquet';
+-------+
| count |
+-------+
| 1 |
+-------+
1 row(s) fetched.
Elapsed 0.044 seconds.
-- Note the location doesn't end in `/` but it works fine
> create external table foo stored as parquet location 'foo';
0 row(s) fetched.
Elapsed 0.022 seconds.
> select * from foo;
+---------+
| column1 |
+---------+
| 1 |
+---------+
1 row(s) fetched.
Elapsed 0.132 seconds.
Describe the solution you'd like
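I would like this to be less confusing: a remote location that names a directory (prefix) should either work without the trailing `/` or fail with an error that explains what to change.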
Describe alternatives you've considered
Alternate 1: Better Error Message
At the very least, we can make the message more explicit ("Not found. Hint: if it is a directory, the path should end with /").
Alternate 2: Preferred
It would be even better to automatically add a `/` to the path if the original path was not found, and try again.
I think the trick will be to figure out at what level we should try to add the `/` (probably when first creating the ListingTable?)
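To make the two alternates concrete, here is a minimal sketch of the retry logic in plain Rust. It is not DataFusion's actual code: `candidate_locations`, `resolve_location`, and the `exists` callback are hypothetical names, and `exists` stands in for whatever HEAD/LIST check the object store performs. It tries the location as written, then retries with a trailing `/`, and falls back to the hint from Alternate 1 if neither is found.

```rust
/// Hypothetical helper: the locations to try, in order.
/// A location that already ends in `/` is tried only as-is.
fn candidate_locations(url: &str) -> Vec<String> {
    if url.ends_with('/') {
        vec![url.to_string()]
    } else {
        vec![url.to_string(), format!("{}/", url)]
    }
}

/// Hypothetical helper: returns the first candidate for which `exists`
/// returns true. `exists` is an assumption standing in for the real
/// object store lookup (for example a HEAD or LIST request).
fn resolve_location<F>(url: &str, mut exists: F) -> Result<String, String>
where
    F: FnMut(&str) -> bool,
{
    candidate_locations(url)
        .into_iter()
        .find(|candidate| exists(candidate.as_str()))
        .ok_or_else(|| {
            // Alternate 1: at least make the error message explicit.
            format!(
                "Object at location {} not found. Hint: if it is a directory, the path should end with /",
                url
            )
        })
}
```

With a callback that only recognizes `s3://bucket/data/`, calling `resolve_location("s3://bucket/data", ...)` would return the slash-terminated location. Where this retry belongs in DataFusion (for example when the ListingTable is first created) is still the open question above.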
Additional context
No response
take
@alamb I submitted a PR, but I don't know if it is what you want.
Thanks @xiedeyantu -- I'll try and review it shortly
New PRs are up:
- https://github.com/apache/datafusion/pull/17364
- https://github.com/xiedeyantu/datafusion/pull/2
I have merged the code you helped me modify, and the CI has passed. Thank you very much! @alamb