Improved experience when remote object store URL does not end in `/`

Open alamb opened this issue 7 months ago • 4 comments

Is your feature request related to a problem or challenge?

  • part of https://github.com/apache/datafusion/issues/13456
  • related to https://github.com/apache/datafusion/issues/16299

I would like querying files from remote stores to be easy and a great experience in DataFusion, and in datafusion-cli in particular.

While testing https://github.com/apache/datafusion/pull/16300, I tried this command:

datafusion-cli
> CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet';
Object Store error: Object at location nyc_taxi_rides/data/tripdata_parquet not found: Error performing HEAD https://s3.us-east-1.amazonaws.com/altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet in 142.679833ms - Server returned non-2xx status code: 404 Not Found:

This confused me for quite a while, as that is a valid URL (prefix).

The issue is that the URL 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet' does not end in a `/`. If you add a `/`, it works great:

> CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/';
0 row(s) fetched.
Elapsed 1.624 seconds.

BTW this is inconsistent with the local file system, where selecting from a directory whose path doesn't end in `/` works just fine:

-- Write data to `foo` directory:
> copy (values(1)) to 'foo/1.parquet';
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched.
Elapsed 0.044 seconds.

-- Note the location doesn't end in `/` but it works fine
> create external table foo stored as parquet location 'foo';
0 row(s) fetched.
Elapsed 0.022 seconds.

> select * from foo;
+---------+
| column1 |
+---------+
| 1       |
+---------+
1 row(s) fetched.
Elapsed 0.132 seconds.

Describe the solution you'd like

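I would like this to be less confusing: a remote location without a trailing `/` should either work, as it does on a local file system, or fail with an error that explains what to change.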

Describe alternatives you've considered

Alternate 1: Better Error Message

At the very least, we can make the message more explicit ("Not found. Hint: if it is a directory the path should end with /").
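A minimal sketch of what Alternate 1 could look like, using only the Rust standard library. The function name `not_found_with_hint` and its signature are hypothetical and are not DataFusion or `object_store` APIs; the hint text is the one suggested above.

```rust
/// Hypothetical helper (not a DataFusion API): decorate a "not found"
/// message with a hint when the location has no trailing `/`.
fn not_found_with_hint(location: &str, message: &str) -> String {
    if location.ends_with('/') {
        // The user already asked for a directory/prefix, so the hint
        // would not help; return the message unchanged.
        message.to_string()
    } else {
        format!(
            "{}. Hint: if it is a directory the path should end with /",
            message
        )
    }
}
```

The hint is only appended when the location lacks a trailing `/`, since that is the only case where adding one could change the outcome.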

Alternate 2: Preferred

It would be even better to automatically add a `/` to the path if the first lookup was not found, and then try again.

I think the trick will be to figure out at what level we should try to add the `/` (probably when first creating the ListingTable?).
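The retry idea in Alternate 2 can be sketched as follows. This is an illustration under stated assumptions, not the change made in the linked PRs: `FakeStore`, `Resolved`, and `resolve` are hypothetical names, and the store is modeled as a flat set of keys standing in for a real remote object store.

```rust
use std::collections::BTreeSet;

/// Stand-in for a remote object store: a flat set of object keys.
/// (Hypothetical; this is not the `object_store` crate API.)
struct FakeStore {
    keys: BTreeSet<String>,
}

impl FakeStore {
    fn new(keys: &[&str]) -> Self {
        FakeStore {
            keys: keys.iter().map(|k| k.to_string()).collect(),
        }
    }

    /// Like a HEAD request: true only if an object exists at exactly `path`.
    fn head(&self, path: &str) -> bool {
        self.keys.contains(path)
    }

    /// Like a LIST request: every key that starts with `prefix`, in key order.
    fn list(&self, prefix: &str) -> Vec<String> {
        self.keys
            .iter()
            .filter(|k| k.starts_with(prefix))
            .cloned()
            .collect()
    }
}

#[derive(Debug, PartialEq)]
enum Resolved {
    File(String),
    Directory { prefix: String, files: Vec<String> },
}

/// Try the location as a single object first; if that is not found,
/// append `/` and retry it as a directory-like prefix.
fn resolve(store: &FakeStore, location: &str) -> Option<Resolved> {
    if !location.ends_with('/') && store.head(location) {
        return Some(Resolved::File(location.to_string()));
    }
    let prefix = if location.ends_with('/') {
        location.to_string()
    } else {
        format!("{}/", location)
    };
    let files = store.list(&prefix);
    if files.is_empty() {
        None
    } else {
        Some(Resolved::Directory { prefix, files })
    }
}
```

With this shape, `resolve(&store, "data/tripdata_parquet")` and `resolve(&store, "data/tripdata_parquet/")` give the same result when objects exist under that prefix. Appending the `/` before listing also keeps a location such as `data/trip` from matching sibling keys that merely share a leading substring.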

Additional context

No response

alamb avatar Jun 06 '25 13:06 alamb

take

xiedeyantu avatar Jun 12 '25 09:06 xiedeyantu

@alamb I submitted a PR, I don't know if it is what you want.

xiedeyantu avatar Jun 13 '25 14:06 xiedeyantu

Thanks @xiedeyantu -- I'll try and review it shortly

alamb avatar Jun 13 '25 16:06 alamb

> Thanks @xiedeyantu -- I'll try and review it shortly

Thanks a lot! @alamb

xiedeyantu avatar Jun 13 '25 23:06 xiedeyantu

New PRs are up:

  • https://github.com/apache/datafusion/pull/17364
  • https://github.com/xiedeyantu/datafusion/pull/2

alamb avatar Sep 05 '25 19:09 alamb

I have merged the code you helped me modify, and the CI has passed. Thank you very much! @alamb

xiedeyantu avatar Sep 06 '25 00:09 xiedeyantu