
Add `ObjectStore` support via SQL

Open matthewmturner opened this issue 3 years ago • 7 comments

Is your feature request related to a problem or challenge?

I am working towards making datafusion-cli a powerful tool for local ad-hoc data analysis. The first step was #1875, which enables defining a local "database" that runs on startup from a .datafusionrc file. As a second step, I would like to be able to connect to object stores, such as S3, directly from SQL. That will of course require adding S3 as a feature to datafusion-cli, but that feature is useless unless ObjectStores can be registered. Below is the current behaviour:

❯ CREATE EXTERNAL TABLE t STORED AS CSV LOCATION 's3://bucket/t.csv';
Internal("No suitable object store found for s3")

Describe the solution you'd like

I would like to be able to register an ObjectStore just from SQL. Given that ObjectStore is a DataFusion concept, I was thinking that we could add a function such as register_object_store rather than a SQL statement.

So it would look something like:

Default credentials

❯   register_object_store('s3');

Minio

❯   register_object_store('s3', ACCESS_KEY, SECRET_KEY, PROVIDER, ENDPOINT);
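To make the idea concrete, here is a minimal sketch of what a scheme-keyed registration could look like. It is not DataFusion's actual API: `StoreConfig`, `StoreRegistry`, and `resolve` are hypothetical names invented for illustration, and only the `register_object_store` name and the error text come from this issue.

```rust
use std::collections::HashMap;

/// Hypothetical connection settings; the fields mirror the arguments proposed above.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct StoreConfig {
    pub access_key: Option<String>,
    pub secret_key: Option<String>,
    pub provider: Option<String>,
    pub endpoint: Option<String>,
}

/// Hypothetical registry keyed by URI scheme (for example "s3").
#[derive(Debug, Default)]
pub struct StoreRegistry {
    stores: HashMap<String, StoreConfig>,
}

impl StoreRegistry {
    /// Registers (or replaces) the store used for `scheme`.
    pub fn register_object_store(&mut self, scheme: &str, config: StoreConfig) {
        self.stores.insert(scheme.to_ascii_lowercase(), config);
    }

    /// Looks up the store for a table location such as "s3://bucket/t.csv".
    /// Locations without a "scheme://" prefix are treated as scheme "file".
    pub fn resolve(&self, location: &str) -> Result<&StoreConfig, String> {
        let scheme = match location.split_once("://") {
            Some((scheme, _)) => scheme.to_ascii_lowercase(),
            None => "file".to_string(),
        };
        self.stores
            .get(&scheme)
            .ok_or_else(|| format!("No suitable object store found for {}", scheme))
    }
}
```

With default credentials, `register_object_store("s3", StoreConfig::default())` would correspond to the first call above; the Minio case would fill in the optional fields.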


matthewmturner • Mar 05 '22 17:03

@seddonm1 @yjshen @houqp FYI - in case you have thoughts on this.

matthewmturner • Mar 05 '22 17:03

Actually, I'm not sure how well those parameters in register_object_store will generalize to ObjectStores other than S3, so now I'm not sure whether a general function like that could be used.

matthewmturner • Mar 05 '22 17:03

Maybe my objective could be achieved with some command-line options instead. For example:

Default credentials

$ datafusion-cli --object-store s3

Minio

$ datafusion-cli --object-store s3 --access-key KEY --secret-key ABC --provider PROVIDER --endpoint ENDPOINT

@houqp @yjshen @seddonm1 do you have a view on whether ObjectStore registration can be done via SQL or if this should be part of datafusion-cli?

matthewmturner • Mar 06 '22 01:03
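For reference, the command-line variant proposed in the comment above could be parsed roughly as follows. This is only a sketch using the standard library: the flag names are taken from that comment, while `ObjectStoreArgs` and `parse_object_store_args` are hypothetical names and not part of datafusion-cli.

```rust
/// Hypothetical container for the flags proposed in this thread.
#[derive(Debug, Default, PartialEq)]
pub struct ObjectStoreArgs {
    pub object_store: Option<String>,
    pub access_key: Option<String>,
    pub secret_key: Option<String>,
    pub provider: Option<String>,
    pub endpoint: Option<String>,
}

/// Parses `--flag value` pairs; unknown flags and missing values are errors.
pub fn parse_object_store_args<I>(args: I) -> Result<ObjectStoreArgs, String>
where
    I: IntoIterator<Item = String>,
{
    let mut parsed = ObjectStoreArgs::default();
    let mut iter = args.into_iter();
    while let Some(flag) = iter.next() {
        // Pick the field that this flag fills in.
        let slot = match flag.as_str() {
            "--object-store" => &mut parsed.object_store,
            "--access-key" => &mut parsed.access_key,
            "--secret-key" => &mut parsed.secret_key,
            "--provider" => &mut parsed.provider,
            "--endpoint" => &mut parsed.endpoint,
            other => return Err(format!("unknown option: {}", other)),
        };
        // Every flag in this sketch takes exactly one value.
        let value = iter
            .next()
            .ok_or_else(|| format!("missing value for {}", flag))?;
        *slot = Some(value);
    }
    Ok(parsed)
}
```

In a real binary this would be fed `std::env::args().skip(1)`; a dedicated argument-parsing crate would likely be preferable to hand-rolled parsing.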

I think it can be done through both, because the secret key credentials and endpoint can be provided through environment variables as well. In that case, the user will only need to provide the S3 path in the SQL query.

houqp • Mar 09 '22 05:03
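The environment-variable fallback suggested in the comment above could be sketched like this. `S3Settings` and `resolve_s3_settings` are hypothetical names; `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the conventional AWS variable names, while `AWS_ENDPOINT` is an assumption made for this sketch. The lookup is passed in as a closure so the logic does not depend on the real process environment.

```rust
/// Hypothetical resolved settings for an S3-compatible store.
#[derive(Debug, PartialEq)]
pub struct S3Settings {
    pub access_key: String,
    pub secret_key: String,
    pub endpoint: Option<String>,
}

/// Prefers explicitly supplied values and falls back to `lookup(name)`.
/// Variable names are assumptions for this sketch (see the note above).
pub fn resolve_s3_settings<F>(
    access_key: Option<String>,
    secret_key: Option<String>,
    endpoint: Option<String>,
    lookup: F,
) -> Result<S3Settings, String>
where
    F: Fn(&str) -> Option<String>,
{
    let access_key = access_key
        .or_else(|| lookup("AWS_ACCESS_KEY_ID"))
        .ok_or_else(|| "missing access key".to_string())?;
    let secret_key = secret_key
        .or_else(|| lookup("AWS_SECRET_ACCESS_KEY"))
        .ok_or_else(|| "missing secret key".to_string())?;
    // The endpoint stays optional; None means "use the provider default".
    let endpoint = endpoint.or_else(|| lookup("AWS_ENDPOINT"));
    Ok(S3Settings {
        access_key,
        secret_key,
        endpoint,
    })
}
```

Against the real environment the last argument would be `|name| std::env::var(name).ok()`, so a user who has exported credentials only needs to give the S3 path in SQL.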

@matthewmturner any progress on this one? If you are not working on it still, I would like to take a stab at it

turbo1912 • Sep 16 '22 23:09

I think this repo is largely deprecated in favour of https://github.com/apache/arrow-rs/tree/master/object_store

seddonm1 • Sep 16 '22 23:09

> @matthewmturner any progress on this one? If you are not working on it still, I would like to take a stab at it

@turbo1912 Haven't been able to work on this, go for it!

matthewmturner • Sep 17 '22 00:09