datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

feat: allow object store registration from datafusion-cli

Open turbo1912 opened this issue 2 years ago • 7 comments

Which issue does this PR close?

Closes #3424.

Rationale for this change

This change provides an easy way for datafusion cli users to register object stores.

(inspiration from https://github.com/datafusion-contrib/datafusion-cookbook)

What changes are included in this PR?

Added environment variable support for registering s3 and gcp object stores. Currently users can only specify bucket name from the CLI and the rest of the configuration is setup from environment variables.

Are there any user-facing changes?

Datafusion cli users can now register an object store when they are starting the datafusion cli

datafusion cli --object-store s3 --bucket-name name

and then create external tables from object stores in the interactive shell:

create external table unicorns stored as parquet location 's3://my_bucket/lineitem/'

turbo1912 avatar Sep 19 '22 23:09 turbo1912

This PR is super exciting! THank you @turbo1912 and @Jimexist

alamb avatar Sep 20 '22 11:09 alamb

I think that rather than any options, the support for object_store should be invisible to the user. So just running this SQL from datafusion-cli should be enough to trigger the cli to register the s3 objectstore:

create external table unicorns stored as parquet location 's3://my_bucket/lineitem/';

You could do that by creating the logical plan from the user's given SQL, then only registering an object store if they're creating an external location that's in the cloud. Something like:

    let logical_plan = ctx.create_logical_plan(sql)?;
    match logical_plan {
        LogicalPlan::CreateExternalTable(external) => {
            if external.location.to_lowercase().starts_with("s3://") {
                // use https://docs.rs/object_store/latest/object_store/path/struct.Path.html
                // parse the path, find the bucket
                // register the object store 
            }
        }
        _ => {},
    }
    // continue on and execute the logical plan

kmitchener avatar Sep 20 '22 11:09 kmitchener

I think that rather than any options, the support for object_store should be invisible to the user. So just running this SQL from datafusion-cli should be enough to trigger the cli to register the s3 objectstore:

create external table unicorns stored as parquet location 's3://my_bucket/lineitem/';

You could do that by creating the logical plan from the user's given SQL, then only registering an object store if they're creating an external location that's in the cloud. Something like:

    let logical_plan = ctx.create_logical_plan(sql)?;
    match logical_plan {
        LogicalPlan::CreateExternalTable(external) => {
            if external.location.to_lowercase().starts_with("s3://") {
                // use https://docs.rs/object_store/latest/object_store/path/struct.Path.html
                // parse the path, find the bucket
                // register the object store 
            }
        }
        _ => {},
    }
    // continue on and execute the logical plan

Yes @kmitchener! I thought of this as well, but then wasn't sure if every object-store type can be inferred from the url so took the less risky approach, that doesn't close any doors for the future. For example, can file://localhost mean multiple things?

That being said, looks like there is some object-store code already that makes some assumptions in regards to the url.

I am happy to go with this approach if there is interest! @alamb, @tustvold and @Jimexist any thoughts?

turbo1912 avatar Sep 20 '22 18:09 turbo1912

I think it would be awesome to support this out of the box, and ObjectStoreRegistry should be able to handle this, after all a non-trivial amount of effort went into the design of ObjectStoreUrl to make sure this works. In particular I think it should just be a case of providing an ObjectStoreProvider that can create the various different schema.

https://github.com/apache/arrow-rs/issues/2304 may also be relevant, FYI @roeap

tustvold avatar Sep 20 '22 19:09 tustvold

In delta-rs I have been playing around with some of this, in the hopes to eventually get to a point to move this upstream into object store. Specifically, I took the OjectStoreUrl and gave it some "awareness" of what service it is referencing (s3 etc ..). (https://github.com/delta-io/delta-rs/blob/0fdaeed734b036d2bd7734eb3c4b3e56680dffff/rust/src/builder/mod.rs#L281-L407)

Also, I added parsing for some well known uris - especially azure has many different variants :) (https://github.com/delta-io/delta-rs/blob/0fdaeed734b036d2bd7734eb3c4b3e56680dffff/rust/src/builder/mod.rs#L428-L537)

Last but not least, we had several users who wanted to manage connections to several stores, so we added some logic to process property bags with several commonly used aliases for some of the config keys. and making sure passed config always takes precedence over config taken from the environment. (https://github.com/delta-io/delta-rs/blob/main/rust/src/builder/azure.rs)

Not sure how much of this - if anything :) - might be useful here or is "worthy" or moving upstream :).

roeap avatar Sep 20 '22 19:09 roeap

ObjectStoreProvider works great, thanks for the suggestion @tustvold! I gave it a try, one thing that I am a little worried about is that there is no way for ObjectStoreProvider to communicate errors, for example if the url for a s3 object store is malformed provider would just return None and the ObjectStoreRegistry would silently not find the suitable object store.

turbo1912 avatar Sep 22 '22 06:09 turbo1912

I don't think there would be any objections to changing ObjectStoreProvider to return an error, after all if it returns None it just gets converted to an error anyway by ObjectStoreRegistry::get_by_url. I'll get a quick PR up to do this

tustvold avatar Sep 22 '22 09:09 tustvold

Thank you for this, this is really cool functionality to have out of the box

tustvold avatar Sep 23 '22 19:09 tustvold

Benchmark runs are scheduled for baseline = 49b9c675abd2be96a8426038c128bb46acc76240 and contender = b6252771248da117dd3a5976d134a66a4bb60431. b6252771248da117dd3a5976d134a66a4bb60431 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2 [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q Buildkite builds: Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Sep 23 '22 19:09 ursabot

+1. Thanks, really nice!

timvw avatar Sep 24 '22 17:09 timvw