load_as_spark using config as variable instead of file

stevenayers-bge opened this issue 10 months ago · 10 comments

Hi all,

I want to use load_as_spark, but instead of saving the config credentials to a file, I want to pull them from a secrets manager (Azure Key Vault, AWS Secrets Manager, Databricks Secrets). I want to avoid writing the JSON config to the filesystem for security reasons.

For example:

delta_sharing_config = dbutils.secrets.get(scope="delta-sharing", key="my-config")

df = delta_sharing.load_as_spark(delta_sharing_config + '#<share name>.<schema>.<table name>')

Is this possible?

Thanks

stevenayers-bge · Apr 24 '24 16:04

@stevenayers-bge: Would you be satisfied with building the JSON object yourself, pulling the secrets from the key vault and any other configuration from your application's resource configuration (or even from the key vault as well)?

aimtsou · Apr 25 '24 09:04

@aimtsou initializing the REST client works fine, for example:

import delta_sharing
from delta_sharing.protocol import DeltaSharingProfile

# GET SECRET VALUE
delta_sharing_config = dbutils.secrets.get(scope="delta-sharing", key="my-config")

# PASS SECRET CONFIG WITHOUT SAVING TO FILE
profile = DeltaSharingProfile.from_json(delta_sharing_config)
client = delta_sharing.SharingClient(profile)

# List tables available in share
client.list_all_tables()

The issue is when you go to read the data into a DataFrame. You cannot use SharingClient or DeltaSharingProfile to authenticate when reading the DataFrame; you have to pass the file path in the load_as_spark URL argument:

import delta_sharing

share_file = "/tmp/my-secret-config-stored-in-a-file" # security problem!

df = delta_sharing.load_as_spark(share_file + '#nep_test.reference.settlement_period_calendar')
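
In the meantime, a workaround sketch, assuming a short-lived temp file on the driver is tolerable (NamedTemporaryFile creates it with owner-only permissions, and it is removed as soon as the DataFrame is defined):

import os
import tempfile

import delta_sharing

# Pull the profile JSON straight from the secret store
# (dbutils is a Databricks notebook global)
delta_sharing_config = dbutils.secrets.get(scope="delta-sharing", key="my-config")

# NamedTemporaryFile is created with owner-only (0600) permissions
with tempfile.NamedTemporaryFile(mode="w", suffix=".share", delete=False) as f:
    f.write(delta_sharing_config)
    profile_path = f.name

try:
    df = delta_sharing.load_as_spark(
        profile_path + "#nep_test.reference.settlement_period_calendar"
    )
finally:
    # The Scala library appears to read the profile when the DataFrame is
    # defined; keep the file around longer if your workload re-reads it.
    os.remove(profile_path)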

stevenayers · Apr 25 '24 10:04

@aimtsou @linzhou-db this old PR would solve the issue: https://github.com/delta-io/delta-sharing/pull/103

stevenayers · Apr 25 '24 10:04

@stevenayers,

Now I get your point. Yes, the load_as_spark function does not support SharingClient, so you are proposing something different from me: passing the whole JSON stored in secure storage to Spark as a profile, whereas I proposed building the JSON object in Python, which is closer to what @zsxwing suggests. Honestly, I see two different implementations.

My question is: what is your incentive for storing the whole file in a secret store? Usually we retrieve only the secret value. In your case, anyone with access to this Databricks workspace already has access to the whole file, since it lives in the secret store.

aimtsou · Apr 25 '24 11:04

@aimtsou I don't mind how it's done 🙂 I'm not sure your PR will help, though: load_as_spark doesn't interact with the Python SharingClient.

load_as_spark passes the config file path to the JVM, and the file is read and parsed within the Delta Sharing Scala library.
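
For reference, a paraphrased sketch of what load_as_spark does today (not the verbatim source; details vary by version):

from pyspark.sql import SparkSession

def load_as_spark_sketch(url: str):
    # The whole URL, including the profile *file path* before '#', is handed
    # to the JVM data source, which opens and parses the file on the Scala side.
    spark = SparkSession.getActiveSession()
    assert spark is not None, "requires an active SparkSession"
    return spark.read.format("deltaSharing").load(url)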

stevenayers-bge · Apr 25 '24 11:04

> Now I get your point. Yes, the load_as_spark function does not support SharingClient, so you are proposing something different from me: passing the whole JSON stored in secure storage to Spark as a profile, whereas I proposed building the JSON object in Python, which is closer to what @zsxwing suggests. Honestly, I see two different implementations.
>
> My question is: what is your incentive for storing the whole file in a secret store? Usually we retrieve only the secret value. In your case, anyone with access to this Databricks workspace already has access to the whole file, since it lives in the secret store.

I don't have an incentive to store the whole file in a secret manager, but it is often easier to store the whole JSON config in one secret.

It would be equally secure to store only the bearerToken in a secrets manager, but then some of the config lives on a filesystem and some in a secrets manager, which is a bit messy.

All I'm proposing is that we can pass the profile configuration via parameters rather than a file path. So something like:

share_endpoint = ... pulled in from a secretsmanager or whatever
share_token = ... pulled in from a secretsmanager or whatever

delta_sharing.load_as_spark(
    share_name='<share-name>',
    schema_name='<schema name>',
    table_name='my_table',
    endpoint=share_endpoint,
    bearer_token=share_token
)

# or pass a client object?
client: SharingClient = ...
delta_sharing.load_as_spark(client=client, share_name=.... etc)

@aimtsou does that make sense?
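
Until something like that exists, the proposed signature can be approximated in user code. A hypothetical shim (load_as_spark_with_profile is not part of the library; it just hides the temp-file dance behind the parameters proposed above):

import json
import os
import tempfile

import delta_sharing

def load_as_spark_with_profile(endpoint, bearer_token, share_name, schema_name, table_name):
    # Build the profile JSON in memory; the temp file exists only because
    # the current API insists on a filesystem path.
    profile = {
        "shareCredentialsVersion": 1,
        "endpoint": endpoint,
        "bearerToken": bearer_token,
    }
    with tempfile.NamedTemporaryFile(mode="w", suffix=".share", delete=False) as f:
        json.dump(profile, f)
        path = f.name
    try:
        return delta_sharing.load_as_spark(f"{path}#{share_name}.{schema_name}.{table_name}")
    finally:
        os.remove(path)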

stevenayers-bge · Apr 25 '24 11:04

For me, yes. And as you said, there is only a df.load(url) inside load_as_spark, so indeed it needs to be done on the Scala side.

aimtsou · Apr 25 '24 11:04

This is causing me a whole world of pain on Databricks: having to write the config into a temporary file that sometimes needs a /dbfs prefix and sometimes doesn't. I completely agree that a solution that does not rely on a file-based URL would be much better.
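
For anyone hitting the same thing, a sketch of the path dance as I understand it (assuming a classic workspace with the DBFS FUSE mount at /dbfs):

import delta_sharing

profile_json = dbutils.secrets.get(scope="delta-sharing", key="my-config")

# Python's local-file API goes through the FUSE mount, so it needs the /dbfs prefix...
with open("/dbfs/tmp/profile.share", "w") as f:
    f.write(profile_json)

# ...while the JVM resolves the profile path via Hadoop, so the same file is
# addressed as a dbfs:/ URI, without the /dbfs prefix
df = delta_sharing.load_as_spark("dbfs:/tmp/profile.share#<share>.<schema>.<table>")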

shcent · Jun 28 '24 13:06

@linzhou-db @moderakh @zhu-tom @pranavsuku-db do you have any time to look at this? I can also raise a PR with the changes if that helps.

I know this is available in the Rust client, and I've seen you're using maturin/pyo3, so it might already be possible; I'm just not very familiar with how Rust and Python interact.

Please let me know if there's anything I can do to help 🙏

stevenayers-bge · Sep 07 '24 09:09

@stevenayers-bge sure, feel free to send out the PR.

linzhou-db · Sep 07 '24 19:09