delta-sharing
load_as_spark using config as variable instead of file
Hi all,
I want to use load_as_spark, but instead of saving the config credentials as a file, I want to pull them from a secrets manager (Azure Key Vault, AWS Secrets Manager, Databricks Secrets). I want to avoid saving the JSON config to the filesystem for security reasons.
For example:
delta_sharing_config = dbutils.secrets.get(scope="delta-sharing", key="my-config")
df = delta_sharing.load_as_spark(delta_sharing_config + '#<share name>.<schema>.<table name>')
Is this possible?
Thanks
@stevenayers-bge: Would you be satisfied with building the JSON object yourself, pulling the secrets from the key vault and any other configuration from your application's resource configuration (or even from the key vault as well)?
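For illustration, a minimal sketch of that suggestion, assuming a Databricks notebook and hypothetical secret scope/key names; only the token comes from the secret store, the rest is ordinary configuration:
import json
# Hypothetical secret name -- only the bearer token is sensitive.
bearer_token = dbutils.secrets.get(scope="delta-sharing", key="bearer-token")
endpoint = "https://sharing.example.com/delta-sharing/"  # from application config
# Assemble the profile JSON in memory, never touching the filesystem.
profile_json = json.dumps({
    "shareCredentialsVersion": 1,
    "endpoint": endpoint,
    "bearerToken": bearer_token,
})
The resulting string can then be parsed with DeltaSharingProfile.from_json, as shown in the next comment.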
@aimtsou initializing the REST client works fine, for example:
import delta_sharing
from delta_sharing.protocol import DeltaSharingProfile
# GET SECRET VALUE
delta_sharing_config = dbutils.secrets.get(scope="delta-sharing", key="my-config")
# PASS SECRET CONFIG WITHOUT SAVING TO FILE
profile = DeltaSharingProfile.from_json(delta_sharing_config)
client = delta_sharing.SharingClient(profile)
# List tables available in share
client.list_all_tables()
The issue is when you go to read the data into a DataFrame. You cannot use SharingClient or DeltaSharingProfile to authenticate when reading the DataFrame; you have to pass the profile file path in the load_as_spark URL argument:
import delta_sharing
share_file = "/tmp/my-secret-config-stored-in-a-file" # security problem!
df = delta_sharing.load_as_spark(share_file + '#nep_test.reference.settlement_period_calendar')
@aimtsou @linzhou-db this old PR would solve the issue https://github.com/delta-io/delta-sharing/pull/103
@stevenayers,
Now I get your point. Yes, the load_as_spark function does not support SharingClient, so you are proposing something different from me: passing the whole JSON stored in secure storage as a profile to Spark, whereas I proposed building the JSON object in Python, which is closer to what @zsxwing suggests. Those are two different implementations, to be honest.
My question is: what is your incentive for storing the whole file in a secret store? Usually we retrieve only the secret value. In your case, anyone who has access to the Databricks workspace already has access to the whole file, since it is in the secret store.
@aimtsou I don't mind how it's done 🙂 I'm not sure your PR will help though: load_as_spark doesn't interact with the Python SharingClient.
load_as_spark passes the config file path to the JVM, and the file is read and parsed within the Delta Sharing Scala library.
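To make that concrete, here is roughly what the Python load_as_spark does today (a simplified sketch based on the behaviour described in this thread, not the verbatim library source):
from pyspark.sql import SparkSession

def load_as_spark_sketch(url: str):
    # url is "<profile-file-path>#<share>.<schema>.<table>"; the whole string,
    # including the path to the profile file, is handed to the Spark data source.
    spark = SparkSession.getActiveSession()
    assert spark is not None, "load_as_spark requires an active SparkSession"
    # The profile file is then opened and parsed on the JVM side by the
    # Delta Sharing Scala connector, which is why a real file path is required.
    return spark.read.format("deltaSharing").load(url)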
I don't have an incentive to store the whole file in a secret manager, but it is often easier to store the whole JSON config in one secret.
It would be equally secure if you only stored the bearerToken in a secrets manager, but then you're storing some of the config on a filesystem, and some in a secrets manager, which is a bit messy.
All I'm proposing is that we can pass profile configuration via parameters, rather than passing in a filepath. So something like:
share_endpoint = ...  # pulled in from a secrets manager or whatever
share_token = ...  # pulled in from a secrets manager or whatever

delta_sharing.load_as_spark(
    share_name='<share-name>',
    schema_name='<schema name>',
    table_name='my_table',
    endpoint=share_endpoint,
    bearer_token=share_token
)

# or pass a client object?
client: SharingClient = ...
delta_sharing.load_as_spark(client=client, share_name=..., etc.)
@aimtsou does that make sense?
For me, yes. And as you said, there is only a df.load(url) in load_as_spark, so indeed it needs to be done on the Scala side.
This is causing me a whole world of pain on Databricks: having to write the config into a temporary file whose path sometimes needs a /dbfs prefix and sometimes doesn't. I completely agree that a solution that does not rely on a file-based URL would be much better.
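For reference, a minimal sketch of the workaround being described, assuming a Databricks notebook and hypothetical secret/share names; whether the Spark-side path needs the /dbfs prefix depends on the environment, which is exactly the pain point:
import delta_sharing

# Pull the profile JSON from the secret store, then write it to a file
# just so load_as_spark can be given a path -- the step we'd like to avoid.
config_json = dbutils.secrets.get(scope="delta-sharing", key="my-config")

profile_path = "/dbfs/tmp/delta_sharing_profile.share"  # FUSE path for open()
with open(profile_path, "w") as f:
    f.write(config_json)

# Depending on the environment, the path passed here may need to keep or
# drop the /dbfs prefix before the Scala side can read it.
df = delta_sharing.load_as_spark(profile_path + "#<share name>.<schema>.<table name>")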
@linzhou-db @moderakh @zhu-tom @pranavsuku-db do you have any time to look at this? I can also raise a PR with the changes if that helps?
I know it's available in the Rust client, and I've seen you're using maturin/pyo3, so it might already be possible; I'm just not very familiar with how Rust and Python interact.
Please let me know if there's anything I can do to help 🙏
@stevenayers-bge sure feel free to send out the PR.