
Support for delta sharing profile via options

Open stevenayers opened this issue 1 year ago • 13 comments

Add support for delta sharing profile via options (resolves #483).

Task List:

  • [x] Add options to DeltaSharingOptions
  • [x] Unit tests for DeltaSharingOptions
  • [x] DeltaSharingProfile from DeltaSharingOptions
  • [x] Unit tests for DeltaSharingProfileProvider
  • [x] Conditional logic to create from options
  • [x] Unit tests for conditional logic
  • [x] Integration tests
  • [x] Include Examples and Documentation
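
The "conditional logic to create from options" item could look roughly like the following sketch. This is illustrative only, not the actual implementation: `resolve_profile` and its arguments are hypothetical names, and the real connector builds a `DeltaSharingProfile` object rather than a dict.

```python
# Sketch of the proposed conditional logic: build a profile from reader
# options when credentials are supplied inline, otherwise fall back to
# the existing profile-file path. All names here are illustrative.
import json
from typing import Optional


def resolve_profile(options: dict, profile_path: Optional[str]) -> dict:
    if "bearerToken" in options or "clientId" in options:
        # Credentials passed directly via options (this PR's proposal).
        return {
            "shareCredentialsVersion": options.get("shareCredentialsVersion", "1"),
            "endpoint": options["endpoint"],
            "bearerToken": options.get("bearerToken"),
            "clientId": options.get("clientId"),
            "clientSecret": options.get("clientSecret"),
        }
    if profile_path is None:
        raise ValueError("No credentials in options and no profile file given")
    # Existing behaviour: read the JSON profile from disk.
    with open(profile_path) as f:
        return json.load(f)
```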

Examples

// Bearer Token
val df = spark.read.format("deltaSharing")
  .option("shareCredentialsVersion", "1")
  .option("endpoint", "https://example.com/delta-sharing")
  .option("bearerToken", "foobar")
  .option("expirationTime", "2022-01-01T00:00:00Z")
  .load("<share-name>.<schema-name>.<table-name>")
  .limit(10)

// OAuth
val df = spark.read.format("deltaSharing")
  .option("shareCredentialsVersion", "2")
  .option("shareCredentialsType", "oauth_client_credentials")
  .option("endpoint", "https://example.com/delta-sharing")
  .option("tokenEndpoint", "https://example.com/delta-sharing/oauth/token")
  .option("clientId", "abc")
  .option("clientSecret", "xyz")
  .load("<share-name>.<schema-name>.<table-name>")
  .limit(10)

# Bearer Token
df = (
    spark.read.format("deltaSharing")
    .option("shareCredentialsVersion", "1")
    .option("endpoint", "https://example.com/delta-sharing")
    .option("bearerToken", "foobar")
    .option("expirationTime", "2022-01-01T00:00:00Z")
    .load("<share-name>.<schema-name>.<table-name>")
    .limit(10)
)

# OAuth
df = (
    spark.read.format("deltaSharing")
    .option("shareCredentialsVersion", "2")
    .option("shareCredentialsType", "oauth_client_credentials")
    .option("endpoint", "https://example.com/delta-sharing")
    .option("tokenEndpoint", "https://example.com/delta-sharing/oauth/token")
    .option("clientId", "abc")
    .option("clientSecret", "xyz")
    .load("<share-name>.<schema-name>.<table-name>")
    .limit(10)
)

# load_as_spark or load_table_changes_as_spark
import delta_sharing
from delta_sharing.protocol import DeltaSharingProfile

profile = DeltaSharingProfile( # also supports OAuth
    share_credentials_version=1,
    endpoint="https://example.com/delta-sharing",
    bearer_token="foobar",
    expiration_time="2022-01-01T00:00:00Z",
)

df = delta_sharing.load_as_spark(
    "<share-name>.<schema-name>.<table-name>",
    version=version,        # optional: pin to a table version
    timestamp=timestamp,    # optional: time-travel timestamp
    delta_sharing_profile=profile,
)
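
With the option-based API above, credentials could come from a secrets manager or the environment at runtime instead of a file on disk. A minimal illustration of the idea (the environment-variable names and the helper function are illustrative, not part of any standard):

```python
import os


def sharing_options_from_env() -> dict:
    """Build delta-sharing reader options from environment variables so
    the bearer token never touches the filesystem. The variable names
    here are illustrative only."""
    return {
        "shareCredentialsVersion": "1",
        "endpoint": os.environ["DELTA_SHARING_ENDPOINT"],
        "bearerToken": os.environ["DELTA_SHARING_TOKEN"],
    }


# With the proposed API, these options would be applied like:
#   reader = spark.read.format("deltaSharing")
#   for k, v in sharing_options_from_env().items():
#       reader = reader.option(k, v)
#   df = reader.load("<share-name>.<schema-name>.<table-name>")
```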

stevenayers avatar Sep 14 '24 07:09 stevenayers

@linzhou-db

stevenayers avatar Sep 14 '24 11:09 stevenayers

Hi @stevenayers, it seems you are aware of the previous effort https://github.com/delta-io/delta-sharing/pull/103; we didn't want to proceed with it because it introduces secrets in the code. I'll discuss this with our team again and get back to you.

linzhou-db avatar Sep 18 '24 21:09 linzhou-db


Hi @linzhou-db, I understand. However, I think some of the feedback provided by the community in #103, which counters that decision, is still very valid. Whether you are reading the secrets from a file or passing them in via options, your library still has "secrets in the code." However, the current method forces us to hard-code our credentials in a plain-text file on our filesystem (which is just as dangerous as storing them in a notebook).

Your colleague's suggestion about temp files also will not work in Python. You cannot create an in-memory temp file in Python and have it read by the underlying Scala code: a truly ephemeral temp file lives in memory and is never flushed to the filesystem, but a real on-disk path is exactly what is required when crossing process boundaries, as PySpark does when it communicates with the underlying JVM.
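
The cross-process limitation can be seen directly with the standard library: an in-memory buffer exposes no filesystem path at all, while a named temp file only gets a path by being written to disk.

```python
import io
import os
import tempfile

# An in-memory buffer has no filesystem path, so a second process
# (the JVM behind PySpark) has nothing it could be pointed at.
buf = io.BytesIO(b'{"bearerToken": "..."}')
print(hasattr(buf, "name"))  # False: no path for the JVM to open

# To hand a profile to the JVM you need a real path, which means the
# secret is flushed to disk -- precisely the exposure being discussed.
with tempfile.NamedTemporaryFile(mode="w", suffix=".share", delete=False) as f:
    f.write('{"bearerToken": "..."}')
    path = f.name
print(os.path.exists(path))  # True: readable by any local process
os.remove(path)
```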

If we look at Databricks' Security Best Practices:

Store and use secrets securely Integrating with heterogeneous systems requires managing a potentially large set of credentials and safely distributing them across an organization. Instead of directly entering your credentials into a notebook, use Databricks secrets to store your credentials and reference them in notebooks and jobs. Databricks secret management allows users to use and share credentials within Databricks securely. You can also choose to use a third-party secret management service, such as AWS Secrets Manager or a third-party secret manager.

Secrets should be handled by a dedicated secrets manager. These managers exist so that, as developers, we do not need to store secrets in plain-text files or hardcode them as variables; instead, secrets can be held in memory and passed into the logic that requires them. 😄
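
As a concrete sketch of that flow: the token is fetched from a manager straight into memory and handed to the reader options, never written to a profile file. The `fetch_secret` callable and `sharing_options` helper are illustrative names; the boto3 call shown in the docstring is the real AWS Secrets Manager API.

```python
# Sketch: pull the bearer token from a secrets manager into memory and
# pass it via reader options. `fetch_secret` is injectable so any
# manager (AWS, Azure, Databricks, ...) can plug in.


def sharing_options(endpoint: str, fetch_secret) -> dict:
    """fetch_secret: zero-arg callable returning the token as a string.

    For AWS Secrets Manager it could be, e.g.:
        lambda: boto3.client("secretsmanager").get_secret_value(
            SecretId="delta-sharing/token")["SecretString"]
    """
    return {
        "shareCredentialsVersion": "1",
        "endpoint": endpoint,
        "bearerToken": fetch_secret(),  # held in memory only
    }
```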

Doing this requires an interface such as the one proposed in this PR.

Please reach out to some of the Infrastructure & Security professionals within your organization for their opinions; I think they would share a similar sentiment to myself, @ssimeonov and @shcent.

stevenayers avatar Sep 19 '24 05:09 stevenayers

Guidelines from Microsoft's Security Fundamentals: https://learn.microsoft.com/en-us/azure/security/fundamentals/secrets-best-practices#avoid-hardcoding-secrets

Embedding secrets directly into your code or configuration files is a significant security risk. If your codebase is compromised, so are your secrets. Instead, use environment variables or configuration management tools that keep secrets out of your source code. This practice minimizes the risk of accidental exposure and simplifies the process of updating secrets.

OWASP: https://cheatsheetseries.owasp.org/cheatsheets/Secrets_Management_Cheat_Sheet.html#51-injection-of-secrets-file-in-memory

Fetch from the secret store (in-memory): A sidecar app/container fetches the secrets it needs directly from a secret manager service without dealing with docker config. This solution allows you to use dynamically constructed secrets without worrying about the secrets being viewable from the file system

stevenayers avatar Sep 19 '24 05:09 stevenayers

Completely agree with @stevenayers. Having to store the authentication token in a file at all is dangerous. AWS, Azure and Databricks (and I'm sure any other cloud provider not mentioned) all provide secrets managers, so this should not be necessary.

Using a temp file in Python is still awkward, especially on Databricks, where the path tends to change between a notebook and pure PySpark, and there is always the risk that the file is not tidied up correctly. Also, as @stevenayers mentions, in-memory versions do not work.
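
For context, the workaround required today looks something like the sketch below: materialize the profile as a real file so the connector can read it, then remember to clean it up. `write_profile` is an illustrative name; on Databricks the token would come from `dbutils.secrets.get(scope=..., key=...)` (a real Databricks API), but it still ends up on disk here, which is the objection.

```python
import json
import os
import tempfile


def write_profile(endpoint: str, token: str) -> str:
    """Write a delta-sharing profile to a temp file and return its path.
    The caller is responsible for os.remove(path) afterwards -- the
    tidy-up risk discussed above."""
    profile = {
        "shareCredentialsVersion": 1,
        "endpoint": endpoint,
        "bearerToken": token,
    }
    fd, path = tempfile.mkstemp(suffix=".share")
    with os.fdopen(fd, "w") as f:
        json.dump(profile, f)
    return path  # secret now sits on disk until removed
```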

shcent avatar Sep 19 '24 07:09 shcent


@stevenayers and @shcent thanks for sharing your thoughts, super helpful for me to get more context on this.

I just want to reply acking that we've read your comments and are actively discussing this. Will get back to you soon.

linzhou-db avatar Sep 19 '24 16:09 linzhou-db


Thanks @linzhou-db!

stevenayers avatar Sep 19 '24 16:09 stevenayers

Can you please merge this and release a new version of the library? Our project is waiting for this change 🙂

DanyMariaLee avatar Oct 03 '24 01:10 DanyMariaLee

any news on this one?

netf avatar Oct 03 '24 07:10 netf

@linzhou-db @chakankardb @zsxwing @mateiz any further thoughts on this change? thanks :)

stevenayers avatar Oct 14 '24 16:10 stevenayers

[image attachment]

stevenayers avatar Oct 26 '24 11:10 stevenayers

Any progress on this?

glima03 avatar Feb 10 '25 14:02 glima03

@glima03 I had a call with Databricks last week about this feature to try and get some eyes on it. Hopefully we get some movement soon.

stevenayers avatar Feb 10 '25 16:02 stevenayers

I'd like to double-click on this PR. @stevenayers, after your last comment I see this was closed more recently. Any reason for that? If so, was there a different PR that made it in? My team would love to have this feature, as we too cannot pass creds via files; we have our own secrets management system to integrate with and want to push creds via DataFrame options.

prodeezy avatar Oct 24 '25 22:10 prodeezy


No progress I'm afraid

stevenayers-bge avatar Oct 29 '25 09:10 stevenayers-bge

Can we split this into two separate PRs: one for spark and one for python?

chakankardb avatar Oct 30 '25 07:10 chakankardb

@chakankardb if you'd like to do it, please feel free.

stevenayers avatar Oct 30 '25 12:10 stevenayers

@chakankardb @stevenayers I can take up splitting it into two PRs.

prodeezy avatar Oct 31 '25 21:10 prodeezy