
Support Directory Based Access

Open zhu-tom opened this issue 6 months ago • 3 comments

This is a proposal to support directory based access in the Delta Sharing Protocol.

Motivation

The Delta Sharing protocol currently grants temporary access to a table's files via its QueryTable API. This operation can be expensive, as the server needs to unpack the delta log, discover the parquet data files needed for the query, and generate pre-signed URLs for them. A petabyte scale table can have millions of data files, which can put significant load on the server and degrade query performance.

Similar in spirit to UC OSS API for GenerateTemporaryTableCredentials, we would like to support sharing tables with Cloud Tokens, which are directory (prefix) based STS tokens that grant temporary read access to the table’s root directory. This approach bypasses the pre-signing workflow, and instead provides direct read only access to the table. The query engines that are capable of processing the delta log get direct access to it, and can optimize query performance by leveraging their custom metadata optimizations, caching and distributed metadata processing.

We propose to add directory based access to the delta sharing protocol to enrich the open sharing ecosystem further.

Protocol Changes

Delta Sharing Capabilities Header

We will introduce a new key accessMode in the delta sharing capabilities header. It will contain a comma separated list of supported access modes for the table. The values allowed are url,dir. url maps to the current pre-signed URL based access that the protocol already supports. dir maps to the new directory based access using temporary directory scoped cloud tokens. The client will call QueryTableMetadata with the accessMode capabilities header that it supports (this could be both url and dir). The server will then send a response with the header of the accessMode capability to use for the table, url or dir. If the server returns both, then the client can decide which to use. If they are incompatible, the server should throw an error. Lack of the accessMode field should default to pre-signed URL based access. This capability can only be used for QueryTableMetadata and is table-specific.
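
To make the header handling concrete, here is a minimal sketch of how a client or server might read the accessMode key out of a delta-sharing-capabilities value. It assumes the header keeps its existing `key=value;key=value` layout with comma separated values, and that keys and values are compared case-insensitively. The class and method names (`CapabilitiesHeader`, `parse`, `accessModes`) are hypothetical and not part of the proposal.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class CapabilitiesHeader {
    // Parses "k1=v1,v2;k2=v3" into a map from lower-cased key to its list of lower-cased values.
    static Map<String, List<String>> parse(String headerValue) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        if (headerValue == null) {
            return result;
        }
        for (String entry : headerValue.split(";")) {
            int eq = entry.indexOf('=');
            if (eq <= 0) {
                continue; // skip empty or malformed entries
            }
            String key = entry.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            List<String> values = new ArrayList<>();
            for (String value : entry.substring(eq + 1).split(",")) {
                String trimmed = value.trim().toLowerCase(Locale.ROOT);
                if (!trimmed.isEmpty()) {
                    values.add(trimmed);
                }
            }
            result.put(key, values);
        }
        return result;
    }

    // A missing accessMode key defaults to pre-signed URL based access, as stated in the proposal.
    static List<String> accessModes(String headerValue) {
        return parse(headerValue).getOrDefault("accessmode", List.of("url"));
    }
}
```

For example, `CapabilitiesHeader.accessModes("responseformat=delta;accessMode=url,dir")` yields the two modes `url` and `dir`, while a header without the key yields only `url`.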

Compatibility

| Client / Server | Server that doesn't recognize the header | Server supports accessMode=url | Server supports accessMode=dir | Server supports accessMode=url,dir |
| --- | --- | --- | --- | --- |
| Client that doesn't specify the header | Client proceeds with URL based access as before. | Server responds with accessMode=url and client proceeds with URL based access as before. | Server should throw as it only supports directory based access. | Client proceeds with URL based access as before. |
| Client requests accessMode=url | Client proceeds with URL based access as before. | Server responds with accessMode=url and client proceeds with URL based access. | Server should throw as it only supports directory based access. | Server responds with accessMode=url and client proceeds with URL based access. |
| Client requests accessMode=dir | Client should throw as it only supports directory based access. | Server responds with accessMode=url and client should throw as it only supports directory based access. | Server responds with accessMode=dir and client proceeds with directory based access. | Server responds with accessMode=dir and client proceeds with directory based access. |
| Client requests accessMode=url,dir | Client proceeds with URL based access. | Server responds with accessMode=url and client proceeds with URL based access. | Server responds with accessMode=dir and client proceeds with directory based access. | Client can decide which access mode to proceed with. |
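
The matrix can be read as a small client-side decision procedure. The sketch below is one possible interpretation, not part of the proposal: an empty set stands for a side that did not send the accessMode key (treated as url), and preferring dir when both modes are available is an assumption, since the proposal leaves that choice to the client. The class name `AccessModeNegotiation` and method `choose` are hypothetical.

```java
import java.util.Set;

final class AccessModeNegotiation {
    // Returns the access mode the client should proceed with, or throws when no mode is shared.
    // clientModes: modes the client sent; serverModes: modes found in the server's response header.
    static String choose(Set<String> clientModes, Set<String> serverModes) {
        // A missing accessMode key defaults to pre-signed URL based access.
        Set<String> client = clientModes.isEmpty() ? Set.of("url") : clientModes;
        Set<String> server = serverModes.isEmpty() ? Set.of("url") : serverModes;
        // Assumption: prefer directory based access whenever both sides allow it.
        if (client.contains("dir") && server.contains("dir")) {
            return "dir";
        }
        if (client.contains("url") && server.contains("url")) {
            return "url";
        }
        throw new IllegalStateException(
                "No compatible accessMode: client=" + client + ", server=" + server);
    }
}
```

A client that only supports dir and talks to a server that answers with url (or with no accessMode at all) ends up in the throwing branch, which matches the third row of the matrix.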

Query Table Metadata

When the client calls QueryTableMetadata with an accessMode capability containing dir and the server supports directory based access for the table, QueryTableMetadata must return the location of the table for directory based access. If the client does not support directory based access, the location field is optional. However, we recommend including it anyway so that recipients with network restrictions can allow access to these locations. auxiliaryLocations is an optional field that lists any auxiliary storage locations for the table. Each of these locations should be accepted in the auxiliaryLocation field of the Generate Temporary Table Credential request body.

Parquet

{
  "protocol": {
    "minReaderVersion": 1
  }
}
{
  "metaData": {
    "id": "f8d5c169-3d01-4ca3-ad9e-7dc3355aedb2",
    "location": "{scheme}://some/path/to/table",
    "auxiliaryLocations": [
       "{scheme}://some/path/1",
       "{scheme}://some/path/2"
    ],
    "format": {
      "provider": "parquet"
    },
    "schemaString": "{\"type\":\"struct\",\"fields\":[{\"name\":\"eventTime\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}},{\"name\":\"date\",\"type\":\"date\",\"nullable\":true,\"metadata\":{}}]}",
    "partitionColumns": [
      "date"
    ]
  }
}

Delta

{
  "metaData": {
    "version": 20,
    "size": 123456,
    "numFiles": 5,
    "location": "{scheme}://some/path/to/table",
    "auxiliaryLocations": [
       "{scheme}://some/path/1",
       "{scheme}://some/path/2"
    ],
    "deltaMetadata": {
      "partitionColumns": [
        "date"
      ],
      "format": {
        "provider": "parquet"
      },
      "schemaString": "{\"type\":\"struct\",\"fields\":[{\"name\":\"eventTime\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}},{\"name\":\"date\",\"type\":\"date\",\"nullable\":true,\"metadata\":{}}]}",
      "id": "f8d5c169-3d01-4ca3-ad9e-7dc3355aedb2",
      "configuration": {
        "enableChangeDataFeed": "true"
      }
    }
  }
}
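
Since a Delta log may contain absolute file paths that fall outside the table's root directory, a client could use location and auxiliaryLocations to decide which directory to request credentials for. The following is a minimal sketch under the assumption that paths and locations are directly comparable strings (no normalization, no relative path resolution). `LocationResolver` and its methods are hypothetical names.

```java
import java.util.List;
import java.util.Optional;

final class LocationResolver {
    // Returns the location (main first, then auxiliary) whose directory contains filePath, if any.
    static Optional<String> coveringLocation(String filePath, String location,
                                             List<String> auxiliaryLocations) {
        if (isUnder(filePath, location)) {
            return Optional.of(location);
        }
        for (String auxiliary : auxiliaryLocations) {
            if (isUnder(filePath, auxiliary)) {
                return Optional.of(auxiliary);
            }
        }
        return Optional.empty();
    }

    // Prefix check on a directory boundary, so ".../table" does not match ".../table2/file".
    static boolean isUnder(String filePath, String directory) {
        String prefix = directory.endsWith("/") ? directory : directory + "/";
        return filePath.startsWith(prefix);
    }
}
```

When the result is an auxiliary location, that value is what the client would pass as auxiliaryLocation when requesting temporary credentials.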

Generate Temporary Table Credential

Given that the directory and URL access code paths are distinct, their respective endpoints should remain separate rather than being combined. The response follows the format of GenerateTemporaryTableCredential in UC OSS with the addition of R2Credentials as Delta Sharing supports Cloudflare R2. The location field is also added to introduce a potentially lightweight approach which avoids the metadata call and pre-processing the delta log. It should be the location which the credentials are generated for. Clients that do not support reading from a cloud vendor can throw an error.

HTTP Request Value
Method

POST

Headers

Authorization: Bearer {token}

Optional: Content-Type: application/json; charset=utf-8

Optional: delta-sharing-capabilities: responseformat=delta;readerfeatures=deletionvectors;accessMode=url,dir

URL

{prefix}/shares/{share}/schemas/{schema}/tables/{table}/temporary-table-credentials

URL Parameters

{share}: The share name to query. It's case-insensitive.

{schema}: The schema name to query. It's case-insensitive.

{table}: The table name to query. It's case-insensitive.

Request Body

The auxiliaryLocation field is optional and specifies the auxiliary location URL path to generate temporary credentials for. If this field is not provided, the response should contain credentials for the table's main location. If the main location is specified the server should still respond with the credential.

{
  "auxiliaryLocation": "{scheme}://some/path/to/table"
}
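
As an illustration of the request shape, the sketch below builds (without sending) the POST request using the JDK's `java.net.http` API. It assumes the share, schema, and table names are already URL-safe, and it sends an empty JSON object when no auxiliaryLocation is given, which is an assumption rather than something the proposal specifies. `TemporaryCredentialRequests` and its parameters are hypothetical names.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

final class TemporaryCredentialRequests {
    // Builds the temporary-table-credentials request; all arguments are caller supplied placeholders.
    static HttpRequest build(String endpointPrefix, String share, String schema, String table,
                             String bearerToken, String auxiliaryLocation) {
        URI uri = URI.create(endpointPrefix + "/shares/" + share + "/schemas/" + schema
                + "/tables/" + table + "/temporary-table-credentials");
        // Without auxiliaryLocation, the server should return credentials for the main location.
        String body = auxiliaryLocation == null
                ? "{}"
                : "{\"auxiliaryLocation\":\"" + escapeJson(auxiliaryLocation) + "\"}";
        return HttpRequest.newBuilder(uri)
                .header("Authorization", "Bearer " + bearerToken)
                .header("Content-Type", "application/json; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();
    }

    // Minimal escaping for backslashes and double quotes inside a JSON string value.
    static String escapeJson(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

The resulting request can be passed to `HttpClient.send` with a string body handler; parsing the JSON response is left to whichever JSON library the client already uses.
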

Response Body

Only one of awsTempCredentials, azureUserDelegationSas, gcpOauthToken, r2Credentials should be defined.

{
  "location": "{scheme}://some/path/to/table",
  "awsTempCredentials": {
    "accessKeyId": "string",
    "secretAccessKey": "string",
    "sessionToken": "string"
  },
  "azureUserDelegationSas": {
    "sasToken": "string"
  },
  "gcpOauthToken": {
    "oauthToken": "string"
  },
  "r2Credentials": {
    "accessKeyId": "string",
    "secretAccessKey": "string",
    "sessionToken": "string"
  },
  "expirationTime": 123456789
}

TemporaryCredentials

Only one of awsTempCredentials, azureUserDelegationSas, gcpOauthToken, r2Credentials should be defined. Their definitions follow Unity Catalog OSS models and APIs.

| Name | Type | Description | Notes |
| --- | --- | --- | --- |
| location | string | The directory which the temporary credentials are granted read access to. | [required] |
| awsTempCredentials | AwsCredentials | | [optional] |
| azureUserDelegationSas | AzureUserDelegationSAS | | [optional] |
| gcpOauthToken | GcpOauthToken | | [optional] |
| r2Credentials | R2Credentials | | [optional] |
| expirationTime | Long | Server time when the credential will expire, in epoch milliseconds. The API client is advised to cache the credential given this expiration time. | [required] |
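
The two rules in this section (exactly one vendor credential, and caching until expirationTime) can be sketched as simple client-side checks. This is an illustrative sketch only: the credential objects are passed as plain `Object` references, and `refreshMarginMillis` is an assumed safety margin that the proposal does not define. `TemporaryCredentialChecks` is a hypothetical name.

```java
import java.util.Objects;
import java.util.stream.Stream;

final class TemporaryCredentialChecks {
    // True when exactly one of the four vendor credential objects is present (non-null).
    static boolean hasExactlyOneCredential(Object awsTempCredentials, Object azureUserDelegationSas,
                                           Object gcpOauthToken, Object r2Credentials) {
        return Stream.of(awsTempCredentials, azureUserDelegationSas, gcpOauthToken, r2Credentials)
                .filter(Objects::nonNull)
                .count() == 1;
    }

    // True when a cached credential should be refreshed.
    // expirationTimeMillis is the response's expirationTime (epoch milliseconds).
    static boolean needsRefresh(long expirationTimeMillis, long nowMillis, long refreshMarginMillis) {
        return nowMillis >= expirationTimeMillis - refreshMarginMillis;
    }
}
```

A client would typically call `needsRefresh(expirationTime, System.currentTimeMillis(), margin)` before reusing a cached credential and request a new one when it returns true.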

AwsCredentials

| Name | Type | Description | Notes |
| --- | --- | --- | --- |
| accessKeyId | String | The access key ID that identifies the temporary credentials. | [required] |
| secretAccessKey | String | The secret access key that can be used to sign AWS API requests. | [required] |
| sessionToken | String | The token that users must pass to AWS API to use the temporary credentials. | [required] |

AzureUserDelegationSAS

| Name | Type | Description | Notes |
| --- | --- | --- | --- |
| sasToken | String | Azure SAS Token | [required] |

GcpOauthToken

| Name | Type | Description | Notes |
| --- | --- | --- | --- |
| oauthToken | String | Gcp Token | [required] |

R2Credentials

| Name | Type | Description | Notes |
| --- | --- | --- | --- |
| accessKeyId | String | The access key ID that identifies the temporary credentials. | [required] |
| secretAccessKey | String | The secret access key associated with the access key. | [required] |
| sessionToken | String | The generated JWT that users must pass to use the temporary credentials. | [required] |

Delta Kernel Example

To load a table shared with directory access using Delta Kernel, follow these steps from the Delta Kernel documentation. The Hadoop configuration needs to be modified to contain the credentials used to authenticate with the cloud provider.

import io.delta.kernel.*;
import io.delta.kernel.defaults.*;
import io.delta.kernel.defaults.engine.DefaultEngine;
import io.delta.kernel.engine.Engine;
import org.apache.hadoop.conf.Configuration;

// Table location returned by the sharing server for directory based access.
String myTablePath = "s3://some/path/to/table";
Configuration hadoopConf = new Configuration();
// Authenticate with the temporary (session) credentials issued for the table.
hadoopConf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider");
hadoopConf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY_ID");
hadoopConf.set("fs.s3a.secret.key", "YOUR_SECRET_ACCESS_KEY");
hadoopConf.set("fs.s3a.session.token", "YOUR_SESSION_TOKEN");
Engine myEngine = DefaultEngine.create(hadoopConf);
Table myTable = Table.forPath(myEngine, myTablePath);
...
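
Rather than hard-coding placeholder strings, a client could derive these settings from the awsTempCredentials object in the credential response. The helper below is a small sketch that only builds the key/value pairs shown above, so it has no Hadoop dependency. `S3aCredentialProperties` is a hypothetical name, and the mapping covers only the AWS case.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class S3aCredentialProperties {
    // Maps the fields of awsTempCredentials onto the S3A settings used in the example above.
    static Map<String, String> fromAwsTempCredentials(String accessKeyId, String secretAccessKey,
                                                      String sessionToken) {
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider");
        properties.put("fs.s3a.access.key", accessKeyId);
        properties.put("fs.s3a.secret.key", secretAccessKey);
        properties.put("fs.s3a.session.token", sessionToken);
        return properties;
    }
}
```

Each entry can then be applied to the Hadoop configuration, for example with `properties.forEach(hadoopConf::set)`.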

zhu-tom avatar Oct 10 '25 00:10 zhu-tom

Hi @zhu-tom, Thank you for sharing this proposal—it’s a great step forward! I’d like to contribute with a few personal views and suggestions:

1. Compatibility Section

I was wondering if using the capabilities header to enforce configuration might introduce a breaking change between server and client versions. For instance, a newer client may prefer accessMode=dir, but if it communicates with a server that doesn’t recognize this header, it might not find the location field, potentially causing a failure. Would it make sense for accessMode to follow the same pattern established for responseFormat? Specifically, the client could always provide it as a supported capability, while the server selects one of the options and includes the selected value in the response headers. This way, the client can handle the response body accordingly.

2. Temporary Credentials Flow

Regarding the flow for obtaining temporary credentials, wouldn't the server still need to unpack the delta log to serve the Metadata object? Providing the location and credentials objects together might be a more lightweight approach for the server, as it could avoid pre-processing delta logs that the client will already handle. However, I understand the importance of maintaining alignment with the UnityCatalog OSS API. Perhaps response headers could be used to indicate the schema and location separately? It would also prevent any extra client-side parsing of the location field. For example:

* Delta-Table-Location: {location}
* Delta-Table-Location-Schema: {schema}


Looking forward to hearing your thoughts! Best regards, Daniel

dmattos-sap avatar Oct 13 '25 16:10 dmattos-sap

> Hi @zhu-tom, Thank you for sharing this proposal; it’s a great step forward! I’d like to contribute with a few personal views and suggestions:
>
> 1. Compatibility Section
>
> I was wondering if using the capabilities header to enforce configuration might introduce a breaking change between server and client versions. For instance, a newer client may prefer accessMode=dir, but if it communicates with a server that doesn’t recognize this header, it might not find the location field, potentially causing a failure. Would it make sense for accessMode to follow the same pattern established for responseFormat? Specifically, the client could always provide it as a supported capability, while the server selects one of the options and includes the selected value in the response headers. This way, the client can handle the response body accordingly.

Sounds good; we will add this recommendation: clients should specify all the access modes they can handle, not just the one they prefer. Depending on the server response, the client can then pick the right endpoint to invoke.

> 2. Temporary Credentials Flow
>
> Regarding the flow for obtaining temporary credentials, wouldn't the server still need to unpack the delta log to serve the Metadata object? Providing the location and credentials objects together might be a more lightweight approach for the server, as it could avoid pre-processing delta logs that the client will already handle. However, I understand the importance of maintaining alignment with the UnityCatalog OSS API. Perhaps response headers could be used to indicate the schema and location separately? It would also prevent any extra client-side parsing of the location field. For example:
>
> * Delta-Table-Location: {location}
> * Delta-Table-Location-Schema: {schema}

Are you suggesting that we return the location in the response of GenerateTemporaryTableCredentials? If so, we can add that.

> Looking forward to hearing your thoughts! Best regards, Daniel

chakankardb avatar Oct 15 '25 16:10 chakankardb

> Are you suggesting that we return the location in the response of GenerateTemporaryTableCredentials? If so, we can add that.

Exactly, having the location in the response of GenerateTemporaryTableCredentials would be ideal.

dmattos-sap avatar Oct 15 '25 17:10 dmattos-sap

Just a detail here: we are using camelCase for TemporaryCredentials, but in Unity Catalog the model is actually snake_case: TemporaryCredentials. Also, we might need to add "region" for AWS here, since it might be required by some SDKs.

{
  "aws_temp_credentials": {
    "access_key_id": "string",
    "secret_access_key": "string",
    "session_token": "string",
    "region": "string"
  },
  "azure_user_delegation_sas": {
    "sas_token": "string"
  },
  "gcp_oauth_token": {
    "oauth_token": "string"
  },
  "expiration_time": 0
}

dmattos-sap avatar Dec 10 '25 14:12 dmattos-sap

Hi @dmattos-sap, from my perspective since everything in Delta Sharing OSS Protocol is camel case it makes more sense to stay consistent (even though UC is snake case)

zhu-tom avatar Dec 12 '25 23:12 zhu-tom

Yes, I agree @zhu-tom with camelCase.

I have a question about auxiliaryLocations. What is actually expected to be in these locations? Is the expectation that they hold only data partitions of the Delta table, while location holds the _delta_log? Or are these auxiliary locations a kind of replica of the table, with the _delta_log and data files in a different place?

I'm just curious: in which cases should the client use this field when it already has the table root from location?

Thank you.

dmattos-sap avatar Dec 13 '25 01:12 dmattos-sap

@dmattos-sap this addition mainly comes from the possibility of absolute paths in the Delta log, which may point outside the table's root directory. This field is optional, since these locations can also be discovered by reading the metadata from the table location.

zhu-tom avatar Dec 17 '25 02:12 zhu-tom