delta-rs
Can't read a Delta table from Azure Unity Catalog
Environment
- OS: Linux
- Python 3.10.10
- deltalake==0.10.2
- Cloud provider: Azure Databricks
Bug
What happened:
I am trying to replicate this example from the documentation to read a Delta Table from Databricks Unity Catalog:
from deltalake import DataCatalog, DeltaTable
catalog_name = 'main'
schema_name = 'db_schema'
table_name = 'db_table'
data_catalog = DataCatalog.UNITY
dt = DeltaTable.from_data_catalog(
    data_catalog=data_catalog,
    data_catalog_id=catalog_name,
    database_name=schema_name,
    table_name=table_name,
)
but I get the following error:
OSError: Generic MicrosoftAzure error: Error performing token request: response error "request error", after 10 retries: error sending request for url (http://<SOME-IP-ADDRESS>/metadata/identity/oauth2/token?api-version=2019-08-01&resource=https%3A%2F%2Fstorage.azure.com): error trying to connect: tcp connect error: Connection refused (os error 111)
Stacktrace:
/home/vscode/.local/lib/python3.10/site-packages/deltalake/table.py:285 in from_data_catalog
    282             database_name=database_name,
    283             table_name=table_name,
    284         )
❱   285         return cls(
    286             table_uri=table_uri, version=version, log_buffer_size=log_buffer_size
    287         )
    288

/home/vscode/.local/lib/python3.10/site-packages/deltalake/table.py:246 in __init__
    243
    244         """
    245         self._storage_options = storage_options
❱   246         self._table = RawDeltaTable(
    247             str(table_uri),
    248             version=version,
    249             storage_options=storage_options,
What you expected to happen:
I wish I could read the Delta Table
More details:
- I can read from the storage account where the data is located using other libraries in the same python interpreter so I don't think it's a firewall problem
- The same host and token work perfectly fine in the same interpreter to read data from the same Unity Catalog table using databricks-connect, so the URL and token are valid
I wish I could read the Delta Table
:laughing: me too
The Unity support in delta-rs is young, I would say. I have access to a Unity environment, but not an Azure-specific Databricks+Unity environment. I'm honestly not sure how to start here; I assume the URL that was spit out to you is at a legitimate hostname that might otherwise respond to connections from wherever you are running this Python code?
It looks like the current implementation works for storage location retrieval, but will require additional creds for data access. (in addition to Azure I also tried in AWS - similar story)
I suspect this could work if the application were running on a cloud VM with the right permissions, but I didn't test that. (That <SOME-IP-ADDRESS> is 169.254.169.254, right? That's a special link-local IP used on cloud VMs to retrieve instance metadata. Which gives a clue that no credentials with sufficient rights are available when the code tries to access the data, so it falls back to obtaining some via instance metadata.)
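For reference, the failing request in the error message is the standard Azure Instance Metadata Service (IMDS) managed-identity token call. A minimal sketch reconstructing that URL (the function and constant names here are mine; the endpoint, API version, and resource are taken from the error text):

```python
from urllib.parse import urlencode

# Link-local address that only resolves to the instance-metadata service
# when the code runs on an Azure VM; anywhere else the TCP connect fails,
# which is exactly the "Connection refused" in the reported error.
IMDS_ENDPOINT = "http://169.254.169.254/metadata/identity/oauth2/token"

def imds_token_url(resource: str = "https://storage.azure.com") -> str:
    """Rebuild the managed-identity token URL seen in the error message."""
    return IMDS_ENDPOINT + "?" + urlencode(
        {"api-version": "2019-08-01", "resource": resource}
    )

print(imds_token_url())
```

On a real Azure VM this URL is queried with a `Metadata: true` header and returns a bearer token for the given resource; on local or Databricks-external compute it is unreachable, hence the fallback fails.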
In addition to being a metadata provider, Unity on Databricks also acts as an access token provider so it can enforce ACLs, etc. Using the same pattern on local/non-Databricks compute would provide a similar experience, but I don't know if that is achievable at the moment (or ever will be).
A possible quick fix could be providing additional credentials that allow access to the storage managed by UC. For example, when I specify AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN in my environment variables for an AWS account that can read from that S3 location, it works on AWS. (Well, it resolves that error, but I then get The table's minimum reader version is 2 but deltalake only supports up to version 1 when I try to call to_pyarrow_table, but that's a different story.)
I guess this workaround may also work in Azure with the right secret/key/token/...
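On Azure, the equivalent workaround would be supplying explicit credentials via storage_options (or matching environment variables) so delta-rs never falls back to the IMDS lookup. A minimal sketch, assuming the object_store-style key names that delta-rs accepts; the DeltaTable call itself is commented out because it needs a reachable table URI:

```python
import os

def azure_storage_options() -> dict:
    """Collect explicit Azure credentials for delta-rs from the environment,
    bypassing the instance-metadata (169.254.169.254) fallback.
    The variable names mirror the object_store configuration keys."""
    opts = {}
    for key in (
        "AZURE_STORAGE_ACCOUNT_NAME",
        "AZURE_STORAGE_ACCOUNT_KEY",
        "AZURE_STORAGE_SAS_TOKEN",
        "AZURE_STORAGE_CLIENT_ID",
        "AZURE_STORAGE_CLIENT_SECRET",
        "AZURE_STORAGE_TENANT_ID",
    ):
        if key in os.environ:
            # delta-rs accepts lowercase config-key spellings as well
            opts[key.lower()] = os.environ[key]
    return opts

# Usage (requires deltalake and a reachable table):
# from deltalake import DeltaTable
# dt = DeltaTable(
#     "abfss://container@account.dfs.core.windows.net/path/to/table",
#     storage_options=azure_storage_options(),
# )
```

Passing an account key or SAS token this way sidesteps the managed-identity chain entirely, the same way the AWS_* variables did on S3.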
Actually, this looks like an expected behavior, mentioned in https://github.com/delta-io/delta-rs/pull/1331#issuecomment-1581557227
@r3stl355 This is a topic I have recently discussed with @MrPowers and some of the Databricks team. I don't have a great solution to offer at the moment other than "we're working on figuring this out" :smile:
@rtyler maybe you could include me in those future conversations, given I work for Databricks atm :grin:
The Unity Catalog in my org is becoming a huge roadblock to using delta-rs broadly outside of internal team use. No one wants to hand out read credentials to the storage anymore, which rules out delta-rs in this context. Besides the possible vendor lock-in 😄, it makes interoperability with Databricks less than ideal; for any data reads we currently fall back to the databricks-sql connector.
I have the same problem:
OSError: Generic MicrosoftAzure error: Error performing token request: response error "request error", after 10 retries: error sending request for url (http://169.254.169.254/metadata/identity/oauth2/token?api-version=2019-08-01&resource=https%3A%2F%2Fstorage.azure.com): error trying to connect: tcp connect error: A socket operation was attempted to an unreachable network. (os error 10051)
The 169.254.169.254 is used to retrieve the authentication token https://learn.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-to-use-vm-token#get-a-token-using-http
But I don't understand why this is needed, as the Databricks documentation says we need to get a short-lived token and a signed URL:
Interesting, so UC by design vends a token to read the data from storage. Then that token should just be returned when you query the Databricks REST API's get-table endpoint.
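Something like the Unity Catalog temporary-credentials endpoint could serve that purpose. A sketch that only assembles the request rather than sending it (the endpoint path and payload shape are my assumption based on the Databricks REST API and should be checked against current docs):

```python
import json

def build_temp_credential_request(host: str, token: str, table_id: str) -> dict:
    """Assemble a REST call asking Unity Catalog to vend short-lived
    storage credentials for a table. The endpoint name and body fields
    below are assumptions, not a verified contract."""
    return {
        "method": "POST",
        "url": f"{host}/api/2.1/unity-catalog/temporary-table-credentials",
        "headers": {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        # READ is the only operation a reader like delta-rs would need.
        "body": json.dumps({"table_id": table_id, "operation": "READ"}),
    }

req = build_temp_credential_request(
    "https://adb-1234.5.azuredatabricks.net",  # hypothetical workspace host
    "dapi-example-token",                      # hypothetical PAT
    "11111111-2222-3333-4444-555555555555",    # hypothetical table id
)
print(req["url"])
```

The response from such an endpoint would carry a short-lived SAS token (Azure) or STS credentials (AWS) that could then be fed into storage_options, which is essentially what credential vending automates.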
Hi @ion-elgreco, is this a good time to address this again, now that Unity Catalog OSS version 0.2.0 is released with credential vending support? Does this make it easier/clearer to implement?
https://github.com/unitycatalog/unitycatalog/releases/tag/v0.2.0
Sure, feel free to take a jab at it
Well, this could be a good excuse for me to learn Rust indeed, I might do that. But I'm not yet sure credential vending is the key enabler here. I recently saw the issue you created on the unitycatalog-python repo, https://github.com/unitycatalog/unitycatalog-python/issues/4. Is that related to this issue?
This is supported now
For the record, I'm still getting the Generic MicrosoftAzure error: Error performing token request on http://169.254.169.254 when trying to open a Delta Table on Azure Databricks without a token. This is with deltalake 0.25.2. Should I open a new issue?
Go ahead yeah