Can't read a Delta table from Azure Unity Catalog

Open MigQ2 opened this issue 2 years ago • 8 comments

Environment

  • Linux
  • python 3.10.10
  • deltalake==0.10.2

Environment:

  • Cloud provider: Azure Databricks

Bug

What happened:

I am trying to replicate this example from the documentation to read a Delta Table from Databricks Unity Catalog:

from deltalake import DataCatalog, DeltaTable
catalog_name = 'main'
schema_name = 'db_schema'
table_name = 'db_table'
data_catalog = DataCatalog.UNITY
dt = DeltaTable.from_data_catalog(data_catalog=data_catalog, data_catalog_id=catalog_name, database_name=schema_name, table_name=table_name)

but I get the following error:

OSError: Generic MicrosoftAzure error: Error performing token request: response error "request error", after 10 retries: error sending request for url (http://<SOME-IP-ADDRESS>/metadata/identity/oauth2/token?api-version=2019-08-01&resource=https%3A%2F%2Fstorage.azure.com): error trying to connect: tcp connect error: Connection refused (os error 111)

Stacktrace:

/home/vscode/.local/lib/python3.10/site-packages/deltalake/table.py:285 in from_data_catalog

  282             database_name=database_name,
  283             table_name=table_name,
  284         )
❱ 285         return cls(
  286             table_uri=table_uri, version=version, log_buffer_size=log_buffer_size
  287         )
  288

/home/vscode/.local/lib/python3.10/site-packages/deltalake/table.py:246 in __init__

  243
  244         """
  245         self._storage_options = storage_options
❱ 246         self._table = RawDeltaTable(
  247             str(table_uri),
  248             version=version,
  249             storage_options=storage_options,

What you expected to happen:

I wish I could read the Delta Table

More details:

  • I can read from the storage account where the data is located using other libraries in the same python interpreter so I don't think it's a firewall problem
  • The same host and token work perfectly fine in the same interpreter to read data from the same Unity Catalog table using databricks-connect, so the URL and token are valid

MigQ2 avatar Sep 14 '23 23:09 MigQ2

I wish I could read the Delta Table

:laughing: me too

The Unity support in delta-rs is young, I would say. I have access to a Unity environment, but not an Azure-specific Databricks + Unity environment. I'm honestly not sure where to start here. I assume the URL that was spit out to you is at a legitimate hostname that might otherwise respond to connections from wherever you are running this Python code?

rtyler avatar Sep 15 '23 07:09 rtyler

It looks like the current implementation works for storage location retrieval, but requires additional credentials for data access. (In addition to Azure, I also tried AWS: similar story.)

I suspect this could work if the application is running on a cloud VM with certain rights, but I didn't test that. (That <SOME-IP-ADDRESS> is 169.254.169.254, right? That's a special IP usually used on cloud VMs to retrieve instance metadata information. It gives a clue that credentials with sufficient rights are not available when the code tries to access the data, so it tries to obtain some via instance metadata.)
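To make the point above concrete, here is a minimal sketch (not part of delta-rs) that rebuilds the metadata URL from the error message and checks whether the metadata endpoint accepts TCP connections from the current machine. The helper names `imds_token_url` and `endpoint_reachable` are hypothetical; the host, path, and query parameters are copied from the error reported in this thread.

```python
import socket
from urllib.parse import urlencode

# Link-local address that cloud VMs use for instance metadata.
IMDS_HOST = "169.254.169.254"


def imds_token_url(
    resource: str = "https://storage.azure.com",
    api_version: str = "2019-08-01",
) -> str:
    """Rebuild the token URL that appears in the error message."""
    query = urlencode({"api-version": api_version, "resource": resource})
    return f"http://{IMDS_HOST}/metadata/identity/oauth2/token?{query}"


def endpoint_reachable(
    host: str = IMDS_HOST, port: int = 80, timeout: float = 2.0
) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", "network unreachable", and timeouts.
        return False


if __name__ == "__main__":
    print(imds_token_url())
    print("metadata endpoint reachable:", endpoint_reachable())
```

If `endpoint_reachable()` returns False, the code is most likely not running on a cloud VM with a managed identity, which would be consistent with the "Connection refused" error above.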

In addition to being a metadata provider, Unity on Databricks also acts as an access token provider so it can enforce ACLs, etc. Using the same pattern on local/non-Databricks compute would provide a similar experience, but I don't know if it's achievable at the moment (or ever will be).

A possible quick fix could be providing additional credentials that allow access to the storage managed by UC. For example, when I specify AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN in my environment variables for an AWS account that can read from that S3 location, it works on AWS. (Well, it resolves that error, but I then get "The table's minimum reader version is 2 but deltalake only supports up to version 1" when I call to_pyarrow_table. That's a different story.)

I guess this workaround may also work in Azure with the right secret/key/token/...
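A rough sketch of what that workaround might look like on Azure, assuming you already know the table's storage URI and hold a SAS token or account key that can read it. The helper names are hypothetical, and the environment variable names are a convention chosen for this example. The `storage_options` keys are the Azure option names I believe the underlying object store accepts, so treat them as an assumption and check them against your deltalake version.

```python
import os
from typing import Dict, Mapping, Optional


def azure_storage_options(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build a storage_options dict from environment-style settings.

    The variable names read here are an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    account = env.get("AZURE_STORAGE_ACCOUNT_NAME")
    if not account:
        raise ValueError("AZURE_STORAGE_ACCOUNT_NAME is not set")

    options = {"azure_storage_account_name": account}
    sas_token = env.get("AZURE_STORAGE_SAS_TOKEN")
    account_key = env.get("AZURE_STORAGE_ACCOUNT_KEY")
    if sas_token:
        # Prefer the narrower, short-lived credential when both are present.
        options["azure_storage_sas_token"] = sas_token
    elif account_key:
        options["azure_storage_account_key"] = account_key
    else:
        raise ValueError("set AZURE_STORAGE_SAS_TOKEN or AZURE_STORAGE_ACCOUNT_KEY")
    return options


def open_table(table_uri: str, env: Optional[Mapping[str, str]] = None):
    """Open a Delta table with explicit credentials instead of instance metadata."""
    # Imported lazily so the helper above can be used without deltalake installed.
    from deltalake import DeltaTable

    return DeltaTable(table_uri, storage_options=azure_storage_options(env))
```

Passing the credentials explicitly should keep the client from falling back to the instance metadata endpoint, although it bypasses the access control that Unity Catalog would otherwise enforce.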

r3stl355 avatar Oct 08 '23 19:10 r3stl355

Actually, this looks like expected behavior, as mentioned in https://github.com/delta-io/delta-rs/pull/1331#issuecomment-1581557227

r3stl355 avatar Oct 09 '23 08:10 r3stl355

@r3stl355 This is a topic I have recently discussed with @MrPowers and some of the Databricks team. I don't have a great solution to offer at the moment other than "we're working on figuring this out" :smile:

rtyler avatar Oct 09 '23 18:10 rtyler

@rtyler maybe you could include me in those future conversations, given I work for Databricks atm :grin:

r3stl355 avatar Oct 09 '23 21:10 r3stl355

Unity Catalog in my org is becoming a huge roadblock to using delta-rs more broadly, outside of internal team use. No one wants to provide read credentials to the storage anymore, which obliterates the use of delta-rs in this context. Besides the possible vendor lock-in 😄, it makes interoperability with Databricks less than ideal. Currently, for any data reads, we fall back to the databricks-sql connector.

ion-elgreco avatar Oct 10 '23 16:10 ion-elgreco

I have the same problem:

OSError: Generic MicrosoftAzure error: Error performing token request: response error "request error", after 10 retries: error sending request for url (http://169.254.169.254/metadata/identity/oauth2/token?api-version=2019-08-01&resource=https%3A%2F%2Fstorage.azure.com): error trying to connect: tcp connect error: A socket operation was attempted to an unreachable network. (os error 10051)

The 169.254.169.254 address is used to retrieve the authentication token: https://learn.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-to-use-vm-token#get-a-token-using-http

But I don't understand why this is needed, as the Databricks documentation says we need to get a short-lived token and a signed URL: image

davidvesp avatar Oct 11 '23 14:10 davidvesp

Interesting, so UC by design gives out a token to read the data from storage. Then this token should just be returned when you query the Databricks REST API's "get table" endpoint.
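As an illustration of that idea only, the sketch below turns a vended-credential response into `storage_options` for a Delta reader. The response shape (`aws_temp_credentials`, `azure_user_delegation_sas` and their fields) is an assumption based on my reading of the Unity Catalog temporary-credentials API, and `storage_options_from_vended` is a hypothetical helper, not something delta-rs provides. The Azure option names are likewise assumed.

```python
from typing import Any, Dict, Mapping


def storage_options_from_vended(
    creds: Mapping[str, Any], azure_account_name: str = ""
) -> Dict[str, str]:
    """Translate an (assumed) temporary-credentials response into storage options."""
    aws = creds.get("aws_temp_credentials")
    if aws:
        # Same three variables mentioned earlier in this thread for AWS.
        return {
            "AWS_ACCESS_KEY_ID": aws["access_key_id"],
            "AWS_SECRET_ACCESS_KEY": aws["secret_access_key"],
            "AWS_SESSION_TOKEN": aws["session_token"],
        }

    azure = creds.get("azure_user_delegation_sas")
    if azure:
        if not azure_account_name:
            raise ValueError("an Azure storage account name is required")
        return {
            "azure_storage_account_name": azure_account_name,
            "azure_storage_sas_token": azure["sas_token"],
        }

    raise ValueError("no supported credential found in the response")
```

Such credentials are short-lived, so a real client would also need to refresh them before they expire; that part is left out here.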

ion-elgreco avatar Oct 11 '23 15:10 ion-elgreco

Hi @ion-elgreco, is this a good time to address this again, now that Unity Catalog OSS version 0.2.0 is released with credential vending support? Does this make it easier/clearer to implement?

https://github.com/unitycatalog/unitycatalog/releases/tag/v0.2.0

tunayokumus avatar Nov 11 '24 10:11 tunayokumus

Sure, feel free to take a stab at it

ion-elgreco avatar Nov 11 '24 10:11 ion-elgreco

Well, this could indeed be a good excuse for me to learn Rust, so I might do that. But I'm not yet sure whether credential vending is the key enabler here. I recently saw the issue you created on the unitycatalog-python repo: https://github.com/unitycatalog/unitycatalog-python/issues/4. Is it related to this issue?

tunayokumus avatar Dec 08 '24 22:12 tunayokumus

This is supported now

ion-elgreco avatar Feb 22 '25 10:02 ion-elgreco

For the record, I'm still getting the Generic MicrosoftAzure error: Error performing token request on http://169.254.169.254 when trying to open a Delta Table on Azure Databricks without a token. This is with deltalake 0.25.2. Should I open a new issue?

astrojuanlu avatar Feb 28 '25 14:02 astrojuanlu

Go ahead yeah

ion-elgreco avatar Feb 28 '25 14:02 ion-elgreco