
Simple delta write in Fabric notebook failing with SSL error

Open bcdobbs opened this issue 2 years ago • 19 comments

Delta-rs version: Python 0.12.0
Cloud provider: Microsoft (UK South)
Environment: Fabric Notebook


Bug

What happened: When trying to write a pandas DataFrame to a Delta table in Microsoft Fabric, it fails with an SSL error:

OSError: Generic MicrosoftAzure error: response error "request error", after 10 retries: error sending request for url (https://onelake.blob.fabric.microsoft.com/xxx/yyy.Lakehouse/Tables/Test/_delta_log/_last_checkpoint): error trying to connect: error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:1889: (self-signed certificate)

How to reproduce it: Installed deltalake via Fabric library management and then ran the following in a notebook:

import pandas as pd
from deltalake.writer import write_deltalake
from trident_token_library_wrapper import PyTridentTokenLibrary

# Acquire an AAD token for OneLake storage from the Fabric token library
aadToken = PyTridentTokenLibrary.get_access_token("storage")

TablePath = "abfss://xxx@onelake.dfs.fabric.microsoft.com/yyy.Lakehouse/Tables/Test"

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

write_deltalake(TablePath, df, storage_options={"bearer_token": aadToken, "use_fabric_endpoint": "true"})

bcdobbs avatar Nov 04 '23 20:11 bcdobbs

@bcdobbs is there any chance that in the environment where the notebook is running, traffic goes through some appliance with SSL inspection (I don't know much about Fabric notebooks)? https://onelake.blob.fabric.microsoft.com/ has a valid SSL certificate but the error says it has a self-signed one which may happen if SSL inspection is used.

r3stl355 avatar Nov 05 '23 17:11 r3stl355

btw, an easy way to answer my question would be running something like this in the notebook.

import requests
requests.get('https://onelake.blob.fabric.microsoft.com/').content

If you get Healthy then my earlier theory is wrong; if you get an error, then it holds.

I was trying to replicate the issue on my side but I am getting a different error when I run from deltalake.writer import write_deltalake:

Error: /home/trusted-service-user/cluster-env/trident_env/lib/python3.10/site-packages/pyarrow/libarrow_acero.so.1200: undefined symbol: _ZN5arrow7compute4callESsSt6vectorINS0_10ExpressionESaIS2_EESt10shared_ptrINS0_15FunctionOptionsEE

r3stl355 avatar Nov 05 '23 18:11 r3stl355

Thanks @r3stl355, I'd assumed that there was some redirect going on, but your test returned Healthy. With regard to your error, how did you make the deltalake library available? I'd used the workspace library management GUI from workspace settings (https://learn.microsoft.com/en-us/fabric/data-science/python-guide/python-library-management); not sure if you can install libraries at a notebook level, still learning myself!

bcdobbs avatar Nov 05 '23 19:11 bcdobbs

@r3stl355 based on your suggestion I tried running:

import requests
from trident_token_library_wrapper import PyTridentTokenLibrary

aadToken = PyTridentTokenLibrary.get_access_token("storage")

headersAuth = {
    "Authorization": f"Bearer {aadToken}"
}
output = requests.get("https://onelake.blob.fabric.microsoft.com/xxx/yyy.Lakehouse/Tables", headers=headersAuth)
print(output.status_code)

I get a 200 status code which suggests it's authenticating. (If I remove the auth header it tells me there is an authentication issue.)

bcdobbs avatar Nov 05 '23 20:11 bcdobbs

OK @bcdobbs, ignore everything I wrote before 😁. This looks like a problem with the writer, because the Spark writer works and so does the direct API call (i.e. I can create a file under Files with a PUT). I'll carry on digging.

r3stl355 avatar Nov 05 '23 20:11 r3stl355

As for the other error I had: I installed deltalake with pip but then figured it doesn't work with the pyarrow version in the cluster. Can you please check which version of pyarrow you are running, e.g. with pip list?

r3stl355 avatar Nov 05 '23 20:11 r3stl355

Lastly, this is not just a writer but also a reader problem; I get the same error if I do DeltaTable("abfss://<ws-id>@onelake.dfs.fabric.microsoft.com/<lh-id>/Tables/test", storage_options={"bearer_token": aadToken, "use_fabric_endpoint": "true"})

r3stl355 avatar Nov 05 '23 20:11 r3stl355

pyarrow is 12.0.0.

bcdobbs avatar Nov 05 '23 20:11 bcdobbs

Hmm, strange, I had to lower the pyarrow version to avoid that other error I was getting. Actually, just re-installing v12.0.0 also works - maybe the environment comes with an incomplete install. Anyway, the _delta_log/_last_checkpoint path in the error does not actually exist; I wonder if that could be a cause of the problem (resulting in an incorrect message perhaps 🤷).

r3stl355 avatar Nov 05 '23 20:11 r3stl355

There is an issue with pyarrow: https://github.com/delta-io/delta-rs/pull/1743

djouallah avatar Nov 05 '23 22:11 djouallah

Hmm, ok, tried with deltalake 0.13 and got the same errors. I think the regression was introduced in the Fabric 1.2 runtime; for now it's better to use runtime 1.1, where it works fine.

djouallah avatar Nov 06 '23 00:11 djouallah

Thanks @r3stl355 and @djouallah, really appreciate your time. Indeed, reverting the Fabric runtime lets it work fine! Really excited, as I work for a group of schools so data volumes aren't huge and we're always looking for ways to keep compute costs low.

Much appreciated

Ben

bcdobbs avatar Nov 06 '23 07:11 bcdobbs

I think I got to the bottom of this. The issue is likely related to the way ADLS access is configured in Azure Fabric: although onelake.blob.fabric.microsoft.com resolves to a public IP in DNS, in the notebook there is an entry in /etc/hosts pointing it to the loopback IP 127.0.0.2, which serves a self-signed certificate. The same code that fails in a Fabric notebook works in other places (I tried on a local Mac and in Azure Web Terminal using the token issued in the Fabric notebook), so this is unlikely to be a delta-rs problem; it's more likely one for Microsoft to solve.
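A quick way to sanity-check that mapping from a notebook cell (an illustrative sketch using only the standard library; the hostname is the one from the error above):

import socket

# Resolve the OneLake blob endpoint; 127.0.0.2 here would confirm the /etc/hosts override
host = "onelake.blob.fabric.microsoft.com"
print(socket.gethostbyname(host))

# Show any /etc/hosts entries that pin OneLake hostnames
with open("/etc/hosts") as hosts_file:
    for line in hosts_file:
        if "onelake" in line:
            print(line.strip())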

Some extra supporting/interesting data:

  • Azure Web Terminal actually uses the same OS as the Fabric notebook: NAME="Common Base Linux Mariner", VERSION="2.0.20231004"

  • curl in the Fabric notebook works, but shows that the connection goes to the loopback IP 127.0.0.2, which means it may be using a self-signed certificate. (I have a valid table there named bad.)

> !curl -H "Authorization: Bearer $TOKEN" https://onelake.blob.fabric.microsoft.com/.../.../Tables/bad --verbose

* Connected to onelake.blob.fabric.microsoft.com (127.0.0.2) port 443
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/pki/tls/certs/ca-bundle.trust.crt
*  CApath: /etc/pki/ca-trust/extracted/openssl
...
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN: server accepted h2
* Server certificate:
*  subject: C=US; ST=Washington; L=Redmond; O=MicrosoftData; OU=SparkDepartment; [email protected]; CN=microsoft.com
*  start date: Nov  6 09:14:13 2023 GMT
*  expire date: Nov  5 09:14:13 2024 GMT
....
*  SSL certificate verify ok.
  • The same command in Web Terminal shows that it's using a public IP this time and a different certificate (e.g. different validity dates):
* Connected to onelake.blob.fabric.microsoft.com (20.50.0.27) port 443
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/pki/tls/certs/ca-bundle.trust.crt
*  CApath: /etc/ssl/certs
...
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN: server accepted h2
* Server certificate:
*  subject: C=US; ST=WA; L=Redmond; O=Microsoft Corporation; CN=westeurope.onelake.fabric.microsoft.com
*  start date: Oct  7 14:27:33 2023 GMT
*  expire date: Apr  4 14:27:33 2024 GMT
...
*  SSL certificate verify ok.
  • OpenSSL cert verification confirms the earlier theory about the self-signed cert:
  1. In the Fabric notebook:
> !openssl s_client -connect onelake.blob.fabric.microsoft.com:443 -showcerts

CONNECTED(00000003)
depth=0 C = US, ST = Washington, L = Redmond, O = MicrosoftData, OU = SparkDepartment, emailAddress = [email protected], CN = microsoft.com
verify error:num=18:self signed certificate
...
---
SSL handshake has read 1867 bytes and written 407 bytes
Verification error: self signed certificate
---
  2. In Web Terminal:
depth=1 C = US, O = Microsoft Corporation, CN = Microsoft Azure TLS Issuing CA 06
verify return:1
depth=0 C = US, ST = WA, L = Redmond, O = Microsoft Corporation, CN = westeurope.onelake.fabric.microsoft.com
verify return:1
...
---
SSL handshake has read 4564 bytes and written 799 bytes
Verification: OK
---
  • Using curl with IPs: it works in the Fabric notebook if I use the loopback IP (e.g. curl https://127.0.0.2/...) but fails with a certificate error if I use the public IP returned by nslookup (e.g. curl https://40.82.254.113/...). Using an IP directly in Azure Web Terminal does not work, as expected.

r3stl355 avatar Nov 06 '23 10:11 r3stl355

A shorter version of the answer: curl on Fabric runtime 1.1 seems to use a different CA file (/etc/ssl/certs/ca-certificates.crt) than runtime 1.2 (/etc/pki/tls/certs/ca-bundle.trust.crt), and the 1.2 bundle also has an extra suffix attached to the certificate value. The /etc/ssl/certs/ca-certificates.crt file is still there on runtime 1.2, but it does not contain the certificate used by the endpoint.

Maybe openssl is trying to use /etc/ssl/certs/ca-certificates.crt, or maybe it is unable to properly find the cert in /etc/pki/tls/certs/ca-bundle.trust.crt because of that extra suffix.
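For reference, one quick way to see which CA locations the OpenSSL build on the runtime defaults to, and which environment variables override them, is the standard-library call below (a sketch; it only reflects Python's OpenSSL defaults rather than anything delta-rs does internally, but SSL_CERT_FILE / SSL_CERT_DIR are exactly the variables the workarounds below lean on):

import ssl

# Prints the compiled-in default cafile/capath plus the names of the env vars
# that override them (typically SSL_CERT_FILE and SSL_CERT_DIR)
print(ssl.get_default_verify_paths())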

r3stl355 avatar Nov 06 '23 12:11 r3stl355

And lastly, here is a really ugly solution if you are still keen on trying runtime 1.2:

  1. Run !openssl s_client -connect onelake.blob.fabric.microsoft.com:443 to get the certificate.
  2. Copy the certificate value between -----BEGIN CERTIFICATE----- and -----END CERTIFICATE----- and write it out to a local file, e.g.
cert = """-----BEGIN CERTIFICATE-----
MIIFGzCCBAOgAwIBAgIUFO5FzvkmKVoyIlO8gQM8vkcNJ0kwDQYJKoZIhvcNAQEL
BQAwgZ8xCzAJBgNVBAYTAlVTMRMwEQYDVQQIDApXYXNoaW5ndG9uMRAwDgYDVQQH
<rest of the cert value here, shortened for brevity>
-----END CERTIFICATE-----
"""
with open("ca.cert", "w") as out:
    out.write(cert)
  3. Export the created file name into an env var and things should work, e.g.
os.environ["SSL_CERT_FILE"] = "./ca.cert"

workspace_id = <your workspace id here>
lakehouse_id = <your lakehouse id here>
dt = DeltaTable(f"abfss://{workspace_id}@onelake.dfs.fabric.microsoft.com/{lakehouse_id}/Tables/bad", storage_options={"bearer_token": aadToken, "use_fabric_endpoint": "true"})
print(dt.version())

With this, you may actually consider closing this ticket; this is not the place for it to be resolved, imo.

r3stl355 avatar Nov 06 '23 15:11 r3stl355

If you have the option, try to reach out to the Microsoft Fabric product team directly to flag the regression.

ion-elgreco avatar Nov 06 '23 15:11 ion-elgreco

Thanks all, will reach out to Microsoft.

bcdobbs avatar Nov 06 '23 16:11 bcdobbs

please try:

os.environ["SSL_CERT_DIR"] = "/etc/pki/ca-trust/extracted/openssl:/opt/olcclient"

The Microsoft Fabric OneLake team is fixing it.
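A minimal sketch of how that suggestion would be applied in a notebook cell (the directory list comes from the line above; the workspace/lakehouse ids are placeholders, and the follow-up below reports mixed results with it):

import os
from deltalake import DeltaTable
from trident_token_library_wrapper import PyTridentTokenLibrary

# Set the extra CA directories before any deltalake call so they are picked up when TLS is initialised
os.environ["SSL_CERT_DIR"] = "/etc/pki/ca-trust/extracted/openssl:/opt/olcclient"

aadToken = PyTridentTokenLibrary.get_access_token("storage")
dt = DeltaTable(
    "abfss://<workspace-id>@onelake.dfs.fabric.microsoft.com/<lakehouse-id>/Tables/Test",
    storage_options={"bearer_token": aadToken, "use_fabric_endpoint": "true"},
)
print(dt.version())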

RobinLin666 avatar Nov 14 '23 05:11 RobinLin666

Maybe it's good to close this, since the issue is caused by Fabric.

ion-elgreco avatar Nov 22 '23 17:11 ion-elgreco

@RobinLin666 Could you give a link to the bug report with the MS Fabric OneLake team that I could follow, regarding the self-signed certificate problem, please? Or is it all just back channels?

re:

os.environ["SSL_CERT_DIR"] = "/etc/pki/ca-trust/extracted/openssl:/opt/olcclient"

This does not seem to work. I'm currently using a variation of the ugly solution suggested by @r3stl355:

import os

if not os.path.exists("onelake_cert.crt"):
    os.system("openssl s_client -showcerts -connect onelake.blob.fabric.microsoft.com:443 | awk '/BEGIN CERTIFICATE/,/END CERTIFICATE/' >> onelake_cert.crt")
# Point the TLS stack at the extracted certificate (set this even if the file already exists)
os.environ["SSL_CERT_FILE"] = "./onelake_cert.crt"

Hopefully MS will come through with a solution soon. Along with the other Delta table write issue, deltalake and polars currently have severely limited usability in the Fabric environment, which is a pity since I love both.

martroben avatar Feb 14 '24 11:02 martroben