azure-workload-identity icon indicating copy to clipboard operation
azure-workload-identity copied to clipboard

Sidecar doesn't work properly for Microsoft ODBC Driver for SQL Server

Open thomasfrederikhoeck opened this issue 3 years ago • 15 comments

Describe the bug I have a container that connect to a Azure Database using a User Managed Identity. It previously worked when using AAD Pod identity. I have tried enabling the sidecar (as I think the driver uses ADAL) and I can see in the logs that it make some calls but it somehow doesn't work when it gets back to the container.

Steps To Reproduce

1: Create an appropriate image with ODBC driver and python:

FROM python:3.9.15-buster

RUN curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add - \
&& curl https://packages.microsoft.com/config/debian/10/prod.list > /etc/apt/sources.list.d/mssql-release.list
RUN apt-get update && \
ACCEPT_EULA=Y apt-get install -y \
        msodbcsql17 \
        unixodbc-dev \
        mssql-tools 


ENV PATH="/opt/mssql-tools/bin:${PATH}"

RUN pip install sqlalchemy pyodbc
COPY script.py /script.py

where the script.py is:

from sqlalchemy import create_engine
import urllib
import os

server= os.getenv("SERVER")
uid = os.getenv("UID")
raw_params = (
"DRIVER={ODBC Driver 17 for SQL Server};"
f"SERVER={server};"
"Encrypt=yes;"
"TrustServerCertificate=no;"
"Connection Timeout=30;"
f"UID={uid};"
"Authentication=ActiveDirectoryMsi"
)
params = urllib.parse.quote_plus(raw_params)
connection_string = f"mssql+pyodbc:///?odbc_connect={params}"
engine = create_engine(connection_string)
engine.execute("select 1")

2: deploy to cluster with following yaml:

apiVersion: v1
kind: Pod
metadata:
  name: awli-test
  annotations:
    azure.workload.identity/inject-proxy-sidecar: "true"
spec:
  serviceAccountName: my-workload-identity
  containers:
  - args:
    - /bin/bash
    - -c
    - --
    - while true; do sleep 30; done;
    env:
      - name: SERVER
        value: URL_FOR_AZURE_SQL_SERVER
      - name: UID
        value: CLIENT_ID_FOR_USER_MANAGED_IDENITITY
    image: test-image:latest
    imagePullPolicy: IfNotPresent
    name: main
    resources:
      limits:
        cpu: "2"
        memory: 10G
      requests:
        cpu: 500m
        memory: 5G

Where URL_FOR_AZURE_SQL_SERVER, CLIENT_ID_FOR_USER_MANAGED_IDENITITY and test-image:latest are changed accordingly.

3: See logs before running script by running

~ kubectl logs  awli-test -c azwi-proxy
I1017 10:02:24.612091       1 proxy.go:96] proxy "msg"="starting the proxy server"  "port"=8000 "userAgent"="azure-workload-identity/proxy/v0.13.0 (linux/amd64) 46c5137/2022-08-31-23:38"
I1017 10:02:24.653617       1 proxy.go:167] proxy "msg"="received readyz request"  "method"="GET" "uri"="/readyz"
  1. Exec into container, run script and wait for failure:
~ kubectl exec --stdin --tty awli-test -- /bin/bash
~ python script.py
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3361, in _wrap_pool_connect
    return fn()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 320, in connect
    return _ConnectionFairy._checkout(self)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 884, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 486, in checkout
    rec = pool._do_get()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/impl.py", line 146, in _do_get
    self._dec_overflow()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/impl.py", line 143, in _do_get
    return self._create_connection()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 266, in _create_connection
    return _ConnectionRecord(self)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 381, in __init__
    self.__connect()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 678, in __connect
    pool.logger.debug("Error on connect(): %s", e)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 673, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/create.py", line 578, in connect
    return dialect.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 598, in connect
    return self.dbapi.connect(*cargs, **cparams)
pyodbc.OperationalError: ('HYT00', '[HYT00] [unixODBC][Microsoft][ODBC Driver 17 for SQL Server]Login timeout expired (0) (SQLDriverConnect)')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "//script.py", line 19, in <module>
    engine.execute("select 1")
  File "<string>", line 2, in execute
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/deprecations.py", line 402, in warned
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3256, in execute
    connection = self.connect(close_with_result=True)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3315, in connect
    return self._connection_cls(self, close_with_result=close_with_result)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 96, in __init__
    else engine.raw_connection()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3394, in raw_connection
    return self._wrap_pool_connect(self.pool.connect, _connection)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3364, in _wrap_pool_connect
    Connection._handle_dbapi_exception_noconnection(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2198, in _handle_dbapi_exception_noconnection
    util.raise_(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3361, in _wrap_pool_connect
    return fn()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 320, in connect
    return _ConnectionFairy._checkout(self)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 884, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 486, in checkout
    rec = pool._do_get()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/impl.py", line 146, in _do_get
    self._dec_overflow()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/impl.py", line 143, in _do_get
    return self._create_connection()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 266, in _create_connection
    return _ConnectionRecord(self)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 381, in __init__
    self.__connect()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 678, in __connect
    pool.logger.debug("Error on connect(): %s", e)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 673, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/create.py", line 578, in connect
    return dialect.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 598, in connect
    return self.dbapi.connect(*cargs, **cparams)
sqlalchemy.exc.OperationalError: (pyodbc.OperationalError) ('HYT00', '[HYT00] [unixODBC][Microsoft][ODBC Driver 17 for SQL Server]Login timeout expired (0) (SQLDriverConnect)')
  1. Observe that in the logs from the azwi-proxy that calls where made to get a token. (I have changed the CLIENT_ID to something random)
kubectl logs -n prefect awli-test -c azwi-proxy
I1017 10:02:24.612091       1 proxy.go:96] proxy "msg"="starting the proxy server"  "port"=8000 "userAgent"="azure-workload-identity/proxy/v0.13.0 (linux/amd64) 46c5137/2022-08-31-23:38"
I1017 10:02:24.653617       1 proxy.go:167] proxy "msg"="received readyz request"  "method"="GET" "uri"="/readyz"
I1017 10:03:14.931967       1 proxy.go:101] proxy "msg"="received token request"  "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fdatabase.windows.net%2F&object_id=h869042n-6f42-5jg5-5g2f-48956645645f2"
I1017 10:03:15.432570       1 proxy.go:101] proxy "msg"="received token request"  "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fdatabase.windows.net%2F&object_id=h869042n-6f42-5jg5-5g2f-48956645645f"
I1017 10:03:16.433120       1 proxy.go:101] proxy "msg"="received token request"  "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fdatabase.windows.net%2F&object_id=h869042n-6f42-5jg5-5g2f-48956645645f"
I1017 10:03:18.433681       1 proxy.go:101] proxy "msg"="received token request"  "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fdatabase.windows.net%2F&object_id=h869042n-6f42-5jg5-5g2f-48956645645f"
I1017 10:03:33.455898       1 proxy.go:101] proxy "msg"="received token request"  "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fdatabase.windows.net%2F&object_id=h869042n-6f42-5jg5-5g2f-48956645645f"
I1017 10:03:33.956334       1 proxy.go:101] proxy "msg"="received token request"  "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fdatabase.windows.net%2F&object_id=h869042n-6f42-5jg5-5g2f-48956645645f"
I1017 10:03:34.956857       1 proxy.go:101] proxy "msg"="received token request"  "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fdatabase.windows.net%2F&object_id=h869042n-6f42-5jg5-5g2f-48956645645f"
I1017 10:03:36.957431       1 proxy.go:101] proxy "msg"="received token request"  "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fdatabase.windows.net%2F&object_id=h869042n-6f42-5jg5-5g2f-48956645645f"

Expected behavior That the SQL query would have worked like it did for AAD Pod Identity

Logs See above

Environment

kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:48:33Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.15", GitCommit:"1d79bc3bcccfba7466c44cc2055d6e7442e140ea", GitTreeState:"clean", BuildDate:"2022-09-21T21:45:46Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

Additional context

thomasfrederikhoeck avatar Oct 17 '22 10:10 thomasfrederikhoeck

https://github.com/Azure/azure-workload-identity/pull/594 should be fix for this. I'll test this scenario to confirm.

aramase avatar Oct 17 '22 17:10 aramase

@thomasfrederikhoeck the issue should be fixed with #594. Would you be able to try out aramase/proxy:v0.14.0-alpha.0 for the proxy?

aramase avatar Oct 17 '22 22:10 aramase

Thank you for the quick response and fix @aramase ! The fix still requires sidecar to be used, right? :-) I'm not 100% how to test it as we have deployed the workload identity with az aks update -g <rg_name>-n <cluster_name>--enable-oidc-issuer --enable-workload-identity ? Also, when this is eventually merged will our cluster automatically use the fix because we deployed this way?

thomasfrederikhoeck avatar Oct 18 '22 07:10 thomasfrederikhoeck

@thomasfrederikhoeck The image I shared above is for the sidecar. You can create a yaml similar to this with your application instead of adding the annotation to test if it works for your workload.

Also, when this is eventually merged will our cluster automatically use the fix because we deployed this way?

The fix is merged and will be part of next OSS release. After OSS release, AKS will use the latest release for the add-on and rollout to all clusters.

aramase avatar Oct 18 '22 17:10 aramase

@aramase I didn't fix it for me :-/ I still see the same behavior when using mcr.microsoft.com/oss/azure/workload-identity/proxy-init:v0.13.0 with aramase/proxy:v0.14.0-alpha.0 when I apply:

apiVersion: v1
kind: Pod
metadata:
  name: awli-test
spec:
  serviceAccountName: my-workload-identity
  initContainers:
  - name: init-networking
    image: mcr.microsoft.com/oss/azure/workload-identity/proxy-init:v0.13.0
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
        drop:
        - ALL
      privileged: true
      runAsUser: 0
    env:
    - name: PROXY_PORT
      value: "8000"
  containers:
  - name: proxy
    image: aramase/proxy:v0.14.0-alpha.0
    ports:
    - containerPort: 8000
    resources:
      limits:
        cpu: "2"
        memory: 10G
      requests:
        cpu: 500m
        memory: 2G
  - name: main
    args:
    - /bin/bash
    - -c
    - --
    - while true; do sleep 30; done;
    env:
      - name: SERVER
        value: $SERVER
      - name: UID
        value: $CLIENT_ID
    image: my-image
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        cpu: "2"
        memory: 10G
      requests:
        cpu: 500m
        memory: 5G

I know it is not a networking issue as it worked for both AAD Pod identity and for Service Principal. Can you get it to work if you use the image that i provided?

thomasfrederikhoeck avatar Oct 19 '22 08:10 thomasfrederikhoeck

I know it is not a networking issue as it worked for both AAD Pod identity and for Service Principal. Can you get it to work if you use the image that i provided?

@thomasfrederikhoeck for me the sample image doesn't work even without workload/pod identity. I just tried assigning the identity manually to the VMSS and deployed the pod which still fails with the same error.

aramase avatar Oct 19 '22 21:10 aramase

@aramase weird! The above worked for me both in AAD Pod identity and it also currently works fine in Azure Functions with docker image and user-managed identity. Do you have a cluster with AAD pod identity where you can see if it works for you?

thomasfrederikhoeck avatar Oct 20 '22 08:10 thomasfrederikhoeck

@aramase I just SSH'ed into one of our Azure Functions and ran the above python snip-it with success with an appropriate client_id and server address.

The driver version is:

~ odbcinst -q -d -n "ODBC Driver 17 for SQL Server"
[ODBC Driver 17 for SQL Server]
Description=Microsoft ODBC Driver 17 for SQL Server
Driver=/opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.10.so.1.1
UsageCount=1

So the above code works with User-Managed Identity in Azure Functions. :-) I it worked with AAD Pod Identity.

thomasfrederikhoeck avatar Oct 20 '22 09:10 thomasfrederikhoeck

@aramase do you have an env where you can replicate? Or still no luck?

thomasfrederikhoeck avatar Oct 25 '22 10:10 thomasfrederikhoeck

@aramase do you have an env where you can replicate? Or still no luck?

Still no luck. I'll try again later this week.

aramase avatar Oct 26 '22 22:10 aramase

Hi @aramase :-) Have you had more time to look at the issue?

thomasfrederikhoeck avatar Nov 15 '22 08:11 thomasfrederikhoeck

Hi @aramase :-) Have you had more time to look at the issue?

@thomasfrederikhoeck unfortunately, not yet! I merged a few more changes for the proxy. Could you try our nightly build ghcr.io/azure/azure-workload-identity/proxy:latest from the main branch?

aramase avatar Nov 15 '22 17:11 aramase

Hi @aramase, just FYI, proxy implementation also break java mssql-jdbc drivers because they expect expire_in field to be string, but it is a number - https://github.com/Azure/azure-workload-identity/blob/d7b71919d06cd963a77b90f0ca4a1ab6d4b83025/pkg/proxy/proxy.go#L59

There is parsing logic based on it, see https://github.com/microsoft/mssql-jdbc/blob/e4e783a0eb13d945b651f400f869cd1edead8f17/src/main/java/com/microsoft/sqlserver/jdbc/SQLServerSecurityUtility.java#L427 This applies to any version before latest, 12 (there was some serious refactoring and they replaced usage of expires_in with expires_at, which is unchanged in proxy)

Could this be fixed? Otherwise it ruins pretty much everything.

vladkun avatar Feb 15 '23 12:02 vladkun

@aramase I'm still seeing this error. I tried deploying a new AKS cluster using the Workload Identity in GA. I tried both newest SQL Server ODBC 17 Driver and SQL ODBC 18 Driver and run the following code, but it still doesn't connect. I have also confimred that it is not a firewall issues since it works with a Service Principal. I have also checked that the Workload Identity works for other Azure resources such as a Key Vault. The application looks like this:

import sqlalchemy
import urllib
import os

import pandas as pd

server = "MY_AZURE_SQL_SERVER"
database = "MY_AZURE_SQL_DATABASE"
driver = "ODBC Driver 18 for SQL Server"
client_id_msi = "MY ID"

connect_params = [
        f"DRIVER={{{driver}}}",
        f"SERVER={{{server}}}",
        f"DATABASE={{{database}}}",
        "Encrypt=yes",
        "TrustServerCertificate=no",
        "Connection Timeout=30",
        f"UID={{{client_id_msi}}}", 
        "Authentication=ActiveDirectoryMsi"
    ]

params = urllib.parse.quote_plus(";".join(connect_params))
connection_string = f"mssql+pyodbc:///?odbc_connect={params}"

engine = sqlalchemy.create_engine(
        connection_string,
        fast_executemany=True,
)

pd.read_sql("select 1", engine)

It looks like there is also people at Microsoft hitting this issue: https://techcommunity.microsoft.com/t5/azure-database-support-blog/lesson-learned-384-odbc-driver-not-supporting-aks-workload/ba-p/3858209

thomasfrederikhoeck avatar Aug 24 '23 18:08 thomasfrederikhoeck

Hi, any updates on this? I'm also having issues with ODBC driver for workload identity. I was able to connect using ADO/jdbc connections strings, but having below error when using odbc. [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Login failed for user 'token-identified principal'.

tianqi-z avatar Oct 24 '23 17:10 tianqi-z