MSSQL ingestion fails due to missing pyodbc on Kubernetes with Helm charts

Open mvandeborne61 opened this issue 2 years ago • 6 comments

Describe the bug I run DataHub on Kubernetes, deployed with the Helm charts. Ingestions are performed by the acryl-datahub-actions pod, which uses the acryldata/datahub-actions image. That image lacks both the pyodbc package and the Microsoft ODBC Driver for SQL Server.

I built a new image on top of acryldata/datahub-actions to install the missing dependencies, using this Dockerfile:

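FROM acryldata/datahub-actions:head

# Switch to root to install system packages
USER 0
# Register Microsoft's package signing key and the Debian 11 package repository
RUN curl https://packages.microsoft.com/keys/microsoft.asc | tee /etc/apt/trusted.gpg.d/microsoft.asc
RUN curl https://packages.microsoft.com/config/debian/11/prod.list | tee /etc/apt/sources.list.d/mssql-release.list
# Install the Microsoft ODBC Driver 17 for SQL Server and the unixODBC headers
RUN apt-get update && \
    ACCEPT_EULA=Y apt-get install -y msodbcsql17 unixodbc-dev
# Install pyodbc and the DataHub MSSQL plugin into the image's Python environment
RUN pip install pyodbc
RUN pip install 'acryl-datahub[mssql]'
# Drop back to the unprivileged user
USER datahub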

However, at runtime I still get the error "No module named 'pyodbc'". When an ingestion starts, it creates a Python venv and pip-installs its dependencies, but not pyodbc, into /tmp/datahub/ingest/venv-mssql-94b47a253025c09f/lib/python3.10/site-packages/. That path is unpredictable and does not exist when the image is built, because the venv has not been created yet at build time.
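
One way to confirm this diagnosis is to check, from inside the interpreter that actually runs the ingestion, whether it is a venv and whether pyodbc is visible to it. The snippet below is a minimal sketch using only the standard library; the helper name `describe_interpreter` is made up for illustration and is not part of DataHub.

```python
import importlib.util
import sys


def describe_interpreter(module_name: str = "pyodbc") -> dict:
    """Report which environment is active and whether a module is importable from it."""
    # Inside a venv, sys.prefix points at the venv while sys.base_prefix
    # still points at the interpreter the venv was created from.
    in_venv = sys.prefix != sys.base_prefix
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "in_venv": in_venv,
        "module": module_name,
        # find_spec returns None when a top-level module cannot be found.
        "importable": importlib.util.find_spec(module_name) is not None,
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

Running it with the image's own Python and again with the Python inside the generated venv should show whether pyodbc is missing only in the venv.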

I have 3 questions:

  1. How can I get pyodbc installed in the venv?
  2. I cannot believe that rebuilding a Docker image is the only way to add plugins or missing dependencies. There should be a way to supply missing dependencies through environment variables (a list of packages to pip install), a sidecar, a volume, an init container, or something similar, right?
  3. I see that people tend to use the datahub-ingestion image, whereas in my Helm release it is the datahub-actions pod that runs the ingestions. Am I using the right image? If not, the Kubernetes deployment doc should be updated. There is a datahub-ingestion-cron sub-chart I could use, but it is a CronJob, so I assume it won't run ingestions on demand.

Expected behavior An mssql source could be ingested with all the necessary dependencies pre-installed.

mvandeborne61 avatar Nov 23 '23 16:11 mvandeborne61

@mvandeborne61, ingestions are executed by /usr/local/bin/ingestion_common.sh in virtual environments, which do not use system packages. To make your custom image work, you need to replace the script with a version that creates the venv using python3 -m venv --system-site-packages $venv_dir.
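
To illustrate what the flag changes, here is a small sketch that uses Python's standard `venv` module (the programmatic equivalent of `python3 -m venv`) to create a throwaway venv with and without system site packages, then reads the setting back from `pyvenv.cfg`. It does not touch ingestion_common.sh, and the helper name is hypothetical.

```python
import tempfile
import venv
from pathlib import Path


def includes_system_site_packages(system_site_packages: bool) -> str:
    """Create a throwaway venv and return its include-system-site-packages setting."""
    with tempfile.TemporaryDirectory() as tmp:
        venv_dir = Path(tmp) / "venv"
        # with_pip=False keeps this fast and offline. Roughly equivalent CLI:
        #   python3 -m venv [--system-site-packages] --without-pip <venv_dir>
        builder = venv.EnvBuilder(
            system_site_packages=system_site_packages, with_pip=False
        )
        builder.create(venv_dir)
        # pyvenv.cfg records whether the venv may see system-wide packages.
        for line in (venv_dir / "pyvenv.cfg").read_text().splitlines():
            key, _, value = line.partition("=")
            if key.strip() == "include-system-site-packages":
                return value.strip()
    return "unknown"


if __name__ == "__main__":
    print("default venv:", includes_system_site_packages(False))
    print("with flag:   ", includes_system_site_packages(True))
```

Only a venv created with the flag can import packages, such as pyodbc, that were installed into the image's system Python.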

vrychkov-repay avatar Nov 27 '23 20:11 vrychkov-repay

@mvandeborne61, if you go the system-site-packages way, be aware that the official DataHub Helm chart may reference an outdated actions image, whose datahub packages may be older or newer than those in the rest of the deployment. This is the case with the latest charts, 0.3.11/0.3.12, which install datahub v0.12.0 across all containers except actions, which ships with datahub v0.10.4.2. That is a non-issue if ingestion is executed in isolated venvs. With system-site-packages, however, it makes sense to keep the datahub packages of the custom actions image in sync with the deployment-wide datahub version, because the datahub packages are then reused across venvs instead of being installed in each one. You can find the right image at https://github.com/acryldata/datahub-actions or build your own. For example, the 0.3.11/0.3.12 Helm charts can be parameterized with the latest actions tag, v0.0.14, which has datahub v0.12.0.
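
A quick way to check for such a mismatch is to ask the image's Python which version of the package is installed. The sketch below assumes the distribution name is `acryl-datahub` (as in the `pip install` line of the Dockerfile above) and that the expected version string is written like "0.12.0"; both function names are hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "acryl-datahub") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_expected(expected: str, package: str = "acryl-datahub") -> bool:
    """True only when the package is installed at exactly the expected version."""
    return installed_version(package) == expected


if __name__ == "__main__":
    # "0.12.0" is the deployment-wide version mentioned above; adjust as needed.
    print("installed:", installed_version())
    print("in sync:  ", matches_expected("0.12.0"))
```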

vrychkov-repay avatar Nov 28 '23 00:11 vrychkov-repay

@vrychkov-repay, thank you for your response. Two quick questions:

  1. I couldn't find the script ingestion_common.sh anywhere on GitHub. If I need to modify it, could you tell me where it is?
  2. I find it strange to have to modify a Docker image and create a custom one to support something as common as SQL Server. Wouldn't it make more sense to just add pyodbc and the Microsoft ODBC Driver for SQL Server to the list of dependencies installed in the official images?

mvandeborne61 avatar Nov 28 '23 08:11 mvandeborne61

@mvandeborne61,

  1. As far as I can see, the script is generated at Docker build time. You can take the file from the base image of your custom image. Don't forget to sync your copy after any upgrade (in my experience, the file hasn't been changed for a while).
  2. I guess it comes down to the choice between free and proprietary drivers.

vrychkov-repay avatar Nov 28 '23 09:11 vrychkov-repay

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] avatar Dec 29 '23 01:12 github-actions[bot]

You can use the mssql-odbc source type to have the pyodbc package installed by default. However, you may still need a custom image based on datahub-actions to install the underlying ODBC driver and accept the EULA.
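
As a rough sketch of what such a recipe might look like, the snippet below builds one as a Python dict and prints it as JSON. Only the `mssql-odbc` source type comes from this thread; the config keys (`host_port`, `use_odbc`, `uri_args`, and so on), the `${...}` placeholders, and the sink URL are assumptions that should be checked against the MSSQL source documentation for your DataHub version. The driver name matches the msodbcsql17 package installed in the Dockerfile above.

```python
import json


def build_mssql_odbc_recipe(host_port: str, database: str) -> dict:
    """Build an illustrative ingestion recipe; key names are assumptions."""
    return {
        "source": {
            "type": "mssql-odbc",  # source type suggested in this thread
            "config": {
                "host_port": host_port,
                "database": database,
                # Placeholders, assumed to be resolved from the environment.
                "username": "${MSSQL_USERNAME}",
                "password": "${MSSQL_PASSWORD}",
                # Assumed options that route the connection through pyodbc.
                "use_odbc": True,
                "uri_args": {"driver": "ODBC Driver 17 for SQL Server"},
            },
        },
        "sink": {
            "type": "datahub-rest",
            # Hypothetical in-cluster GMS address.
            "config": {"server": "http://datahub-gms:8080"},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_mssql_odbc_recipe("sqlserver:1433", "mydb"), indent=2))
```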

Additionally, UI ingestion now supports setting custom dependencies and environment variables as needed.

hsheth2 avatar Feb 20 '24 23:02 hsheth2

This issue was closed because it has been inactive for 30 days since being marked as stale.

github-actions[bot] avatar Apr 22 '24 01:04 github-actions[bot]