
DataHub requires internet access for ingestion to work

kha84 opened this issue 7 months ago · 2 comments

Describe the bug

After a successful installation of DataHub on a secured machine without internet access, ingestion fails because it attempts to download packages from https://pypi.python.org/simple/wheel/
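As a possible stopgap for air-gapped setups: the venv creation in the log below looks like standard Python packaging tooling (the output resembles uv), so if your network has an internal PyPI mirror, redirecting the resolver to it may be enough. This is only a sketch under the assumption that the executor process inherits the standard pip/uv index environment variables; the variable names are real pip/uv settings, but whether DataHub's executor honors them is not verified here, and the mirror hostname is a placeholder.

```bash
# Assumption: the ingestion executor inherits these standard environment
# variables. PIP_INDEX_URL is read by pip; UV_INDEX_URL by uv.
# pypi.internal.example.com is a hypothetical internal mirror.
export PIP_INDEX_URL=https://pypi.internal.example.com/simple/
export UV_INDEX_URL=https://pypi.internal.example.com/simple/

# If DataHub runs via docker compose (as in the quickstart), the variables
# must instead be set on the container that runs ingestion, e.g. in the
# compose file's environment section for the executor/actions service.
```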

To Reproduce

Steps to reproduce the behavior:

  1. Install a new instance of DataHub on a machine by following the quickstart guide https://docs.datahub.com/docs/quickstart
  2. Turn off internet access on that machine
  3. Log in to DataHub as admin
  4. Go to Ingestion -> Create new source -> select Postgres (my specific example) -> enter any values for host / port / user / password / database name / datasource name
  5. Click Save & run ingestion
  6. See that the ingestion process for this new data source starts, runs for some time, and then fails
  7. Click on "Details" and see in the "Logs" section that it tried to create a Python venv, access pypi.org, and then failed:
~~~~ Execution Summary - RUN_INGEST ~~~~
Execution finished with errors.
{'exec_id': '8bb47f33-cddb-4db7-9369-edb2faddd142',
 'infos': ['2025-05-15 14:15:14.802052 INFO: Starting execution for task with name=RUN_INGEST',
           "2025-05-15 14:17:16.043186 INFO: Failed to execute 'datahub ingest', exit code 2",
           '2025-05-15 14:17:16.043688 INFO: Caught exception EXECUTING task_id=8bb47f33-cddb-4db7-9369-edb2faddd142, name=RUN_INGEST, '
           'stacktrace=Traceback (most recent call last):\n'
           '  File "/home/datahub/.venv/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 139, in execute_task\n'
           '    task_event_loop.run_until_complete(task_future)\n'
           '  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
           '    return future.result()\n'
           '  File "/home/datahub/.venv/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 402, in execute\n'
           '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
           "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
 'errors': []}

~~~~ Ingestion Logs ~~~~
Obtaining venv creation lock...
Acquired venv creation lock
venv doesn't exist.. minting..
Using CPython 3.10.17 interpreter at: /usr/bin/python
Creating virtual environment at: /tmp/datahub/ingest/venv-postgres-f9103e0adae041e3
Using Python 3.10.17 environment at: /tmp/datahub/ingest/venv-postgres-f9103e0adae041e3
error: Failed to fetch: `https://pypi.python.org/simple/wheel/`
  Caused by: Request failed after 3 retries
  Caused by: error sending request for url (https://pypi.python.org/simple/wheel/)
  Caused by: operation timed out
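Reading the log: the venv is minted with uv, which then tries to resolve `wheel` from pypi.python.org and times out. One way to satisfy that resolution without general internet access is to pre-download the needed wheels on a connected machine and serve them inside the secured network. A minimal sketch, assuming the executor can be pointed at the local index as in the earlier snippet; the package list here is illustrative, not exhaustive, and the exact dependency set for your DataHub version may differ.

```bash
# On a machine WITH internet access: download wheels for the ingestion
# plugin(s) you need, plus the build tooling the executor resolves
# ('acryl-datahub[postgres]' is the extra for the Postgres source).
pip download 'acryl-datahub[postgres]' wheel pip setuptools -d ./wheelhouse

# Copy ./wheelhouse to the secured machine, then serve it there.
# pip (and, as far as I know, uv) can scrape links from a plain directory
# listing when configured with --find-links, so a simple HTTP server is
# enough for testing:
cd wheelhouse && python -m http.server 8080
```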

Expected behavior

After installation, DataHub features should work out of the box, without depending on downloading additional packages from the internet.

kha84 · May 15 '25 13:05

In my case, I had to build everything on a local machine with internet access, installing all libraries and dependencies, then pack it as a tar archive and load it onto the server (where internet is blocked).

NoVeTe36 · May 16 '25 07:05
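For reference, a sketch of the pack-and-ship approach NoVeTe36 describes, with the usual caveat that Python venvs are not relocatable: the OS, Python version, and absolute path on the secured server must match the build machine. The paths and the plugin extra are examples, not anything DataHub documents.

```bash
# On the internet-connected build machine (same OS + Python as the server):
python3.10 -m venv /tmp/datahub/ingest/venv-postgres-build
/tmp/datahub/ingest/venv-postgres-build/bin/pip install 'acryl-datahub[postgres]'

# Pack it up, preserving the absolute path layout, and move it over:
tar czf venv-postgres.tar.gz -C / tmp/datahub/ingest/venv-postgres-build

# On the secured server:
tar xzf venv-postgres.tar.gz -C /
```

Note that the executor in the logs above names its venvs with a per-source hash suffix (venv-postgres-f9103e0adae041e3), so whether a pre-built venv is actually picked up for a new source is exactly the open question raised in the next comment.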

@NoVeTe36 yeah, something similar is what I did as well: I created a new "data source" on a machine with internet access and then moved the installation to a secured server. But there's a catch: if you later need to add additional "data sources" on that secured server (even of the same type as you already had), DataHub still attempts to create a brand-new Python venv and access the internet to download packages.

kha84 · May 17 '25 09:05
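Since each new source re-triggers venv creation, a more durable fix on a secured server is probably a persistent internal index rather than shipping venvs around. A sketch using pypiserver (a third-party tool, not part of DataHub) to expose the wheelhouse built earlier as a proper "simple" index; the CLI shown is the pypiserver 1.x syntax, and the port and directory are placeholders.

```bash
# On the secured server (pypiserver itself installed from the wheelhouse):
pip install pypiserver

# Serve the wheelhouse as a PEP 503 "simple" index.
# (pypiserver 1.x syntax; 2.x uses `pypi-server run ...` instead.)
pypi-server -p 8080 /srv/wheelhouse

# Then point the executor's resolver at it, as in the env-var sketch above:
export PIP_INDEX_URL=http://localhost:8080/simple/
export UV_INDEX_URL=http://localhost:8080/simple/
```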