Eliminate internet access requirement for downloading PyPi dependencies when Jobs run
Currently the databricks-labs-ucx wheel package with the code that the Jobs run has a dependency on the databricks-sdk PyPI package, which needs to be downloaded from the internet. Most Enterprise customers have internet access to the public PyPI repo disabled, which prevents the Jobs from running. Instead, all dependencies can be downloaded to DBFS during generation and creation of the Jobs for the Workspace and sourced from that DBFS location with an init script. The following steps can be used for this:
- Use pip download to download all dependencies of a whl package to storage. You can specify the -d flag with a destination path to download to that location, for example pip download -d /mylibs. If you create a requirements.txt file for all your packages (you can do this for all already-installed packages by running pip freeze > requirements.txt), then you can just run pip download -r requirements.txt to download all required packages.
- To then be able to install from that location, you need to copy the downloaded packages to a Workspace DBFS location. So if you copied the packages to the DBFS:/mylibs location, you can use the pip install --no-index -f /dbfs/mylibs command. This command will install and resolve dependencies from the path you specify with the -f flag; the --no-index flag tells pip not to look in the default PyPI repo. Note that this needs to be done in an init script, as the default Job library dependency configuration does not support specifying additional pip install flags like -f and --no-index above.
Related to:
- https://github.com/databrickslabs/ucx/issues/1323
What happened is the job loaded the ucx.whl file, which then references the databricks-sdk dependency via PyPI.
@Leone67 I think you're writing about another package - databricks-sdk-ucx package doesn't exist. Please clarify
Ok, yes, it's databricks-labs-ucx. There's only one whl package that gets created and added as a library for Jobs; that's the one. It has other dependencies which end up needing to be downloaded from PyPI when the Job Tasks start.
My understanding is that some of the dependent packages are not cross-platform, so if UCX is installed from macOS or Windows, those packages may fail in Databricks clusters; pip then searches for alternatives on PyPI and fails. In particular, the main package causing this issue is PyYAML-6.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
@Leone67 I was in the same situation where my workspace has no access to PyPI. I tried downloading the packages, adding them to init scripts, and updating the jobs to use the init scripts at the start of job clusters. I still end up with failed Jobs and was not able to resolve it. The initial job failures referenced "DRIVER_LIBRARY_INSTALLATION_FAILURE", which is due to blocked PyPI. After going through the dependencies and covering them in the init scripts, the job started failing with "Python wheel with name databricks_labs_ucx could not be found", though it is added as a dependent library by the ucx repo. Let me know if you are able to get through it without PyPI access to the workspace.
Hi @karthikgk25, some packages are OS- and CPU-specific, so what you download on your Mac won't work on Linux. The easiest fix is to spin up a Databricks cluster (driver only is fine) and download there. It will then download the correct packages to run on the same OS. You can then copy them as required.
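As an alternative to downloading on a cluster, pip itself can be asked to fetch Linux wheels from a macOS or Windows machine via its cross-platform download flags. The platform, Python-version, and implementation tags below are assumptions and must match the cluster's Databricks Runtime:

```shell
# Sketch: fetch Linux-compatible wheels directly from a Mac/Windows laptop
# instead of a Databricks cluster. Adjust the tags to your cluster's runtime.
pip download databricks-labs-ucx \
    --dest ./mylibs \
    --only-binary=:all: \
    --platform manylinux2014_x86_64 \
    --python-version 3.10 \
    --implementation cp
```

Note that pip requires --only-binary=:all: whenever --platform is given, so any dependency that ships only an sdist (no prebuilt wheel) still has to be built on a matching Linux machine.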
@Leone67 Will give a try and let you know. Thanks!
@Leone67 No luck. All the dependent packages are Linux-specific. Same issue.
Definitely worked for me. Isn't the package name databricks-labs-ucx, not databricks_labs_ucx? Do you have underscores in the name?
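For what it's worth, pip treats hyphens and underscores in project names as equivalent (PEP 503 name normalization), while wheel filenames always use underscores, which may explain why both spellings appear in the error messages above. A minimal sketch of the normalization rule:

```python
import re

def normalize(name: str) -> str:
    """PEP 503 project-name normalization, as applied by pip and PyPI."""
    return re.sub(r"[-_.]+", "-", name).lower()

# Both spellings normalize to the same canonical name, so pip treats
# them as the same project:
print(normalize("databricks_labs_ucx"))  # databricks-labs-ucx
print(normalize("databricks-labs-ucx"))  # databricks-labs-ucx
```

So the underscore spelling in the error is likely the wheel-filename form (databricks_labs_ucx-...-py3-none-any.whl) rather than a different package.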