
Eliminate internet access requirement for downloading PyPi dependencies when Jobs run

Open Leone67 opened this issue 2 years ago • 9 comments

Currently the databricks-labs-ucx wheel package with the code that the Jobs run has a dependency on the databricks-sdk PyPI package, which needs to be downloaded from the internet. Most enterprise customers have internet access to the public PyPI repository disabled, which prevents the Job from running. Instead, all dependencies can be downloaded to DBFS during generation and creation of Jobs for the Workspace and sourced from that DBFS location with an init script. The following steps can be used:

  1. Use pip download to download all dependencies of a whl package to storage. You can specify the -d flag with a destination path to download to that location, for example pip download -d /mylibs. If you create a requirements.txt file for all your packages (you can do this for all already installed packages by running pip freeze > requirements.txt), then you can just run pip download -r requirements.txt to download all required packages.

  2. To then be able to install from that location, you need to copy the downloaded packages to a Workspace DBFS location. So if you copied the packages to dbfs:/mylibs, you can use the pip install --no-index -f /dbfs/mylibs command. This command deploys and resolves dependencies from the path you specify with the -f flag; the --no-index flag tells pip not to look in the default PyPI repository. Note that this would need to be done in an init script, as the default Job library dependency configuration does not support specifying additional pip install flags like -f and --no-index above.
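The two steps above can be sketched roughly as follows; the paths (/mylibs, dbfs:/mylibs) are example locations, and the DBFS copy command assumes the Databricks CLI is configured:

```shell
# Step 1: capture the current environment and download all wheels locally.
# Run this on a machine matching the cluster's OS/CPU (see later comments).
pip freeze > requirements.txt
pip download -r requirements.txt -d /mylibs

# Copy the downloaded wheels to DBFS (assumes a configured Databricks CLI).
databricks fs cp --recursive /mylibs dbfs:/mylibs

# Step 2: inside a cluster init script, install offline from the DBFS mirror.
# --no-index disables PyPI lookups; -f points pip at the local wheel directory.
pip install --no-index -f /dbfs/mylibs databricks-labs-ucx
```

Because Job library configuration does not accept extra pip flags, the last command belongs in an init script attached to the job cluster.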

Related to:

  • https://github.com/databrickslabs/ucx/issues/1323

Leone67 avatar Nov 13 '23 13:11 Leone67

What happened is that the job loaded the ucx.whl file, which then references the databricks-sdk dependency via PyPI.

dmoore247 avatar Nov 13 '23 14:11 dmoore247

@Leone67 I think you're writing about another package - databricks-sdk-ucx package doesn't exist. Please clarify

nfx avatar Nov 14 '23 07:11 nfx

Ok, yes, it's databricks-labs-ucx. There's only one whl package that gets created and added as a library for Jobs; that's the one. It has other dependencies which end up needing to be downloaded from PyPI when the Job tasks start.

Leone67 avatar Nov 14 '23 09:11 Leone67

My understanding is that some dependent packages are not cross-platform, so if UCX is installed from macOS or Windows, the dependent packages may fail on Databricks clusters; pip then searches for alternatives on PyPI and fails. In particular, the main package causing this issue is PyYAML-6.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
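One way to work around the cross-platform wheel problem is pip's ability to download wheels for a target platform other than the one you are running on. A hedged sketch, where the platform tag and Python version are assumptions that should be matched to the actual cluster runtime:

```shell
# Download Linux (manylinux) wheels from a macOS/Windows machine by pinning
# the target platform. --only-binary=:all: is required when cross-downloading,
# since pip cannot build sdists for a foreign platform.
python3 -m pip download databricks-labs-ucx \
  --platform manylinux2014_x86_64 \
  --python-version 3.10 \
  --only-binary=:all: \
  -d ./linux-libs
```

This fetches wheels such as the PyYAML manylinux build mentioned above, rather than wheels for the local OS.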

aminmovahed-db avatar Nov 16 '23 02:11 aminmovahed-db

@Leone67 I was in the same situation where my workspace has no access to PyPI. I tried downloading the packages, adding them to init scripts, and updating the jobs to use the init scripts at job cluster start. I still end up with failed Jobs and could not resolve it. The initial job failures referenced "DRIVER_LIBRARY_INSTALLATION_FAILURE", which is due to blocked PyPI. After going through the dependencies and including them in init scripts, the job started failing with "Python wheel with name databricks_labs_ucx could not be found", even though it is added as a dependent library by the ucx repo. Let me know if you are able to get through it without PyPI access in the workspace.

karthikgk25 avatar Nov 20 '23 18:11 karthikgk25

Hi @karthikgk25, some packages are OS- and CPU-specific, so wheels downloaded on your Mac won't work on Linux. The easiest approach is to spin up a Databricks cluster (driver-only is fine) and download there. Pip will then download the correct packages to run on the same OS. You can then copy them as required.
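The cluster-side download suggested here can be sketched as a notebook shell cell run on the driver; /dbfs/mylibs is an assumed DBFS path:

```shell
# Run in a Databricks notebook %sh cell on a driver-only cluster that still
# has PyPI access. Because the driver is Linux/CPython matching the job
# clusters, pip downloads the correct manylinux wheels.
pip download databricks-labs-ucx -d /dbfs/mylibs

# Verify the wheels landed in DBFS before pointing init scripts at them.
ls /dbfs/mylibs
```

Once the wheels are in DBFS, locked-down job clusters can install from that path with --no-index as described earlier in the thread.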

Leone67 avatar Nov 20 '23 19:11 Leone67

@Leone67 Will give a try and let you know. Thanks!

karthikgk25 avatar Nov 20 '23 20:11 karthikgk25

@Leone67 No luck. All the dependent packages are Linux-specific. Same issue.

karthikgk25 avatar Nov 27 '23 13:11 karthikgk25

Definitely worked for me. Isn't the package name databricks-labs-ucx, not databricks_labs_ucx? Do you have underscores in the name?

Leone67 avatar Nov 27 '23 13:11 Leone67