[FEATURE]: Support for HTTP Proxy Configuration in Air-gapped Databricks Workspaces
Is there an existing issue for this?
- [X] I have searched the existing issues
Problem statement
After UCX installation, running the assessment workflow in an air-gapped Databricks workspace is not possible.
Some tasks in the assessment workflow require downloading libraries from the internet, for example the "setup_tacl" task. Our workspace cannot access the internet without going through the HTTP proxy, so these tasks fail.
Proposed Solution
- There needs to be a way to pass the proxy configuration as a parameter to the tasks in the workflow that require installing libraries from the internet.
- One possible solution is to use init scripts. Example of the init script, applied to the main and tacl clusters:
export HTTPS_PROXY=http://myproxy:8080
export HTTP_PROXY=http://myproxy:8080
export https_proxy=http://myproxy:8080
export http_proxy=http://myproxy:8080
pip install databricks-labs-blueprint
pip install databricks-sdk==0.24.0
pip install pyyaml
pip install databricks-labs-lsql==0.3.0
Could you please propose a new version that supports running workflows in air-gapped workspaces?
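To make the request concrete, here is a minimal sketch of a task that takes the proxy as a parameter and hands it to pip. It is not how ucx works today: the helper names `proxy_env` and `pip_install_command` are hypothetical, and the proxy URL is the placeholder from the init script. It relies only on pip's `--proxy` option and the conventional proxy environment variables.

```python
import os
import sys

# The same four variables the init script exports.
PROXY_VARIABLES = ("HTTPS_PROXY", "HTTP_PROXY", "https_proxy", "http_proxy")


def proxy_env(proxy_url, base_env=None):
    """Return a copy of the environment with all proxy variables set."""
    env = dict(os.environ if base_env is None else base_env)
    for name in PROXY_VARIABLES:
        env[name] = proxy_url
    return env


def pip_install_command(packages, proxy_url=None):
    """Build a pip install command, adding --proxy only when a proxy is given."""
    command = [sys.executable, "-m", "pip", "install"]
    if proxy_url:
        command += ["--proxy", proxy_url]
    return command + list(packages)
```

A task could then run `subprocess.run(pip_install_command(["pyyaml"], proxy), env=proxy_env(proxy), check=True)`, with `proxy` coming from the workflow parameters.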
Additional Context
Related to:
- https://github.com/databrickslabs/ucx/issues/573
@kfarhane28 Does that proxy env variable exist on the machine which runs the install command? Is the proxy same in both cases?
@nfx In fact, the HTTP proxy used in the cloud is different from the proxy used on the machine running the install. The install is done from my on-premises Windows laptop.
I researched this issue; here is the summary:
Where is the HTTP proxy relevant?
| Situation | Comment |
|---|---|
| When installing ucx dependencies at Databricks runtime, for example at the start of the assessment workflow. | #573 resolves this by uploading the dependencies (as wheels) to Databricks when installing ucx |
| When installing ucx using `databricks labs install ucx` | The proxy environment variables can be set in the shell running the installation command, similar to the init script in the top comment. To be determined: whether the proxy environment variables are sufficient or whether support for passing the proxy settings to the installation is required. |
| When installing a non-ucx dependency at Databricks runtime, for example with a `%pip install ...` in a notebook | Relevant when resolving dependencies for linting. It is unclear whether this matters for air-gapped Databricks workspaces: either the pip command does not work there, so it is unlikely to appear in a notebook, or it works with optional pip install flags, in which case those flags are passed to the pip install called by the linter. |
@kfarhane28 : Could you verify if you agree with the above table? Specifically, did #573 resolve the issue of installing ucx's dependencies at Databricks runtime and could you install ucx from your machine using the proxy environment variables?
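To help answer the question about the install machine, a small check like the one below could be run there. This is only a sketch: the function names `configured_proxies` and `uses_single_proxy` are hypothetical, and it looks at the four conventional proxy variables only, not at system-level proxy settings.

```python
import os

PROXY_VARIABLES = ("HTTPS_PROXY", "HTTP_PROXY", "https_proxy", "http_proxy")


def configured_proxies(env=None):
    """Return the proxy variables that are set and non-empty in env."""
    env = os.environ if env is None else env
    return {name: env[name] for name in PROXY_VARIABLES if env.get(name)}


def uses_single_proxy(env=None):
    """True when at least one proxy variable is set and all point at the same URL."""
    return len(set(configured_proxies(env).values())) == 1
```

Calling `configured_proxies()` without arguments inspects the current shell environment, which is the one `databricks labs install ucx` would inherit.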
Wheel house
A wheel house is the collection of wheels needed to install ucx and its dependencies. In its simplest form, these wheels are kept inside the ucx GitHub repository: wheelhouse/....whl.
Approaches for creating a wheel house:
- Wheelhouse
  - Approach: Keep ucx's dependencies as (binary) wheels in the GitHub repository
  - Pros:
    - Ucx's dependencies are kept inside its GitHub repository, thus available at installation
  - Considerations:
    - The wheelhouse tool is archived
    - Its owners removed the dependency on wheelhouse in their example project
    - The package shows few users (in the form of stars)
- pip-tools
  - Approach: Use `pip-tools` to pin ucx dependencies. Install and store those in the ucx repository, like our own "wheel house".
  - Pros:
    - Wheelhouse owners mentioned they switched to pip-tools
    - `pip-tools` is mentioned in the official Python packaging guide as a way to create lock files
    - Lots of usage (stars)
    - Fairly recent activity: last release is in March
  - Considerations:
    - We need to make sure cross-environment installations work. At the least, the wheels committed to the ucx "wheel house" inside this repository should install on the Databricks runtime used by ucx.
    - Use the `pip-sync` command to install the libraries in the ucx "wheel house" inside this repository. In theory, this should be possible by using `--pip-args` to pass `--target ./wheelhouse` to the underlying `pip install` calls.
- hatch
  - Approach: Install ucx and its dependencies using hatch, then `pip freeze` to lock the dependencies. Use the `pip freeze` output to install the locked dependencies into a "wheel house" inside the ucx repo
  - Pros:
    - Already used by ucx
    - Lots of usage (stars)
    - Recent activity: latest release at the end of May
  - Considerations:
    - "there is no support for re-creating an environment given a set of dependencies in a reproducible manner"; does not support lock files. Could be circumvented by installing the dependencies using hatch and then `pip freeze`, but that has less guarantee to be reproducible.
    - No functionality for upgrading (specific) dependencies; we can regenerate the `pip freeze` and use git to track the diffs
    - We need to make sure cross-environment installations work. At the least, the wheels committed to the ucx "wheel house" inside this repository should install on the Databricks runtime used by ucx.
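For the pip-tools approach, the commands could be assembled roughly as follows. This is a sketch under assumptions, not a tested recipe: it uses `pip download` to fill the wheel house rather than the `pip-sync --pip-args` route mentioned above, and the Python version, platform tag, file names, and function names are all placeholders that would have to match the Databricks runtime ucx targets.

```python
import sys


def compile_command(source="pyproject.toml", output="requirements.txt"):
    """pip-compile resolves the dependencies and pins them in a lock file."""
    return ["pip-compile", "--output-file", output, source]


def download_command(requirements="requirements.txt", dest="wheelhouse",
                     python_version="3.10", platform="manylinux2014_x86_64"):
    """pip download fetches wheels for a target environment without installing them."""
    return [
        sys.executable, "-m", "pip", "download",
        "--requirement", requirements,
        "--dest", dest,
        # pip requires binary-only downloads when targeting another platform.
        "--only-binary", ":all:",
        "--python-version", python_version,  # assumed runtime Python version
        "--platform", platform,              # assumed runtime platform tag
    ]
```

Running both commands with `subprocess.run(..., check=True)` on a machine with internet access would produce a `wheelhouse` directory that can be committed. Whether every dependency publishes a wheel for the chosen platform still needs to be verified.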
Whichever approach is used to create the wheel house, it implies keeping binaries in ucx's GitHub repository.
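Because the wheel house holds binaries, it helps to know which of them are tied to a platform. The sketch below reads the tags from a wheel's file name, following the standard wheel naming scheme (name, version, optional build tag, Python tag, ABI tag, platform tag). The function names are hypothetical and the file names in the usage note are illustrative only.

```python
def wheel_tags(filename):
    """Split a wheel file name into (python tag, abi tag, platform tag)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    # name-version-python-abi-platform, with an optional build tag after version
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel file name: {filename}")
    return tuple(parts[-3:])


def is_platform_independent(filename):
    """True for pure-Python wheels, which install on any platform."""
    _, abi_tag, platform_tag = wheel_tags(filename)
    return abi_tag == "none" and platform_tag == "any"
```

For example, `is_platform_independent("databricks_sdk-0.24.0-py3-none-any.whl")` is true, while a file name ending in `-cp310-cp310-manylinux2014_x86_64.whl` is platform specific and is where the cross-environment concern applies.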
Suggestions
1. Single way of installing ucx's dependencies at Databricks runtime:
   a. Always upload the wheels when installing ucx to Databricks
   b. Always install the dependencies at Databricks runtime by referencing a wheel. To be sure, we add the `--no-index` flag to the `pip install` command so that PyPI is not used.
2. Create a "wheel house" inside the ucx repository using `pip-tools` and update the ucx install script to use these wheels for step 1.
3. Alternative to 2, support passing pip install flags during ucx installation, similar to pip-tools' `--pip-args`, to allow users to pass the `--proxy` flag to the `pip install`.
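The two install variants in these suggestions could look roughly like this. It is a sketch only: the function names and the `pip_args` parameter are hypothetical and not part of the current ucx installer, while `--no-index`, `--find-links`, and `--proxy` are existing pip options.

```python
import shlex
import sys


def wheelhouse_install_command(packages, wheelhouse="wheelhouse"):
    """Install only from local wheels; --no-index keeps pip away from PyPI."""
    return [sys.executable, "-m", "pip", "install",
            "--no-index", "--find-links", wheelhouse, *packages]


def install_command_with_pip_args(packages, pip_args=""):
    """Forward user-supplied flags, such as --proxy, to pip install."""
    return [sys.executable, "-m", "pip", "install",
            *shlex.split(pip_args), *packages]
```

With the first variant no proxy is needed at all, since nothing is fetched from the internet. The second leaves the choice to the user, for example `install_command_with_pip_args(["pyyaml"], "--proxy http://myproxy:8080")`.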