
[FEATURE]: Support for HTTP Proxy Configuration in Air-gapped Databricks Workspaces

Open kfarhane28 opened this issue 2 years ago • 2 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Problem statement

After installing UCX, running the assessment workflow in an air-gapped Databricks workspace is not possible.

Some tasks in the assessment workflow, for example the "setup_tacl" task, need to download libraries from the internet. Our workspace cannot reach the internet except through the HTTP proxy, so these tasks fail.

Proposed Solution

  • There needs to be a way to pass the proxy configuration as a parameter to the tasks in the workflow that require installing libraries from the internet.

  • One possible solution is to use init scripts. Example of an init script, applied to the main and tacl clusters:

export HTTPS_PROXY=http://myproxy:8080
export HTTP_PROXY=http://myproxy:8080
export https_proxy=http://myproxy:8080
export http_proxy=http://myproxy:8080

pip install databricks-labs-blueprint
pip install databricks-sdk==0.24.0
pip install pyyaml
pip install databricks-labs-lsql==0.3.0

Could you please propose a new version that supports executing workflows in air-gapped workspaces?

Additional Context

Related to:

  • https://github.com/databrickslabs/ucx/issues/573

kfarhane28 avatar Apr 08 '24 00:04 kfarhane28

@kfarhane28 Does that proxy environment variable exist on the machine that runs the install command? Is the proxy the same in both cases?

nfx avatar Apr 09 '24 07:04 nfx

@nfx In fact, the HTTP proxy used in the cloud is different from the proxy used on the machine running the install. The install is done from my on-premise Windows laptop.

kfarhane28 avatar Apr 09 '24 08:04 kfarhane28

I researched this issue; here is the summary:

Where is the HTTP proxy relevant?

| Situation | Comment |
|---|---|
| When installing ucx dependencies at Databricks runtime, for example at the start of the assessment workflow. | #573 resolves this by uploading the dependencies (as wheels) to Databricks when installing ucx. |
| When installing ucx using databricks labs install ucx | The proxy environment variables can be set in the shell running the installation command, similar to the init script in the top comment. T.b.d. whether the proxy environment variables are sufficient or whether support for passing the proxy settings to the installation is required. |
| When installing a non-ucx dependency at Databricks runtime, for example with a %pip install ... in a notebook | Relevant when resolving dependencies for linting. Not sure whether it matters for air-gapped Databricks workspaces: either that pip command does not work there and is therefore unlikely to be defined, or it works with optional pip install flags, in which case those flags are passed to the pip install called by the linter. |

@kfarhane28 : Could you verify if you agree with the above table? Specifically, did #573 resolve the issue of installing ucx's dependencies at Databricks runtime and could you install ucx from your machine using the proxy environment variables?
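
To check which proxy variables the installing shell actually exposes, something like the following could be run before databricks labs install ucx. This is a minimal sketch; the helper name visible_proxy_settings is hypothetical and not part of ucx, and the variable names are the four spellings from the init script in the top comment.

```python
import os

# The four spellings set by the init script in the issue description.
PROXY_VARS = ("HTTPS_PROXY", "HTTP_PROXY", "https_proxy", "http_proxy")


def visible_proxy_settings(env=None):
    """Return the proxy variables that are set and non-empty in `env`.

    Hypothetical helper for illustration; defaults to the current process
    environment when no mapping is passed.
    """
    env = os.environ if env is None else env
    return {name: env[name] for name in PROXY_VARS if env.get(name)}


if __name__ == "__main__":
    # Print whatever proxy configuration this shell would hand to pip.
    for name, value in visible_proxy_settings().items():
        print(f"{name}={value}")
```

An empty result means the shell passes no proxy configuration through these variables.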

Wheel house

A wheel house is the collection of wheels needed to install ucx and its dependencies. In its simplest form, these wheels are kept inside the ucx GitHub repository: wheelhouse/....whl.

Approaches for creating a wheel house:

  • Wheelhouse
    • Approach : Keep ucx's dependencies as (binary) wheels in the GitHub repository
    • Pros :
      • Ucx's dependencies are kept inside its GitHub repository, thus available at installation
    • Considerations :
  • pip-tools
    • Approach : Use pip-tools to pin ucx dependencies. Install and store those in the ucx repository, like our own "wheel house".
    • Pros:
    • Considerations:
      • We need to make sure cross-environment installations work. At the least, the wheels created to be committed to the ucx "wheel house" inside this repository should install on the Databricks runtime used by ucx.
      • Use the pip-sync command to install the libraries into the ucx "wheel house" inside this repository. In theory, this should be possible by using --pip-args to pass --target ./wheelhouse to the underlying pip install calls.
  • hatch
    • Approach : Install ucx and its dependencies using hatch, then pip freeze to lock the dependencies. Use the pip freeze to install locked dependencies into a "wheel house" inside the ucx repo
    • Pros :
    • Considerations:
      • Hatch does not support lock files: "there is no support for re-creating an environment given a set of dependencies in a reproducible manner". This could be circumvented by installing the dependencies using hatch and then running pip freeze, but that gives a weaker guarantee of reproducibility.
      • No functionality for upgrading (specific) dependencies; we can regenerate the pip freeze output and use git to track the diffs.
      • We need to make sure cross-environment installations work. At the least, the wheels created to be committed to the ucx "wheel house" inside this repository should install on the Databricks runtime used by ucx.

Independent of the approach for creating the wheel house, it implies keeping binaries in ucx's GitHub repository.
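
For the cross-environment concern, one option is to let pip download fetch binary wheels for a target interpreter and platform instead of building them locally. The sketch below only assembles the command; the function name, the requirements file, and the target values shown in the usage note are assumptions for illustration, not what ucx does today.

```python
import sys


def wheelhouse_download_command(requirements_file, dest="wheelhouse",
                                python_version=None, platform=None):
    """Build a `pip download` command that fills a wheel house directory.

    Hypothetical helper. `--only-binary=:all:` is required by pip when
    `--python-version` or `--platform` target a different environment.
    """
    cmd = [
        sys.executable, "-m", "pip", "download",
        "--requirement", requirements_file,
        "--dest", dest,
        "--only-binary=:all:",
    ]
    if python_version:
        # Target interpreter version, e.g. the one on the Databricks runtime.
        cmd += ["--python-version", python_version]
    if platform:
        # Target platform tag, e.g. a manylinux tag for Linux clusters.
        cmd += ["--platform", platform]
    return cmd
```

It could be run with subprocess.run(wheelhouse_download_command("requirements.txt", python_version="3.10", platform="manylinux2014_x86_64"), check=True), where the version and platform are placeholders that would have to match the Databricks runtime actually used.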

Suggestions

  1. Single way of installing ucx's dependencies at Databricks runtime:
     a. Always upload the wheels when installing ucx to Databricks.
     b. Always install the dependencies at Databricks runtime by referencing a wheel. To be sure, we add the --no-index flag to the pip install command so that PyPI is not used.
  2. Create a "wheel house" inside the ucx repository using pip-tools and update the ucx install script to use these wheels for step 1.
  3. Alternative to 2, support passing pip install flags during ucx installation, similar to pip-tools' --pip-args, to allow users to pass the --proxy flag to the pip install.
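
The suggestions above could be sketched roughly as follows. This is an illustration under assumptions, not the ucx installer: the function and parameter names are hypothetical, and only the pip flags themselves (--no-index, --find-links, --proxy) are real pip options.

```python
import sys


def offline_install_command(package, wheelhouse=None, proxy=None,
                            extra_pip_args=()):
    """Build a `pip install` command for suggestions 1b and 3.

    Hypothetical helper. With `wheelhouse` set, PyPI is never consulted;
    with `proxy` set, pip is told to go through that proxy instead.
    """
    cmd = [sys.executable, "-m", "pip", "install"]
    if wheelhouse is not None:
        # Suggestion 1b: resolve only from local wheels, never from PyPI.
        cmd += ["--no-index", "--find-links", wheelhouse]
    if proxy is not None:
        # Suggestion 3: forward the user's proxy setting to pip.
        cmd += ["--proxy", proxy]
    # Any further user-supplied flags, similar in spirit to --pip-args.
    cmd += list(extra_pip_args)
    cmd.append(package)
    return cmd
```

For example, offline_install_command("databricks-labs-blueprint", wheelhouse="./wheelhouse") yields a command that installs only from the local wheel house, while passing proxy="http://myproxy:8080" instead covers the alternative in suggestion 3.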

JCZuurmond avatar Jul 17 '24 09:07 JCZuurmond