
[Feature] CLI Parameter for `packages-install-path`

Open · stevenayers opened this issue

Is this your first time submitting a feature request?

  • [X] I have read the expectations for open source contributors
  • [X] I have searched the existing issues, and I could not find an existing issue for this feature
  • [X] I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Add a CLI parameter for the packages-install-path, similar to how target-path has one.

In the docs, under target-path, it says:

Just like other global configs, it is possible to override these values for your environment or invocation by using the CLI option (--target-path) or environment variables (DBT_TARGET_PATH).
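
For concreteness, a rough sketch of the parallel being proposed. The --target-path override already works today (per the docs quoted above); the --packages-install-path flag is the hypothetical addition this issue asks for:

# existing: override the artifact output directory per invocation
dbt run --target-path /tmp/ci/target

# proposed (hypothetical until PR #9933 or similar is merged):
# override the package install directory the same way
dbt deps --packages-install-path /tmp/ci/dbt_packages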

Describe alternatives you've considered

Using the env var DBT_PACKAGES_INSTALL_PATH.

The issue here is that some orchestration tools, such as Databricks DBT Workflows, make setting environment variables very difficult. By adding this CLI parameter, we maintain consistency across global configs.

Who will this benefit?

People using orchestration tools with awkward limitations.

Are you interested in contributing this feature?

Yes, the PR is https://github.com/dbt-labs/dbt-core/pull/9933

stevenayers avatar Apr 13 '24 08:04 stevenayers

Thanks for opening this @stevenayers !

Can you share more about the specific use cases where combining a CLI flag with an environment variable is necessary or beneficial, versus simply including the packages-install-path configuration in dbt_project.yml?

dbeatty10 avatar Apr 16 '24 00:04 dbeatty10

Hi @dbeatty10, sure no problem! Let me break this down a bit.

Hardcoding packages-install-path

1. In scenarios where Docker containers are being used, this can create difficulties. I won't go into too much detail because it's been documented quite well in this issue: https://github.com/dbt-labs/dbt-core/issues/1710.

2. When you are dealing with a lot of orchestration/workflow systems, you will often find that each step does not share the same working directory as the previous one, and those directories can often be dynamic. Take this pipeline as an example:

  graph LR;
      A[dbt debug]-->B[dbt run];
      B-->C[dbt test];
      C-->D[dbt docs generate];

Each working directory could look something like /tmp/job-id/step-id

  • dbt debug: /tmp/1ad0ceb/ee74a60082b34c3a3d0df8a0d5d5cbfd7ec5ed6a
  • dbt run: /tmp/1ad0ceb/607646b627e80fe5e45545589fc8c09482010978
  • dbt test: /tmp/1ad0ceb/7e164e3ab723c357cb638ad6c1e1beef19a7fec6
  • dbt docs generate: /tmp/1ad0ceb/cb56f4fdc16d5a79953af3003645a1af5a000926

With this, you don't want to reinstall your deps at every stage; you want to reuse them. This is where, as in issue #1710, you will want to use an environment variable like:

config-version: 2
packages-install-path: "{{ env_var('DBT_PACKAGES_INSTALL_PATH', 'dbt_packages') }}"

You could set packages-install-path: "../dbt_packages", but that makes assumptions; sometimes you need shell-script logic to figure out what that directory path should be, as sketched below.
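
A minimal sketch of that shell logic, assuming a made-up JOB_ID variable and /tmp layout; the env var name matches the env_var() call in the dbt_project.yml snippet above:

# resolve a shared, absolute location once per step (JOB_ID is illustrative)
SHARED_DIR="/tmp/${JOB_ID}/shared"
mkdir -p "${SHARED_DIR}"

# feed the env_var() lookup in dbt_project.yml
export DBT_PACKAGES_INSTALL_PATH="${SHARED_DIR}/dbt_packages"

# only install packages if an earlier step hasn't already done so
[ -d "${DBT_PACKAGES_INSTALL_PATH}" ] || dbt deps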

3. Say you have set packages-install-path to /tmp/my_custom_packages_path so it can be shared between steps. What if you're also running your CI/CD test pipeline in that environment?

Your packages.yml changes in your feature branch, which updates the package contents in /tmp/my_custom_packages_path. Meanwhile, your live data pipeline is mid-run, and it fails because your feature branch has removed packages the live pipeline was still using.

This is where you'll want to do something like:

config-version: 2
packages-install-path: "{{ env_var('DBT_PACKAGES_INSTALL_PATH', 'dbt_packages') }}"

and in your pipeline you'll want to set DBT_PACKAGES_INSTALL_PATH to something like /tmp/${ENVIRONMENT}/dbt_packages.
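
For example (the directory layout is illustrative), each environment exports its own path so a CI run can never clobber the packages a production run is using:

# CI run for a feature branch
export DBT_PACKAGES_INSTALL_PATH="/tmp/ci/dbt_packages"
dbt deps && dbt build

# production run, isolated from CI
export DBT_PACKAGES_INSTALL_PATH="/tmp/prod/dbt_packages"
dbt deps && dbt run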

Flag vs env var for packages-install-path

As I mentioned in the original issue, setting an environment variable can be a pain in some workflow systems. It also isn't very consistent or clean: DBT_PACKAGES_INSTALL_PATH=/tmp/${ENVIRONMENT}/dbt_packages dbt run --target-path /tmp/${ENVIRONMENT}/target

You're setting config paths via two different methods.
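
With the flag proposed here (hypothetical until the linked PR, or something like it, is merged), both paths would be set the same way:

dbt run --target-path /tmp/${ENVIRONMENT}/target --packages-install-path /tmp/${ENVIRONMENT}/dbt_packages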

stevenayers avatar Apr 16 '24 06:04 stevenayers

Yesterday @jtcohen6 and I had a chance to discuss the proposed new CLI flag + environment variable.

Summary

The general case

We've approached the question of where flags can be set differently depending on the use case:

  • configuration settings in the dbt_project.yml file are reserved for things that don't change (very often) and are shared across users and invocations, whereas
  • CLI flags are used for things that may change very often (i.e. per invocation)

So generally, we don't let these be set in both places, and it would take a really compelling case for us to do so.
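
For context, target-path is one config where both already exist, per the docs quoted earlier in the thread: the stable default lives in the project file, and the flag handles a one-off override:

# stable default sits in dbt_project.yml (target-path: "target");
# a per-invocation override goes on the command line
dbt run --target-path /tmp/ci/target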

This specific case

In this case, it sounds like the main barrier is that setting environment variables is difficult within Databricks DBT Workflows. If this is the primary barrier, then we'd prefer not to add a new feature to dbt in order to work around it.

So we're closing this and the associated PR in https://github.com/dbt-labs/dbt-core/pull/9933 as not planned.

But if anyone can provide additional examples of why we should consider supporting a new --packages-install-path CLI flag (and associated DBT_PACKAGES_INSTALL_PATH environment variable) outside of Databricks DBT Workflows, we'd be willing to take another look.

dbeatty10 avatar Jul 10 '24 17:07 dbeatty10