meltano icon indicating copy to clipboard operation
meltano copied to clipboard

feat: add support for `env` key for `jobs`

Open rabidaudio opened this issue 2 years ago • 8 comments

See discussion in slack: https://meltano.slack.com/archives/C01TCRBBJD7/p1657233220239139

Consider example:

jobs:
- name: gitlab
  tasks:
  - dbt-postgres:seed tap-gitlab target-postgres dbt-postgres:snapshot dbt-postgres:run dbt-postgres:test

meltano run gitlab will run all the dbt commands for the whole project rather than allowing me to specify only the specific models that have changed (e.g. DBT_MODELS=+gitlab+).

I realize this can be solved by schedules:

schedules:
- name: gitlab
  interval: '@daily'
  job: gitlab
  env:
    DBT_MODELS: +gitlab+

However it seems to me that this information belongs to the job itself, not the schedule; that is even if I were to manually run meltano run gitlab outside of the @daily schedule, I’d still expect it to apply my configuration.

Proposal: Allow envs to be set on a job basis:

jobs:
- name: gitlab
  tasks:
  - dbt-postgres:seed tap-gitlab target-postgres dbt-postgres:snapshot dbt-postgres:run dbt-postgres:test
  env:
    DBT_MODELS: +gitlab+
    TARGET_POSTGRES_BATCH_SIZE_ROWS: 100 # could also be useful to override plugin configs

Precedence can get tricky here, but I would expect:

meltano environment < job < schedule < .env < system envrionment

rabidaudio avatar Jul 08 '22 15:07 rabidaudio

https://docs.meltano.com/guide/configuration#specifying-environment-variables is the current order

- environment level plugin env # highest
- environment level env
- root level plugin env
- root level env
- .env file
- terminal env # lowest

Assuming both jobs and schedules could specify env I would expect the following:

- environment level plugin env # highest
- environment level env
- schedule level env
- job level env
- root level plugin env
- root level env
- .env file
- terminal env # lowest

my thinking is that schedules could run multiple instances of a job and one would be more inclined to override a setting for a given job on a per schedule basis rather than the other way around. Does that track?

All that said I'm in favor of adding env for both jobs and schedules (I'll open a separate issue for schedules) - the last precedence question will take a bit more debate. cc @aaronsteers

tayloramurphy avatar Jul 08 '22 15:07 tayloramurphy

By a < b I meant b has higher precedence than a but I can see how that was ambiguous. Yes, I agree that schedules should have higher precedence than jobs.

I'm a little surprised that system env vars are overwritten by environments (for example, I might do FOO=override meltano run), but I don't think it's a problem and I'm sure y'all have considered it more than I have, and anyway that's a separate topic

rabidaudio avatar Jul 09 '22 01:07 rabidaudio

I'm a little surprised that system env vars are overwritten by environments (for example, I might do FOO=override meltano run), but I don't think it's a problem and I'm sure y'all have considered it more than I have, and anyway that's a separate topic

@rabidaudio feel free to open a new issue or slack thread if you want to discuss - but what I'll quickly say is that you can still get this behavior I think if you had

environments:
  prod:
    env:
      FOO: $MY_OVERRIDE

and then set MY_OVERRIDE or not in the system env. We did try to put a lot of thought into it and it may be worth adding some of that context in https://docs.meltano.com/guide/configuration#specifying-environment-variables 🤔

tayloramurphy avatar Jul 11 '22 19:07 tayloramurphy

@tayloramurphy @rabidaudio I landed on this while working on https://github.com/meltano/squared/issues/289 where I was trying to get away from exporting env vars before running a command like in this case https://github.com/meltano/squared/blob/64cffc745eba6721fb97490d5aa14f6f7a970afc/data/orchestrate/dag_definition.yml#L146 when using schedules/jobs. I was a able to solve it without schedule/job env vars but was about to request the same feature before I found this issue so I was giving it some thought also:

@rabidaudio I know this is just a solution for this particular problem but what do you think about putting your dbt model selection criteria in a custom command (Squared example)? I've been exploring this pattern so that every set of dbt selection criteria is checked in as code. I know its bit me in the past when selection criteria gets really precise and I either include or exclude something I shouldnt. Did you consider this for youre use case already? Is there any cons that I might be missing other than verboseness?

pnadolny13 avatar Jul 21 '22 20:07 pnadolny13

@cjohnhanson - Assigning to you for scoping/analysis as related to #6387. Please confirm the spec - especially resolution order - before starting dev.

aaronsteers avatar Jul 26 '22 15:07 aaronsteers

@tayloramurphy -- there's work in progress (see #5982) to update the precedence order to be in line with this spec in Miro. New precedence order would be:

- environment level plugin env # highest
- root level plugin env 
- environment level env 
- root level env
- .env file
- terminal env # lowest

Supporting env in schedules and jobs will happen after that work is completed. So in the new precedence order, where would job and schedule envs fit? Would it be: A)

- environment level plugin env # highest
- root level plugin env 
- schedule level env
- job level env
- environment level env # job and schedule take precedence over active environment
- root level env
- .env file
- terminal env # lowest

or B)

- environment level plugin env # highest
- root level plugin env 
- environment level env # active environment takes precedence over job and schedule
- schedule level env
- job level env
- root level env
- .env file
- terminal env # lowest

Or something else?

cjohnhanson avatar Jul 26 '22 18:07 cjohnhanson

@cjohnhanson I think this would be the best order:

- environment level plugin env # highest
- schedule level env
- job level env
- root level plugin env 
- environment level env 
- root level env
- .env file
- terminal env # lowest

My logic here is similar to what I said in https://github.com/meltano/meltano/issues/6386#issuecomment-1179138398 But basically, a job is a collection of plugins run at the same time and users would likely want to override plugin settings at the job level. Similarly, a schedule is a particular instance of a job and there can be multiple schedules per job, so overrides from there make sense above jobs. Then having the environment level plugin env as the highest priority is the last place you could override anything set lower in the hierarchy.

@aaronsteers would you agree on this?

tayloramurphy avatar Jul 26 '22 19:07 tayloramurphy

@aaronsteers @cjohnhanson I'm pulling this out of the next iteration. While I think this would be a useful thing to have, I don't think we need to currently build it ourselves. Thinking through the spec discussion has been useful and clarifying in relation to #6387 but adding support for env at the jobs level isn't critical right now.

I'll mark it as accepting PR's though in case the community is interested in having this.

tayloramurphy avatar Aug 02 '22 15:08 tayloramurphy

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

stale[bot] avatar Apr 26 '23 10:04 stale[bot]

More users interested in this in Slack: https://meltano.slack.com/archives/C01TCRBBJD7/p1690894749896989?thread_ts=1690894749.896989&cid=C01TCRBBJD7

edgarrmondragon avatar Aug 01 '23 14:08 edgarrmondragon