astronomer-cosmos

Add support for virtual env directory flag

Open LennartKloppenburg opened this issue 1 year ago • 21 comments

Description

Added virtualenv_dir as an option to ExecutionConfig which is then propagated downstream to DbtVirtualenvBaseOperator.

The following now happens:

  • If the flag is set, the operator will attempt to locate the venv's python binary under the provided virtualenv_dir.
    • If it finds one, it will conclude that the venv exists and continue without creating a new one.
    • If not, it will create a new venv at virtualenv_dir.
  • If the flag is not set, the operator simply continues using the temporary-directory solution that was already in place (a rough sketch of this logic follows below).
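
For illustration only, a rough sketch of this lookup logic, assuming a POSIX venv layout (`bin/python`) and the `virtualenv` package's `cli_run`; the helper name echoes the `_get_or_create_venv_py_interpreter` helper mentioned later in this thread, but the body below is not the PR's actual implementation:

```python
from pathlib import Path
from tempfile import mkdtemp
from typing import Optional

from virtualenv import cli_run


def _get_or_create_venv_py_interpreter(virtualenv_dir: Optional[Path]) -> str:
    """Return the path to a venv's python binary, creating the venv only if needed."""
    if virtualenv_dir is not None:
        py_bin = virtualenv_dir / "bin" / "python"
        if py_bin.exists():
            # The venv was already provisioned (e.g. by an earlier task): reuse it.
            return str(py_bin)
        # No python binary found under virtualenv_dir: create the venv there.
        cli_run([str(virtualenv_dir)])
        return str(py_bin)
    # Flag not set: fall back to the pre-existing temporary-directory behaviour.
    tmp_dir = mkdtemp(prefix="cosmos-venv-")
    cli_run([tmp_dir])
    return str(Path(tmp_dir) / "bin" / "python")
```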

Impact

A very basic test, using a local docker compose set-up as per the contribution guide and the example_virtualenv DAG, saw the DAG's runtime go down from 2m31s to just 32s. I'd expect this improvement to be even more noticeable with more complex graphs and more Python requirements.

Related Issue(s)

Implements #610

Breaking Change?

None, the flag is optional and is ignored (with a warning) when used outside of VirtualEnv execution mode.

Checklist

  • [ ] I have made corresponding changes to the documentation (if required)
  • [ ] I have added tests that prove my fix is effective or that my feature works

LennartKloppenburg avatar Oct 18 '23 13:10 LennartKloppenburg

Deploy Preview for amazing-pothos-a3bca0 processing.

Latest commit: be0de1a1b6675a05ff008fd09ac05a01c75fcd2d
Latest deploy log: https://app.netlify.com/sites/amazing-pothos-a3bca0/deploys/6582c61c78d17900084dc3d7

netlify[bot] avatar Oct 18 '23 13:10 netlify[bot]

Codecov Report

Attention: Patch coverage is 83.67347%, with 8 lines in your changes missing coverage. Please review.

Project coverage is 93.06%. Comparing base (090116e) to head (87c5da0). Report is 109 commits behind head on main.

:exclamation: Current head 87c5da0 differs from pull request most recent head be0de1a

Please upload reports for the commit be0de1a to get more accurate results.

Files | Patch % | Lines
cosmos/operators/virtualenv.py | 81.39% | 8 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #611      +/-   ##
==========================================
- Coverage   93.28%   93.06%   -0.23%     
==========================================
  Files          55       54       -1     
  Lines        2502     2163     -339     
==========================================
- Hits         2334     2013     -321     
+ Misses        168      150      -18     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Oct 18 '23 15:10 codecov[bot]

@tatiana I've updated the PR with some changes you've requested :)

One lingering issue: When there's no virtual env handy and multiple tasks are executed concurrently, they will all attempt the same checks (checking for the dir, for the virtual env, to install stuff etc.) which leads to all kinds of issues because they don't automatically "wait" for the venv to be created by one single operator. This occurs in environments that schedule multiple DBT tasks simultaneously -- which is obviously very common.

The issue can then be resolved by retrying these tasks with some retry_delay, which will achieve the same result as "waiting" for the virtual env to be provisioned would have done. What do you think of this?

LennartKloppenburg avatar Oct 23 '23 13:10 LennartKloppenburg

> @tatiana I've updated the PR with some changes you've requested :)
>
> One lingering issue: When there's no virtual env handy and multiple tasks are executed concurrently, they will all attempt the same checks (checking for the dir, for the virtual env, to install stuff etc.) which leads to all kinds of issues because they don't automatically "wait" for the venv to be created by one single operator. This occurs in environments that schedule multiple DBT tasks simultaneously -- which is obviously very common.
>
> The issue can then be resolved by retrying these tasks with some retry_delay, which will achieve the same result as "waiting" for the virtual env to be provisioned would have done. What do you think of this?

I might have an idea...

What if we implement a simple locking mechanism (à la Terraform)? Let's say the first task that is scheduled creates a lock file inside the virtual env directory and releases (removes) it once the virtual env has been set up. Then other tasks first check if this lock file exists, and "wait" for it to be gone. Once it is, they just return the python binary the way the helper _get_or_create_venv_py_interpreter has done so far. dbt deps will still be installed.

This should also be okay in a distributed set-up, given that it only relies on the local file system.
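
For illustration, a minimal sketch of that locking idea; the sentinel file name, the polling interval, and the timeout are all illustrative rather than part of the PR:

```python
import time
from pathlib import Path

LOCK_NAME = "cosmos_venv.lock"  # illustrative sentinel file name


def acquire_or_wait(virtualenv_dir: Path, timeout: float = 600.0) -> bool:
    """Return True if this task should create the venv, False if another task already did."""
    virtualenv_dir.mkdir(parents=True, exist_ok=True)
    lock = virtualenv_dir / LOCK_NAME
    try:
        # exist_ok=False makes touch() atomic (O_CREAT | O_EXCL under the hood),
        # so only one task can win the race and become the creator.
        lock.touch(exist_ok=False)
        return True
    except FileExistsError:
        # Another task is provisioning the venv: poll until the lock disappears.
        deadline = time.monotonic() + timeout
        while lock.exists():
            if time.monotonic() > deadline:
                raise TimeoutError(f"Gave up waiting for {lock}")
            time.sleep(5)
        return False


def release(virtualenv_dir: Path) -> None:
    """Remove the lock once the venv (and dbt deps) have been set up."""
    (virtualenv_dir / LOCK_NAME).unlink(missing_ok=True)
```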

LennartKloppenburg avatar Oct 23 '23 14:10 LennartKloppenburg

> @tatiana I've updated the PR with some changes you've requested :) One lingering issue: When there's no virtual env handy and multiple tasks are executed concurrently, they will all attempt the same checks (checking for the dir, for the virtual env, to install stuff etc.) which leads to all kinds of issues because they don't automatically "wait" for the venv to be created by one single operator. This occurs in environments that schedule multiple DBT tasks simultaneously -- which is obviously very common. The issue can then be resolved by retrying these tasks with some retry_delay, which will achieve the same result as "waiting" for the virtual env to be provisioned would have done. What do you think of this?
>
> I might have an idea...
>
> What if we implement a simple locking mechanism (à la Terraform)? Let's say the first task that is scheduled creates a lock file inside the virtual env directory and releases (removes) it once the virtual env has been set up. Then other tasks first check if this lock file exists, and "wait" for it to be gone. Once it is, they just return the python binary the way the helper _get_or_create_venv_py_interpreter has done so far. dbt deps will still be installed.
>
> This should also be okay in a distributed set-up, given that it only relies on the local file system.

I like the locking idea! I’ve seen it work well in other places. A couple of considerations:

  • what happens if the process that’s placed the lock hangs? Do we need a TTL on it?
  • what happens if there are errors when installing the requirements? Will the lock be released and other tasks try to create the venv again?
  • what happens if the requirements change while the directory/lock is there? This may be more general than just the lock mechanism

jlaneve avatar Oct 23 '23 14:10 jlaneve

Thanks so much for your answer @jlaneve :)

> I like the locking idea! I’ve seen it work well in other places. A couple of considerations:
>
>   • what happens if the process that’s placed the lock hangs? Do we need a TTL on it?

That's a good question. A TTL would be ideal, but I'm not sure that's trivial across different operating systems. Could we get by with temporary files somehow?

>   • what happens if there are errors when installing the requirements? Will the lock be released and other tasks try to create the venv again?

That's a good point -- I was thinking about a situation where dependencies are project-wide and there are no groups or individual tasks that have their own dependencies, but that might not be true. In that case, a lock release should give other tasks the opportunity to acquire the lock and install their own dependencies. There's a big caveat here, though: since dbt-project-wide dependencies are still propagated to every individual task, a faulty dependency would very likely result in numerous, repeatedly failing attempts this way. How robust do we want to be here? :) I'd lean towards "if one thing fails, let it fail hard" and give up upon failure, because if a project-wide dependency fails, can the DAG even attempt to run dbt models in the first place?

>   • what happens if the requirements change while the directory/lock is there? This may be more general than just the lock mechanism

A low-level solution: we could hash/freeze the requirements into the lock file and compare them for changes when checking/acquiring the lock. A more Airflow-focused solution: append/prepend the DAG run's ID (or any combination of identifiers) to the directory so a new virtual env is created for every new combination. The obvious con being a possibly bloated local file system and the fact that new requirements wouldn't be picked up until a new DAG run.
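
For illustration, a small sketch of the hashing idea; the marker file name and the assumption that the requested requirements are available as a plain list (e.g. the operator's py_requirements) are both illustrative:

```python
import hashlib
import json
from pathlib import Path
from typing import List


def requirements_fingerprint(py_requirements: List[str]) -> str:
    """Stable hash of the requested requirements, stored next to (or inside) the lock file."""
    canonical = json.dumps(sorted(py_requirements))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def venv_is_stale(virtualenv_dir: Path, py_requirements: List[str]) -> bool:
    """Compare the stored fingerprint with the currently requested requirements."""
    marker = virtualenv_dir / "requirements.sha256"  # illustrative marker file name
    if not marker.exists():
        return True
    return marker.read_text().strip() != requirements_fingerprint(py_requirements)
```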

Just spitballing here, since I'd like to hash these things out here before doing it in code :)

LennartKloppenburg avatar Oct 23 '23 15:10 LennartKloppenburg

Hey @LennartKloppenburg and @jlaneve , I see you advanced in the conversation and possibilities - I really like the way this is heading.

Following the caching suggestion by @LennartKloppenburg , what if we run pip freeze and check whether the dependencies/versions given by the user in py_system_site_packages are already installed, and only create the virtualenv if not? This may reduce the concurrency issues a bit, although it does not solve the problem if two processes start at the same time.

I also like having a "lock file" with the process's PID running pip install. Other processes could check for this file to see if the process with the given PID is running. If the file is empty or the PID it referred to does not relate to a "running" process, it would be safe for the new process to set its pid and try to run pip install. There will be edge cases, of course, but this could be a first step towards this idea.
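
For illustration, a rough sketch of that PID check, using `os.kill(pid, 0)` as a liveness probe (POSIX only); the lock-file layout is illustrative:

```python
import os
from pathlib import Path


def lock_is_active(lock_file: Path) -> bool:
    """Return True only if the lock file names a PID that still belongs to a running process."""
    try:
        pid = int(lock_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # no lock, or an empty/corrupt one: safe for a new task to take over
    try:
        os.kill(pid, 0)  # signal 0 performs no action; it only checks that the process exists
    except ProcessLookupError:
        return False  # stale lock left behind by a process that is gone
    except PermissionError:
        return True  # the process exists but is owned by another user
    return True
```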

@LennartKloppenburg @jlaneve What do you think would be a minimum of concurrency handling for us to get this PR merged? We can always make follow-up PRs to improve the implementation.

Perhaps we could have one strategy that tries to minimize concurrency issues, and document that if multiple worker processes are running on the same node there may be concurrency issues, and that the current workaround for this is to either not use virtualenv_dir or to run them on separate nodes.

tatiana avatar Oct 24 '23 10:10 tatiana

> Hey @LennartKloppenburg and @jlaneve , I see you advanced in the conversation and possibilities - I really like the way this is heading.
>
> Following the caching suggestion by @LennartKloppenburg , what if we run pip freeze and check whether the dependencies/versions given by the user in py_system_site_packages are already installed, and only create the virtualenv if not? This may reduce the concurrency issues a bit, although it does not solve the problem if two processes start at the same time.

So you're suggesting to run pip freeze when an existing virtualenv is encountered, to check whether the requirements overlap? That sounds good! To keep things simple we could always just run pip install 'come what may', since pip is usually quite good at determining changes itself (especially with dependencies of dependencies, as pip freeze will yield everything, not just the top-level dependencies in py_system_site_packages), but there'd be a bit of overhead of course. Perhaps the question is one of speed over completeness here?
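
For illustration, a rough sketch of the pip freeze comparison; it only handles exact `name==version` pins and ignores package-name normalisation, which is part of why always re-running pip install is the simpler option:

```python
import subprocess
from typing import List


def venv_satisfies(venv_py: str, py_requirements: List[str]) -> bool:
    """Check whether an existing venv already has every exactly-pinned requirement installed."""
    frozen = subprocess.check_output([venv_py, "-m", "pip", "freeze"], text=True)
    installed = {}
    for line in frozen.splitlines():
        if "==" in line:
            name, version = line.split("==", 1)
            installed[name.strip().lower()] = version.strip()
    for req in py_requirements:
        if "==" not in req:
            return False  # unpinned or complex requirement: fall back to re-running pip install
        name, version = req.split("==", 1)
        if installed.get(name.strip().lower()) != version.strip():
            return False
    return True
```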

> I also like having a "lock file" with the process's PID running pip install. Other processes could check for this file to see if the process with the given PID is running. If the file is empty or the PID it referred to does not relate to a "running" process, it would be safe for the new process to set its pid and try to run pip install. There will be edge cases, of course, but this could be a first step towards this idea.

That's great, the PID will help with possibly abandoned processes. Do task executions always get their own PID? And the plan then is to let every task run pip install rather than only do it once? This ties into the previous bit about the pip freeze suggestion! Just confirming :)

> @LennartKloppenburg @jlaneve What do you think would be a minimum of concurrency handling for us to get this PR merged? We can always make follow-up PRs to improve the implementation.
>
> Perhaps we could have one strategy that tries to minimize concurrency issues, and document that if multiple worker processes are running on the same node there may be concurrency issues, and that the current workaround for this is to either not use virtualenv_dir or to run them on separate nodes.

Yep, as a new contributor I'd be happy to defer that kind of decision to you -- not sure how to assess the completeness of this feature and what is acceptable and what isn't :) Checking for the PID in the lockfile should already help and is a small patch on top of what we have now. I already did some work on the lockfile mechanism. If you agree, I can sync that work together with having a PID in there and perhaps the pip requirements?

LennartKloppenburg avatar Oct 26 '23 10:10 LennartKloppenburg

Hi @LennartKloppenburg!

> So you're suggesting to run pip freeze when an existing virtualenv is encountered, to check whether the requirements overlap? That sounds good! To keep things simple, we could always just run pip install 'come what may', since pip is usually quite good at determining changes itself (especially with dependencies of dependencies, as pip freeze will yield everything, not just the top-level dependencies in py_system_site_packages), but there'd be a bit of overhead of course. Perhaps the question is one of speed over completeness here?

+1 to always run pip install - feels the safest!

> Checking for the PID in the lockfile should already help and is a small patch on top of what we have now. I already did some work on the lockfile mechanism. If you agree, I can sync that work together with having a PID in there and perhaps the pip requirements?

Yes, let's do this - it feels like an overall improvement - and we can improve and iterate over time.

tatiana avatar Oct 30 '23 14:10 tatiana

@tatiana I want to test things a bit more thoroughly, but I think these changes should cover the gist of what we discussed :)

LennartKloppenburg avatar Nov 06 '23 12:11 LennartKloppenburg

Thanks for addressing all the feedback and working on this thoroughly, @LennartKloppenburg ! Just had a quick look, and we're close to getting this change merged.

Please let us know once you're happy with it. When you have a chance, please also rebase.

tatiana avatar Nov 08 '23 13:11 tatiana

> Please let us know once you're happy with it. When you have a chance, please also rebase.

Awesome, will do :) I plan to work on it again later today, have to focus on some other topics first!

LennartKloppenburg avatar Nov 09 '23 11:11 LennartKloppenburg

> Thanks for addressing all the feedback and working on this thoroughly, @LennartKloppenburg ! Just had a quick look, and we're close to getting this change merged.
>
> Please let us know once you're happy with it. When you have a chance, please also rebase.

Hi @tatiana You can have another look now :)

LennartKloppenburg avatar Nov 15 '23 13:11 LennartKloppenburg

@tatiana I rebased as well, but find that some tests are now failing. I have no idea why, as they seem to be the same as the ones in main?

LennartKloppenburg avatar Nov 16 '23 10:11 LennartKloppenburg

Thanks, @LennartKloppenburg , I'll check once it's rebased and tests are passing!

tatiana avatar Nov 16 '23 15:11 tatiana

Hey @LennartKloppenburg, please, when you have a chance, rebase this PR - would love to see this feature merged sooner than later.

tatiana avatar Dec 13 '23 14:12 tatiana

> Hey @LennartKloppenburg, please, when you have a chance, rebase this PR - would love to see this feature merged sooner than later.

Hi @tatiana , so sorry for leaving this! I will get back to it today or tomorrow :) Haven't had the capacity to continue, sorry!

LennartKloppenburg avatar Dec 13 '23 14:12 LennartKloppenburg

@tatiana Just completed the rebase; saw some artifacts that trip up the tests, will look at those tomorrow :)

LennartKloppenburg avatar Dec 17 '23 18:12 LennartKloppenburg

@tatiana I could use two more eyes on this (from the CI/CD tests):

FAILED tests/dbt/test_graph.py::test_load_via_dbt_ls_project_config_env_vars - cosmos.dbt.graph.CosmosLoadDbtException: Unable to find the dbt executable: dbt
FAILED tests/dbt/test_graph.py::test_load_via_dbt_ls_project_config_dbt_vars - cosmos.dbt.graph.CosmosLoadDbtException: Unable to find the dbt executable: dbt
FAILED tests/dbt/test_graph.py::test_load_via_dbt_ls_render_config_selector_arg_is_used - cosmos.dbt.graph.CosmosLoadDbtException: Unable to find the dbt executable: dbt

When I run the tests locally they pass; maybe I missed something while rebasing? I rebased so much that I no longer know where it was introduced :D Thanks in advance!!

LennartKloppenburg avatar Dec 20 '23 11:12 LennartKloppenburg

Hi @LennartKloppenburg ! I'm sorry for the massive delay, I've been working on other projects and it has been hard to keep up with everything. I'm planning to get back to this PR next week, so we can try to release it as part of Cosmos 1.5

tatiana avatar May 10 '24 15:05 tatiana

Hi @LennartKloppenburg ! I'm very sorry for the very long delay. I solved all the conflicts, and all the tests seem to be passing in #1079 - which is a copy of your PR, with the additional changes:

  1. Rebase
  2. Resolve conflicts
  3. Fix tests

If you are happy with the proposed changes, please feel free to incorporate them into your PR. I'd like us to merge this to have alpha versions and validate this change before the 1.6 release.

tatiana avatar Jul 05 '24 09:07 tatiana

We took this to completion in #1079, giving credit to @LennartKloppenburg and this original PR.

tatiana avatar Aug 16 '24 11:08 tatiana