prefect Implement Task Dependent Runtime Environments for Workers

First check

[X] I added a descriptive title to this issue.
[X] I used the GitHub search to find a similar request and didn't find it.
[X] I searched the Prefect documentation for this feature.

Prefect Version

2.x

Describe the current behavior

Assume your Flows are stored remotely in a git repository. Your Deployments know how to fetch the newest code for a Flow before each and every Flow Run. They’re deployed to Work Pools, and Workers handle the work in their assigned Work Pool.

Flows are developed independently of each other. Each Flow is developed in Python and, once it is finished, a requirements.txt file is generated for the project. Every single Flow can be expected to use a unique set of requirements in order to function properly. For example:

Flow 1 Dependencies: (pandas==1.0, requests==2.1, selenium==1.4)
Flow 2 Dependencies: (numpy==1.2, requests==2.3, selenium==1.4)

This implies that each Flow should have its own runtime environment to satisfy its unique set of dependencies.
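To make the conflict concrete, here is a minimal sketch that compares the two pin sets above. The helper names `parse_pins` and `find_conflicts` are made up for illustration, and the parser only handles plain `name==version` lines, not the full requirements.txt syntax.

```python
def parse_pins(requirements):
    """Map package name -> pinned version for plain 'name==version' lines."""
    pins = {}
    for line in requirements:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        name, _, version = line.partition("==")
        pins[name.strip().lower()] = version.strip()
    return pins


def find_conflicts(pins_a, pins_b):
    """Return packages that both sets pin, but to different versions."""
    return {
        name: (pins_a[name], pins_b[name])
        for name in pins_a.keys() & pins_b.keys()
        if pins_a[name] != pins_b[name]
    }


flow_1 = parse_pins(["pandas==1.0", "requests==2.1", "selenium==1.4"])
flow_2 = parse_pins(["numpy==1.2", "requests==2.3", "selenium==1.4"])
conflicts = find_conflicts(flow_1, flow_2)
print(conflicts)  # {'requests': ('2.1', '2.3')}
```

Both Flows pin selenium to the same version, but the two requests pins cannot be satisfied in a single environment.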

However, neither the Prefect Work Pool nor the Prefect Worker seems to provide any way to adjust the runtime environment based on the task.

The Prefect Work Pool does offer configuration options for:

The Working Directory where workers will begin their work. This is where the git_clone step will occur as the Worker fetches the most up-to-date Flow before the Flow Run.
A Command to run in a separate process prior to initiating the Flow Run.

With these configuration options, I see no way to get a Worker to install the correct library dependencies for the task, or to enter a premade venv (virtual environment).

As it stands, the only way I see is to pre-build Workers with foreknowledge of all the dependencies that the Flows in their assigned Work Pool will have. The environment hosting the Worker would first install all those dependencies and only afterward start the Worker process.

However, that solution would imply that all Flows must be deployed to Work Pools based on having compatible library dependencies. Alternatively, a user could build a unique Work Pool and Worker for every single Task. Either way, I believe this unnecessarily increases the complexity required to use Prefect. It also seems counterintuitive to the idea that a Work Pool should contain many Tasks.

Describe the proposed behavior

Workers should have a way to perform Flow-specific behavior prior to initiating a Flow Run. Maybe this means allowing the Worker to cd into the Flow’s project directory prior to initiating the Flow Run. Then, the Command configuration could reference a requirements.txt file within the Flow’s project directory. The command could be built to do any number of things, such as:

Create a venv, activate the venv, and install the dependencies (then automatically execute the Flow Run within the venv).
Alternatively, maybe check requirements.txt against a cache for changes since the last flow run. If no changes, use the last created venv. Otherwise, recreate the venv. In this case, the venv would need to be stored outside the project directory but still associated with the Flow somehow.
…

This would put more power in users’ hands to architect their use of Prefect how they see fit.
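The cache-checking option from the list above could look roughly like the sketch below. It is not a Prefect feature. The function names, the cache layout, and the idea of keying the venv on a hash of requirements.txt are all assumptions made for illustration. It uses only the Python standard library plus pip inside the new venv.

```python
import hashlib
import subprocess
import sys
import venv
from pathlib import Path


def requirements_digest(requirements_file):
    """Hash the requirements file so any change yields a new cache key."""
    return hashlib.sha256(Path(requirements_file).read_bytes()).hexdigest()[:16]


def cached_venv_dir(flow_dir, cache_root):
    """Venv location outside the project dir, keyed by dir name and digest."""
    flow_dir = Path(flow_dir)
    digest = requirements_digest(flow_dir / "requirements.txt")
    return Path(cache_root) / f"{flow_dir.name}-{digest}"


def venv_python(venv_dir):
    """Path to the interpreter inside a venv (layout differs on Windows)."""
    if sys.platform == "win32":
        return Path(venv_dir) / "Scripts" / "python.exe"
    return Path(venv_dir) / "bin" / "python"


def ensure_venv(flow_dir, cache_root):
    """Reuse the cached venv if its interpreter exists; otherwise build it."""
    venv_dir = cached_venv_dir(flow_dir, cache_root)
    if not venv_python(venv_dir).exists():
        venv.create(venv_dir, with_pip=True)
        requirements = Path(flow_dir) / "requirements.txt"
        subprocess.run(
            [str(venv_python(venv_dir)), "-m", "pip", "install", "-r", str(requirements)],
            check=True,
        )
    return venv_dir
```

Because the digest is part of the directory name, an unchanged requirements.txt maps to the same venv on the next run, and an edited one maps to a fresh directory. A real version would also need to clean up old venvs and handle a pip install that fails halfway.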

Example Use

If I am recalling correctly, the git_clone step results in a directory matching the schema $WORKDIR/{repo_name}-{branch_name}/your_flow_stuff

If there were a way for the Work Pool’s Command subprocess to see the {repo_name}-{branch_name} directory that holds the current Flow, then maybe the Command could reference it. For example, if it were an environment variable:

Command: cd $CURRENT_FLOW_DIR && exec entrypoint.sh
… where entrypoint.sh handles all the dependency management goodies and eventually activates the appropriate venv for the Worker.

“Dependency management goodies” could include anything from just recreating the venv on every run to some kind of cache-checking and venv-retrieval service.
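As a rough sketch of what such an entrypoint might do, written in Python rather than shell: it reads the proposed `CURRENT_FLOW_DIR` variable, which does not exist in Prefect today, and prefixes a command with the interpreter of a venv chosen for that Flow. The function names and the example paths are hypothetical.

```python
import os
import sys
from pathlib import Path


def flow_dir_from_env(environ=os.environ):
    """Read the hypothetical CURRENT_FLOW_DIR variable proposed in this issue."""
    value = environ.get("CURRENT_FLOW_DIR")
    if not value:
        raise RuntimeError("CURRENT_FLOW_DIR is not set")
    return Path(value)


def build_command(venv_dir, args):
    """Prefix the given arguments with the venv's own interpreter."""
    if sys.platform == "win32":
        python = Path(venv_dir) / "Scripts" / "python.exe"
    else:
        python = Path(venv_dir) / "bin" / "python"
    return [str(python), *args]


if __name__ == "__main__" and "CURRENT_FLOW_DIR" in os.environ:
    flow_dir = flow_dir_from_env()
    os.chdir(flow_dir)
    # Placeholder venv location; a real entrypoint would pick or build it here.
    command = build_command(flow_dir / ".venv", sys.argv[1:])
    print("would run:", command)
```

Running the interpreter inside the venv directly has the same effect as activating the venv first, so the entrypoint does not need a separate activation step.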

Additional context

No response

Jan 18 '24 19:01 MrChadMWood

I think this can be accomplished when you're building the deployment -- either by building a custom image that contains the flows' dependencies or by providing extra configuration to install the correct dependencies at runtime. Check out this section in the docs for more info on how to use those features depending on if you're creating the deployment via a yaml file or programmatically.

Jan 18 '24 20:01 urimandujano

In addition - @urimandujano your forthcoming work on flow run level infra overrides will provide yet another, more granular, interface to accomplish this. Going to close out

Feb 28 '24 19:02 WillRaphaelson