Implement Task Dependent Runtime Environments for Workers
First check
- [X] I added a descriptive title to this issue.
- [X] I used the GitHub search to find a similar request and didn't find it.
- [X] I searched the Prefect documentation for this feature.
Prefect Version
2.x
Describe the current behavior
Assume your Flows are stored remotely in a git repository. Your Deployments know how to fetch the newest code for a Flow before each and every Flow Run. They’re deployed to Work Pools, and Workers handle the work in their assigned Work Pool.
Flows are developed independently of each other. Each Flow is developed in Python and, once it is finished, a requirements.txt file is generated for the project. Every single Flow can be expected to use a unique set of requirements in order to function properly. Maybe:
- Flow 1 Dependencies: (pandas==1.0, requests==2.1, selenium==1.4)
- Flow 2 Dependencies: (numpy==1.2, requests==2.3, selenium==1.4)
This implies that each Flow should have its own runtime environment to satisfy its unique set of dependencies.
However, neither the Prefect Work Pool nor the Prefect Worker seems to provide any way to adjust the runtime environment based on the task.
The Prefect Work Pool does offer configuration options for:
- The Working Directory where workers will begin their work. This is where the git_clone step will occur as the Worker fetches the most up-to-date Flow before the Flow Run.
- A Command to run in a separate process prior to initiating the Flow Run.
With these configuration options, I see no way to get a Worker to install the correct library dependencies for the task, or to enter a premade venv (virtual environment).
As it stands, I only see a way to pre-build Workers with foreknowledge of all the dependencies their assigned Work Pool’s Flows will have. The environment hosting the Worker would first install all those dependencies and, only afterward, initiate the Worker process.
However, that solution would imply that all Flows must be deployed to Work Pools based on having compatible library dependencies. Alternatively, a user could build a unique Work Pool and Worker for every single Task. Either way, I believe this unnecessarily increases the complexity required to use Prefect. It also seems counterintuitive to the idea that a Work Pool should contain many Tasks.
Describe the proposed behavior
Workers should have a way to perform Flow-specific behavior prior to initiating a Flow Run. Maybe this means allowing the Worker to cd into the Flow’s project directory prior to initiating the Flow Run. Then, the Command configuration could reference a requirements.txt file within the Flow’s project directory. The command could be built to do any number of things, such as:
- Create a venv, activate the venv, and install the dependencies (then automatically execute the Flow Run within the venv).
- Alternatively, maybe check requirements.txt against a cache for changes since the last flow run. If no changes, use the last created venv. Otherwise, recreate the venv. In this case, the venv would need to be stored outside the project directory but still associated with the Flow somehow.
- …
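As a rough illustration of the second option, here is a minimal sketch of a requirements-keyed venv cache. It is not an existing Prefect feature. The names `ensure_venv`, `build_venv`, `requirements_digest`, the `.ready` marker file, and the cache layout `cache_root/<flow_name>/<digest>` are all hypothetical, chosen only to show how a venv could live outside the project directory yet stay associated with a Flow.

```python
import hashlib
import subprocess
import sys
import venv
from pathlib import Path


def requirements_digest(requirements: Path) -> str:
    """Return a short SHA-256 digest of the requirements file contents."""
    return hashlib.sha256(requirements.read_bytes()).hexdigest()[:16]


def build_venv(env_dir: Path, requirements: Path) -> None:
    """Create a fresh venv at env_dir and pip-install the requirements into it."""
    # clear=True wipes any half-built environment left by an earlier failure.
    venv.EnvBuilder(with_pip=True, clear=True).create(env_dir)
    if sys.platform == "win32":
        python = env_dir / "Scripts" / "python.exe"
    else:
        python = env_dir / "bin" / "python"
    subprocess.run(
        [str(python), "-m", "pip", "install", "-r", str(requirements)],
        check=True,
    )


def ensure_venv(flow_name: str, requirements: Path, cache_root: Path,
                build=build_venv) -> Path:
    """Return a venv directory matching the current requirements file.

    Assumed (hypothetical) layout: cache_root/<flow_name>/<digest>.
    The venv is rebuilt only when the digest of requirements.txt changes.
    """
    env_dir = cache_root / flow_name / requirements_digest(requirements)
    marker = env_dir / ".ready"  # written only after a successful build
    if not marker.exists():
        build(env_dir, requirements)
        marker.write_text("ok")
    return env_dir
```

Keying the directory on the file digest means an unchanged requirements.txt reuses the last venv, while any edit produces a new one. Old environments would still need some cleanup policy, which this sketch leaves out.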
This would put more power in the users’ hands to architect their use of Prefect how they see fit.
Example Use
If I am recalling correctly, the git_clone step results in a directory matching the schema $WORKDIR/{repo_name}-{branch_name}/your_flow_stuff
If there were a way for the Work Pool’s Command subprocess to see the {repo_name}-{branch_name} directory that holds the current Flow, then maybe the Command could reference it. For example, if it were an environment variable:
Command: cd $CURRENT_FLOW_DIR && exec entrypoint.sh
… where entrypoint.sh handles all the dependency management goodies and eventually activates the appropriate venv for the Worker.
The “dependency management goodies” could include anything from just recreating the venv on every run to some kind of cache checking and venv retrieval service.
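To make the idea concrete, here is a small sketch of how the environment for the Command subprocess might be assembled. It assumes the clone directory schema recalled above ($WORKDIR/{repo_name}-{branch_name}) and uses the proposed CURRENT_FLOW_DIR variable, which does not exist in Prefect today. The helper names `clone_dir_name` and `command_environment` are hypothetical.

```python
import os
from pathlib import Path
from typing import Optional


def clone_dir_name(repo_url: str, branch: Optional[str] = None) -> str:
    """Derive the clone directory name, assuming the {repo_name}-{branch_name} schema."""
    name = repo_url.rstrip("/").rsplit("/", 1)[-1]
    if name.endswith(".git"):
        name = name[: -len(".git")]
    return f"{name}-{branch}" if branch else name


def command_environment(workdir: str, repo_url: str, branch: Optional[str],
                        base_env: Optional[dict] = None) -> dict:
    """Return a copy of the environment with CURRENT_FLOW_DIR added.

    CURRENT_FLOW_DIR is the variable proposed in this issue, not a real
    Prefect setting.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CURRENT_FLOW_DIR"] = str(Path(workdir) / clone_dir_name(repo_url, branch))
    return env
```

With something like this in place, the resulting mapping could be passed as the env argument of subprocess.run when launching the Command, so that `cd $CURRENT_FLOW_DIR` resolves to the freshly cloned Flow.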
Additional context
No response
I think this can be accomplished when you're building the deployment -- either by building a custom image that contains the flows' dependencies or by providing extra configuration to install the correct dependencies at runtime. Check out this section in the docs for more info on how to use those features depending on if you're creating the deployment via a yaml file or programmatically.
In addition - @urimandujano your forthcoming work on flow run level infra overrides will provide yet another, more granular, interface to accomplish this. Going to close out