
Backlog: add ModuleStorage to support packaging flow code along with flow code dependencies

anna-geller opened this issue 3 years ago · 10 comments

Description

Request from Club42 member:

Malthe_Karbo @Malthe_Karbo: Hi, are there any plans to bring back 'module' storage as in Prefect 1.0?

We currently package all our flows into a single Docker image. From what I read in the documentation, we have to use remote storage (S3 for us) with the Kubernetes flow runner, even though both our agent and our flow runners have access to the flow.

When we were on 1.0, we just used module-path imports, and it worked really well and minimized points of failure (e.g., bad IAM setup, network issues).

Anna_Geller @Anna_Geller: I'm not sure about Module storage, but we plan to add some form of Git-based storage, e.g. GitHub. Would that be sufficient for your use case? Assuming that most users version-control their flows and deployments, I wonder whether GitHub storage is even more convenient than Module storage, since it allows you to separate flow code from other module dependencies.

Malthe_Karbo @Malthe_Karbo: S3 would also work for us; however, both Git and S3 rely on external storage, and that is the restriction I don't understand. It seems like a limitation that container jobs cannot run without access to some external storage for flow code.

Our jobs always have the correct version of the flow installed in the job's Python runtime (as well as our other libraries). The image is built on merge to the central branch in our CI/CD pipeline and pushed to our container registry, and then the Prefect Cloud deployments are updated to point to the new image.

In 1.0 we did the same, except there we could use Module storage to say it should just import the flow from e.g. "pipelines.flows.flow_1" instead of going to github.com/myorg/mymonorepo/pipelines/flows/flow_1.py or s3://myflowstoragebucket/, if that makes sense? 🙂

We always execute jobs from k8s, if that adds any context
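Conceptually, Prefect 1.0's Module storage was little more than a dotted import path resolved at run time. A minimal sketch of the idea, independent of Prefect (the `load_flow` helper and the `"module:attr"` convention here are illustrative, not Prefect API):

```python
import importlib

def load_flow(import_path: str):
    """Resolve an object from a "package.module:attribute" import path,
    the way module storage locates a flow that is already installed
    in the job's Python runtime (no S3/Git download involved)."""
    module_name, _, attr = import_path.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr)

# Demo with a stdlib module; with flows baked into the image this would
# look like load_flow("pipelines.flows.flow_1:flow") instead.
sqrt = load_flow("math:sqrt")
print(sqrt(9.0))  # 3.0
```

Because resolution is a plain import, it fails fast inside the job if the image and the flow version are out of sync, with no network dependency.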

Anna_Geller @Anna_Geller: Prefect 2.0 no longer requires DAG preregistration. You can point to your flow file in your deployment spec, and Prefect will download it at runtime rather than using flow code baked (almost "hardcoded") into the image. This gives you much greater flexibility:

  - Your flow can change without recreating the deployment: you commit changes to your Git repository or put the file in S3, and you don't need to perform the costly operation of rebuilding all code dependencies along with your flow.
  - In our experience with Prefect users, code dependencies usually change much less frequently than flow code. Separating concerns lets you keep your Docker image or Kubernetes job definition fairly static while the flow code remains fully dynamic, without costly redeployments.
  - It significantly reduces the latency of deploying new flow versions.
  - It often reduces users' costs, since fewer image versions need to be stored (images with all dependencies can range from 100 MB up to several GB, while flow code is a lightweight single Python file).

I can open an issue for a ModuleStorage block in the backlog, but I hope the above explanation clarifies our product thinking and why we think that separating concerns here is beneficial.
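The separation described above can be made concrete with a sketch of a Prefect 2 deployment manifest. Field names follow the later 2.x `deployment.yaml` layout, and the bucket, image, and entrypoint values are made up for illustration:

```yaml
# Sketch of a Prefect 2 deployment manifest (values are illustrative).
# The image holds only dependencies and is rarely rebuilt; the flow file
# under `entrypoint` is pulled from storage at run time.
name: etl
storage:
  bucket_path: myflowstoragebucket/etl   # S3 storage block
entrypoint: flows/etl_flow.py:etl_flow   # downloaded per run, not baked in
infrastructure:
  type: kubernetes-job
  image: myorg/prefect-deps:stable       # dependencies only
```

Editing the flow file and re-uploading it changes behavior on the next run without touching the image.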

anna-geller · Jun 11 '22 23:06

Absolutely agree with Anna's comment! A couple of remarks from a release-management perspective, though: storing flow code in S3 or e.g. GitHub instead of in a container image automatically creates different release cadences for flow code and dependencies. That can be useful when your dependencies have little to do with Prefect, and even if you have some "utility" tasks that are re-used among different flows, you can add those to the container for convenience (I would consider that good practice). However, if you have a tasks module (containing too many tasks to fit in one flow file) that is tightly coupled to the flow structure (i.e. where a change to a task definition implies a change to the flow and vice versa), it is not always useful to store your flow code somewhere else. In fact, having one artifact (in this case a container) that contains all the components you need (in some cases I have even found it useful to add trained model artifacts to the container) makes it easier to reason about what ran where at which moment in time. Does that make sense?

MatthiasRoels · Jun 12 '22 18:06

Alternatively, it could also be useful to do something similar to https://github.com/anna-geller/packaging-prefect-flows/blob/master/flows_no_build/docker_script_kubernetes_run_custom_ecr_image.py in Orion…

MatthiasRoels · Jun 12 '22 19:06

I was also trying to find if Orion would bring back module storage and stumbled on this thread.

It makes sense to avoid module storage when you're using containerization and want to minimize redeployment time, but I was wondering if someone could give advice on my use case. For my setup:

  1. My workflow library depends on the package it is stored in
  2. I want to access orchestration features on an entirely local setup (local server, agent, and cluster)
  3. I want to avoid users setting up their own external storage

For point 1, I'm currently maintaining a package that subclasses Prefect's Flow class (here) and contains a library of dynamically built workflows. Subclassing lets me add higher-level features such as the run_cloud method:

from simmate.workflows import example_flow

state = example_flow.run(...)  # use run method because of prefect v1
state = example_flow.run_cloud(...)  # registers to prefect cloud and (optionally) waits for result

There is other functionality like automated flow registration (or automated 'deployment' in v2), but that's elsewhere in the repo. So my workflow library is actually dependent on the repo itself.

For point 2, I frequently carry out orchestration entirely on my local computer (a local prefect agent + dask cluster + a running prefect server) with the same conda env. An example of this is running 5k+ subworkflows that are part of a larger analysis, and determining when to submit new flow runs is reliant on prefect's orchestration API (e.g. one analysis makes sure there are always 10 flow runs submitted/active of a given flow type). There are times I'd like to run this analysis without internet access too.

And for point 3, when others install my package, there's really no need to set up an S3 bucket -- they have everything in their conda environment already. There are even cases on academic/national HPC clusters where connecting to external storage isn't possible (outgoing connections are not allowed, or need to go through a vetting process). Requiring users to set up an S3 bucket or use GitHub URLs is just extra headache, and it raises the barrier to entry. That higher barrier to entry applies to any Orion user -- in fact, I couldn't get through the tutorials because storage setup isn't possible with the service I have experience in.

Is there a way to meet these needs without module storage? I thought about hosting a bucket that all users can access, or alternatively having deployment specs always point to my parent GitHub repo. But I'd like to avoid these solutions because they create new issues, add more failure points, and still depend on an internet connection.
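The orchestration pattern from point 2 above (always keeping N flow runs of a given type in flight) can be sketched without any Prefect APIs; `submit_flow_run` below is a hypothetical stand-in for whatever call creates a flow run via the orchestration API:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_with_cap(jobs, submit_flow_run, max_active=10):
    """Submit every job, but keep at most `max_active` runs in flight,
    topping the pool back up each time a run finishes."""
    results = []
    pending = set()
    with ThreadPoolExecutor(max_workers=max_active) as pool:
        for job in jobs:
            pending.add(pool.submit(submit_flow_run, job))
            if len(pending) >= max_active:
                # Block until at least one run finishes, then refill.
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                results.extend(f.result() for f in done)
        done, _ = wait(pending)  # drain the remaining runs
        results.extend(f.result() for f in done)
    return results
```

For 5k+ subworkflows this kind of loop is what makes a local API server valuable: the cap is enforced against live run state, not a static schedule.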

jacksund · Jul 03 '22 23:07

Thanks for sharing your use case. I totally understand what you mean; external storage is not always desirable. It looks like you could add your custom modules to your PYTHONPATH if everything runs locally on the same machine. Flow storage is a bit different from code dependencies, though; for that you can leverage Local storage blocks if everything is on a single VM. External storage such as S3 is only required for the Docker and Kubernetes flow runners, because by default containers cannot access flow code living outside the containers/pods.
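Assuming the flow code is importable in the runtime environment, "storage" reduces to the normal import machinery. A small self-contained demo (the module name is made up; in practice you would `pip install` the package or extend `PYTHONPATH` before launching the agent):

```python
import sys
import tempfile
from pathlib import Path

# Write a throwaway module to disk to stand in for an installed package.
pkg_dir = Path(tempfile.mkdtemp())
(pkg_dir / "my_local_flows.py").write_text("def greet():\n    return 'hello'\n")

# Equivalent to prepending the directory to PYTHONPATH before launch.
sys.path.insert(0, str(pkg_dir))

import my_local_flows  # now resolvable like any installed module

print(my_local_flows.greet())  # hello
```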

We're going to release improvements to storage and deployments next week. I'd encourage you to check that out and see whether the new release better serves your use case; the documentation will be updated too.

anna-geller · Jul 04 '22 00:07

Thanks for the quick reply!

you could add your custom modules to your PYTHONPATH

Yep! If I'm understanding you correctly, I do this already. I pip install my package so all custom modules and workflows are accessible in the python path and can be imported.

External storage such as S3 is only required for Docker and Kubernetes flow runners

Is there a way to make this tutorial work without S3 configuration? The tutorial uses SubprocessFlowRunner, but I got an error when I tried local storage instead of S3.

I'd encourage you to check that out and see if the new release better serves your use case

Absolutely. Because of how reliant my package is on prefect storage, I'll definitely check the new release out once it's available. Is there a PR/issue I can follow along with so I know when it's released?

jacksund · Jul 04 '22 00:07

SubprocessFlowRunner should work fine with Local storage. If not, please open a separate issue or start a thread in Community Slack: https://prefect.io/slack

To get notified about any new release, you can subscribe to this Discourse tag: https://discourse.prefect.io/tag/release-notes

anna-geller · Jul 04 '22 00:07

Awesome, thank you! I'll test the functionality again tomorrow and open an issue if needed -- maybe I made an error.

jacksund · Jul 04 '22 01:07

I'm not up to date on all of this thread, but the next release will support storing only a reference to the flow's import path in the Prefect database, which should address the cases you've described, Jack. It'll be accessible as an ImportSerializer mode for the deployment.

zanieb · Jul 04 '22 01:07

My use case is similar to @MatthiasRoels's.

My flow is part of a larger code base built as a private Python package, with various (sub)modules in multiple files, which it needs access to during execution as a Kubernetes job. The Prefect server is installed on a (single) on-prem k3s node; no cloud involved (which is a rather smooth setup, by the way, although it might seem like overkill).

The flow definition itself is thin, calling out to other modules, and hardly changes at all, whereas the Python package and its modules change frequently.

Hence, for every release a container image is built containing both the dependencies and the package, including the flow itself. This is the only way I've found so far in 1.x to have dependencies available at all and to retain consistency between the flow and its dependencies.
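The release workflow described here (one image per release carrying dependencies, the package, and the thin flow definition) can be sketched as a Dockerfile; the paths and package layout are placeholders:

```dockerfile
# Sketch only: paths and package names are placeholders.
FROM python:3.10-slim

# Dependencies change rarely; install them first so this layer caches.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# The private package, including the thin flow definition, is rebuilt
# into the image on every release, keeping flow and dependencies in sync.
COPY . /opt/mypackage
RUN pip install --no-cache-dir /opt/mypackage
```

Because flow and dependencies ship in the same layer, a single image tag identifies exactly what ran, at the cost of a full rebuild per flow change.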

Additionally, I cannot use any remote cloud storage, so I'd welcome a way to avoid requiring it.

But perhaps, I am not using prefect as intended and failing to see an obvious solution :)

UnderTheCarpet · Jul 23 '22 18:07

Thanks @clamydo, we understand. We are currently in the process of improving that UX; follow our announcements on Slack and Discourse to stay up to date.

anna-geller · Jul 23 '22 22:07

@madkinsz @anna-geller I see that prefect.packaging.serializers.ImportSerializer still exists in the source, but I can't find any working examples for it. Does that functionality work with the new deployment paradigm, or did it only work for packaging flows via manifests during the beta?

Unless I'm missing something, the only way I can see to replicate ModuleStorage with the latest release (2.0.4) is to use local storage and manually update the path/entrypoint in the generated deployment YAML to match where the code exists in our runtime environment.
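That workaround amounts to editing two fields of the generated manifest so they resolve to code already present in the runtime environment; the field layout follows the 2.x `deployment.yaml` format and the values are illustrative:

```yaml
# Generated deployment.yaml, edited by hand (values are illustrative).
# With no remote storage block, `path` + `entrypoint` must resolve to
# where the code already lives inside the runtime image.
storage: null
path: /opt/mypackage                       # code location in the container
entrypoint: pipelines/flows/flow_1.py:flow_1
```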

I appreciate all the work being put into v2.0, and my team really wants to try out some of the new features, but the lack of an equivalent of ModuleStorage is really perplexing. Deploying code + dependencies via Docker images is a pretty standard software deployment pattern, and doing things totally differently for our Prefect-based code introduces a lot of friction.

greenlaw · Aug 11 '22 19:08

Does that functionality work with the new Deployment paradigm, or did that only work for packaging flows via manifests during beta?

Not yet. We'll be investigating ways to reincorporate functionality from the beta into the current deployments in the near future.

zanieb · Aug 11 '22 20:08

Module storage is already supported: dependencies are now packaged by default alongside the flow when you use remote storage blocks. Check out https://discourse.prefect.io/t/deployments-are-now-simpler-and-declarative/1255

anna-geller · Aug 11 '22 20:08

Additionally, baking flow code into a Docker image will be possible very soon (within the next 1-2 weeks). I'm closing the issue because of that; the Docker storage recipe for baking flow code into the image alongside dependencies will be linked on the same Discourse topic.

anna-geller · Aug 11 '22 20:08

Thanks for the responses. I don't think it's correct to equate Prefect 1.0's ModuleStorage with the remote storage options in Prefect 2.0. Prefect 1.0 allowed us to manage our own Docker image build/deployment process and simply specify our flow's Python import path during flow registration; basically, we could use our existing CI/CD process and just link it to Prefect for flow orchestration. Prefect 2.0 seemingly wants to exert more control over the way our code and dependencies are stored and accessed during flow runs. I'm glad to hear progress is being made, but I'm wondering about the details of the planned Docker-based solutions. I just want a way to keep controlling our CI/CD process outside of Prefect while still allowing Prefect integration.

greenlaw · Aug 11 '22 20:08

manage our own docker image build/deployment process and simply specify our flow's Python import path during flow registration

It will work exactly the same way! Follow the next releases.

anna-geller · Aug 11 '22 22:08