Support simple dumping of Argo manifests
Support a --dump-manifests flag or subcommand that only constructs the argo-workflows manifest(s).
We use ArgoCD extensively for basically all our resources. I'm not very keen on using the existing argo-workflows create command to deploy directly. I'd rather dump the manifests out into a kustomization app that ArgoCD then evaluates for actual deployment.
I know about the --only-json flag, but it seems clear it's trying to access and modify cluster things:
❯ METAFLOW_DATASTORE_SYSROOT_S3=s3://s3.my-on-prem-cluster.intra.example.com/argo/workflows python test_metaflow.py --datastore=s3 argo-workflows create --only-json --namespace argo
Metaflow 2.15.4 executing ParameterFlow for user:milesg
Validating your flow...
The graph looks good!
Running pylint...
Pylint not found, so extra checks are disabled.
Deploying parameterflow to Argo Workflows...
It seems this is the first time you are deploying parameterflow to Argo Workflows.
A new production token generated.
The namespace of this production flow is
production:parameterflow-0-gtqo
To analyze results of this production flow add this line in your notebooks:
namespace("production:parameterflow-0-gtqo")
If you want to authorize other people to deploy new versions of this flow to Argo Workflows, they need to call
argo-workflows create --authorize parameterflow-0-gtqo
when deploying this flow to Argo Workflows for the first time.
See "Organizing Results" at https://docs.metaflow.org/ for more information about production tokens.
S3 access denied:
s3://s3.my-on-prem-cluster.intra.example.com/argo/workflows/ParameterFlow/data/ff/fffd58f887cd06e42fe5b841acba70b0781344dd
I could go down the road of providing proper access, but the point is I don't want it doing any of this; Metaflow will eventually exist on our cluster through a Helm installation (happy to provide a PR for a Helm chart pre-configured for on-prem installs later if this works out), and I only want the Argo workflow manifest(s), which I will then feed to ArgoCD like all our other resources/services.
I'm happy to dedicate some time to implement this if it doesn't already exist and I just missed it; I'd be willing to contribute the feature with guidance from the community.
I need this feature!!
Re-opening this after having spent some time using Metaflow. I'm quite convinced now that there ought to be some way to tie the state of my code base to what should be on the cluster.
Right now, if one adds, then removes a schedule or trigger_on_finish or even deletes a flow, those resources will become orphaned on the cluster from a prior deployment. It makes sense of course as there is nothing in Metaflow tying code state and the cluster resources together.
The only way I can see to do this, given the design of Metaflow, is to keep some representation of the code state and allow something like ArgoCD to reconcile the required updates.
I see that maybe even --only-json is going away? (https://github.com/Netflix/metaflow/pull/2339). It turns out I'm fine with --only-json uploading code to S3, as this is how Metaflow is designed to work with Argo Workflows, but I'm not fine with there being no way to represent the state changes expressed by the code - which can be done with --only-json.
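For illustration, here's a rough sketch of the CI step I have in mind. It assumes (not verified against Metaflow internals) that --only-json writes the rendered workflow template JSON, and nothing else, to stdout; the directory layout and flow discovery are made up for the example. The resulting files would live in git and be reconciled by ArgoCD / Flux.

```python
# Hypothetical CI helper: render each flow's Argo manifest with the existing
# --only-json flag and write it into a git-tracked directory for ArgoCD/Flux.
import json
import pathlib
import subprocess
import sys

FLOWS_DIR = pathlib.Path("flows")          # illustrative repo layout
MANIFESTS_DIR = pathlib.Path("manifests")  # directory tracked in git / kustomization

MANIFESTS_DIR.mkdir(exist_ok=True)

for flow_file in sorted(FLOWS_DIR.glob("*.py")):
    result = subprocess.run(
        [sys.executable, str(flow_file), "--datastore=s3",
         "argo-workflows", "create", "--only-json"],
        check=True, capture_output=True, text=True,
    )
    manifest = json.loads(result.stdout)  # assumes stdout is pure JSON
    out = MANIFESTS_DIR / f"{flow_file.stem}.json"
    out.write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n")
    print(f"wrote {out}")
```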
To be clear, I think it's great one can run ad-hoc / development flows in a development cluster that gets cleaned up regularly, but in production we need a more declarative approach.
I'm wondering: if there is no easy way to dump Argo Workflows YAML/JSON manifests, what is the recommended way to deploy Metaflow pipelines to a production environment? Manually from local machines? I, like I suppose many others, prefer a CI/CD + GitOps approach, and it's not clear that this can be done with the current Metaflow CLI.
@anozdryn-platnitski why isn't executing python flow.py argo-workflows create from CI an option?
why isn't executing python flow.py argo-workflows create from CI an option?
As mentioned in the opening issue here, resources become orphaned once flow.py changes; repeating myself here but, removing a cron schedule will not remove said cron schedule by running that command in CI again. There's a reason GitOps is a popular practice. Also, as with things like cdk8s transpiling into manifests, having the actual manifests stored in git gives an objective, plain-as-day look at what's happening, which is super helpful for knowing what changes are about to be applied by things like ArgoCD / Flux before merging a PR. Sure, you can deduce what might/should happen from code changes, but nothing beats seeing the manifest diffs.
With that said, this was such a deal breaker, we dropped Metaflow unfortunately.
Extended thoughts:
I think part of the pickle metaflow has gotten itself into here (at least in Argo world) is the requirement of uploading/downloading code packages from S3 or some storage layer which needs to be available at manifest rendering time. It really ought to just map the defined flow location in the manifests and invoke it at container runtime. Then there is no dependency on external storage, no garbage artifacts to clean up after. 'Only' a bit of importlib and you're on your way - just reuse the code the container will have anyway. We've been implementing our own in-house solution this way and it's going well - hope to open source it later this year. (Albeit a trimmed down version of Metaflow; not a competitor really - focus on only defining Argo workflows/sensors/schedules w/ a nice pythonic API)
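To sketch what I mean (the module/class names and CLI shape are purely illustrative, and this is not how Metaflow itself works): the manifest would only record the module path of the flow already baked into the image, and the container entrypoint would import and invoke it at runtime.

```python
# Illustrative container entrypoint: load the flow module from the image
# itself instead of downloading a code package from external storage.
import importlib
import sys

def run_flow(module_path: str, flow_class: str, *flow_args: str) -> None:
    """Import the flow module baked into the image and invoke its CLI entrypoint."""
    module = importlib.import_module(module_path)  # e.g. "flows.parameter_flow"
    flow = getattr(module, flow_class)             # e.g. "ParameterFlow"
    sys.argv = [module_path, *flow_args]           # forward remaining args to the flow's CLI
    flow()  # assumes flow classes are callable entrypoints, as Metaflow FlowSpec subclasses are

if __name__ == "__main__":
    # e.g. python entrypoint.py flows.parameter_flow ParameterFlow run
    run_flow(sys.argv[1], sys.argv[2], *sys.argv[3:])
```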
@anozdryn-platnitski why isn't executing python flow.py argo-workflows create from CI an option?
This is not an option due to security: my CI/CD agent would need access to the production k8s cluster in this case. That limitation matters for my organisation. Sad, because our ML engineers are happy to use Metaflow.
@milesgranger looking forward to seeing the product and trying it out.
As mentioned in the opening issue here, resources become orphaned once flow.py changes; repeating myself here but, removing a cron schedule will not remove said cron schedule by running that command in CI again
The cron schedule is removed when you run the command in CI again. We take special care (and go to great lengths) to ensure there are no orphaned resources. Also, many of these resource manifests are not entirely human-readable, so exposing them as an interface has limited utility. The use case of Metaflow is not to provide a Pythonic API for Argo Workflows (there are many excellent projects in that direction, like Hera) but to be an opinionated framework that offers Argo Workflows as a deployment target.
Also, it is technically possible to capture the POST requests to the k8s API server and dump them into a folder for further consumption by utilizing sitecustomize.py - this piece of code can be cleanly maintained and evolved separately according to the needs of your org.
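As a rough, unverified sketch of that idea - assuming the deployer's calls go through the official kubernetes Python client; the dump directory and file naming are illustrative - a sitecustomize.py along these lines would mirror POST/PUT bodies to disk:

```python
# sitecustomize.py -- speculative sketch: intercept Kubernetes API writes made
# via the official kubernetes Python client and mirror the request bodies to a folder.
import json
import os
import pathlib
import time

DUMP_DIR = pathlib.Path(os.environ.get("MANIFEST_DUMP_DIR", "dumped-manifests"))

try:
    from kubernetes.client import ApiClient

    _original_call_api = ApiClient.call_api

    def _dumping_call_api(self, resource_path, method, *args, **kwargs):
        body = kwargs.get("body")
        if method in ("POST", "PUT") and body is not None:
            DUMP_DIR.mkdir(parents=True, exist_ok=True)
            out = DUMP_DIR / f"{int(time.time() * 1000)}-{method.lower()}.json"
            out.write_text(json.dumps(self.sanitize_for_serialization(body), indent=2))
        return _original_call_api(self, resource_path, method, *args, **kwargs)

    ApiClient.call_api = _dumping_call_api
except ImportError:
    pass  # kubernetes client not installed; nothing to intercept
```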
I don't understand what that link is proving. It in fact addresses a scenario where people can end up with duplicate workflows because of naming. I'm quite confused by you linking to that since it seems to have nothing to do with what I just said.
It definitely does not remove orphaned cron workflows. Nor does it remove orphaned workflows themselves; unless you've also implemented an ArgoCD or Flux system here, it only generates manifests and applies them. I mean, literally, how could it clean up an orphaned workflow if removing the Python script is what's supposed to also remove the workflow(!?). So when removing a cron schedule or an entire workflow, there is no mechanism to remove those from the cluster.
To be clear, I'm not saying Metaflow should do this, only that what was conveyed in the original issue to dump manifests would allow Metaflow to also play nice with gitops style systems like ArgoCD and flux.
And say what you will about it not being meant as a Pythonic API into Argo, but it's already better than Hera - and Hera itself is aware of its API's shortcomings and has a proposal to improve it. Revisiting the Metaflow website, the whole thing seems geared toward how nice the API is for building workflows... which also target Argo... great news! But having static manifests would be the cherry on top! 👍
As a more explicit example -
- deploying this flow will generate a cron workflow (a minimal sketch of such a flow is included after this list)
- commenting out the @schedule decorator and deploying again will remove the cron workflow from step 1 while still preserving the workflow template
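For reference, a minimal flow of the shape being described here - the flow name and steps are placeholders; only the @schedule decorator is the point:

```python
from metaflow import FlowSpec, schedule, step

@schedule(daily=True)  # step 1: deploying with this in place generates a cron workflow
class ScheduledFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ScheduledFlow()
```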
maybe your definition of orphaned cron workflows is different from mine, but very pointedly this fact is not true -
As mentioned in the opening issue here, resources become orphaned once flow.py changes; repeating myself here but, removing a cron schedule will not remove said cron schedule by running that command in CI again.
Ignore for a moment the cron workflow. If you want to delete a workflow, how do you do that outside of manual intervention?
Should you take a moment to see that we're trying to help Metaflow improve, you'd see that one could generate manifests for all workflows; when one is deleted in code, ArgoCD / Flux would notice and delete it.
I can see this is going nowhere. I've tried to help, but I don't really care anymore since we've dropped Metaflow anyhow.