Add support for ephemeral volume claims to kubernetes/argo
As defined here: https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes
This would allow a single step to have a dynamically attached "ephemeral volume" dedicated to itself (rather than a PVC that needs to be created before running the job).
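Concretely, each step's pod would get a volume entry following the generic ephemeral volume shape from the docs linked above, plus a matching volumeMount on the container. A rough sketch of that shape (storage class and request size here are illustrative, not the PR's actual defaults):

```python
# Sketch of the generic ephemeral volume entry added to the pod spec, per the
# Kubernetes docs linked above. Storage class and request size are illustrative.
ephemeral_volume = {
    "name": "my-temp-volume",
    "ephemeral": {
        "volumeClaimTemplate": {
            "spec": {
                "accessModes": ["ReadWriteOnce"],
                "storageClassName": "gp3",  # illustrative; cluster default is used if omitted
                "resources": {"requests": {"storage": "10Gi"}},
            }
        }
    },
}

# ...and the corresponding mount on the step's container:
volume_mount = {"name": "my-temp-volume", "mountPath": "/my_temp_volume"}
```

Because the claim is owned by the pod, it is created when the step's pod starts and deleted when the pod goes away, which is what gives each step its own dedicated scratch space.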
Tested:
With the hello-cloud.py example:
$ git diff
diff --git a/metaflow/tutorials/05-hello-cloud/hello-cloud.py b/metaflow/tutorials/05-hello-cloud/hello-cloud.py
index 20fcfe6..5240e16 100644
--- a/metaflow/tutorials/05-hello-cloud/hello-cloud.py
+++ b/metaflow/tutorials/05-hello-cloud/hello-cloud.py
@@ -1,4 +1,5 @@
from metaflow import FlowSpec, step, kubernetes, retry
+import time
class HelloCloudFlow(FlowSpec):
@@ -27,7 +28,7 @@ class HelloCloudFlow(FlowSpec):
self.next(self.hello)
- @kubernetes(cpu=1, memory=500)
+ @kubernetes(cpu=1, memory=500, ephemeral_volume_claims={"my-temp-volume": {"path": "/my_temp_volume"}})
@retry
@step
def hello(self):
@@ -41,6 +42,10 @@ class HelloCloudFlow(FlowSpec):
"""
self.message = "Hi from the cloud!"
print("Metaflow says: %s" % self.message)
+ with open("/my_temp_volume/my_file.txt", "w") as f:
+ f.write("hello_world!")
+ with open("/my_temp_volume/my_file.txt", "r") as f:
+ print("From file: %s" % f.read())
self.next(self.end)
@step
I've tested:
- python metaflow/tutorials/05-hello-cloud/hello-cloud.py run
- python metaflow/tutorials/05-hello-cloud/hello-cloud.py run --with kubernetes:ephemeral_volume_claims='{"my-temp-volume":{"path":"/my_temp_volume"}}'
- python metaflow/tutorials/05-hello-cloud/hello-cloud.py argo-workflows create + trigger

and verified that the flow runs successfully, and that the ephemeral volume is created / destroyed as intended.
Also, I tested:
- python metaflow/tutorials/05-hello-cloud/hello-cloud.py airflow create my_flow.py, which returns an error.
I haven't yet tested @parallel because I don't have access to a cluster with JobSet installed. If I did, I would run:
python test/parallel/parallel_test_flow.py run --with kubernetes:ephemeral_volume_claims='{"my-temp-volume":{"path":"/my_temp_volume"}}'
Also, can you help me with the scenarios you have been able to test (across @kubernetes and @parallel, locally as well as with argo-workflows and airflow) and their outputs? I am particularly curious about the unhappy paths: what happens when a lot of data (TBs) is written to the EBS volume, and how does that impact workload termination?
Additionally, the UX could potentially be simplified significantly, in line with the UX for persistent_volume_claims.
> Also, can you help me with the scenarios you have been able to test (across @kubernetes and @parallel, locally as well as with argo-workflows and airflow) and their outputs? I am particularly curious about the unhappy paths: what happens when a lot of data (TBs) is written to the EBS volume, and how does that impact workload termination?
I'll work on running through the scenarios below:
- @kubernetes directly
- @kubernetes with argo
- @parallel
I don't have access to airflow, so that one will be harder to test
Re: impact on workload termination: is your concern that the step will fail to terminate if a lot of data is still being written when the termination request comes in?
> Additionally, the UX could potentially be simplified significantly, in line with the UX for persistent_volume_claims.
Are you thinking of having ephemeral_volumes just be a dict[str, str] from name to path? I wasn't sure whether users would want to customize the storage class or other parameters in spec, which is why I made them optional.
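For concreteness, the two shapes under discussion would look roughly like this (Option 1 is the hypothetical simplified UX mirroring persistent_volume_claims; the extra keys in Option 2 are illustrative names for the optional spec overrides):

```python
from metaflow import FlowSpec, step, kubernetes


class VolumeUXFlow(FlowSpec):
    # Option 1: a plain dict[str, str] from claim name to mount path,
    # mirroring the persistent_volume_claims UX (hypothetical).
    @kubernetes(cpu=1, memory=500,
                ephemeral_volume_claims={"my-temp-volume": "/my_temp_volume"})
    @step
    def start(self):
        self.next(self.fine_tuned)

    # Option 2: the shape in this PR, where keys beyond "path" would be
    # optional overrides for the generated claim spec (key names illustrative).
    @kubernetes(cpu=1, memory=500,
                ephemeral_volume_claims={
                    "my-temp-volume": {
                        "path": "/my_temp_volume",
                        "storage_class_name": "gp3",
                        "storage": "100Gi",
                    }
                })
    @step
    def fine_tuned(self):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    VolumeUXFlow()
```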
UPDATE: I added the tests I've run through to the PR description
Tested this with @parallel and it seems to be working as expected, so we can cross that off the list.
I'm going back and forth regarding the UX with spec:
On one hand, exposing the controls seems like a valid use case for fine-tuning the ephemeral storage. On the other hand, it makes using the decorator rely on some Kubernetes implementation details (namely, what is accepted in spec), which makes usage cumbersome.
So far I'm slightly leaning towards passing the raw spec, due to easier maintainability.
> On the other hand, it makes using the decorator rely on some Kubernetes implementation details (namely, what is accepted in spec), which makes usage cumbersome.
At a minimum, I think you'd need:
- mount path
- storage class
- request size
FWIW this API has been "stable" since 1.25
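If we went with just those three knobs, the expansion into the generated claim template could stay fairly small; a minimal sketch (helper name and defaults are hypothetical):

```python
# Hypothetical helper: expand storage class + request size into the
# volumeClaimTemplate of a generic ephemeral volume. The mount path stays on
# the container side as a volumeMount ({"name": ..., "mountPath": ...}).
def build_claim_template(storage_class: str, request_size: str) -> dict:
    return {
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": request_size}},
        }
    }
```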
@saikonen @savingoyal friendly ping here. I've rebased, and this should be ready to go.