Add config option for lifetime of Kubernetes jobs
This PR makes it configurable how long Kubernetes jobs are kept after they have completed.
Testing Done
script without kubernetes decorator:
from metaflow import FlowSpec, step, kubernetes

class TestTTLFlow(FlowSpec):
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        print("TestTTLFlow is all done.")

if __name__ == "__main__":
    TestTTLFlow()
- python flow.py run --with kubernetes => use ttl of 604800
- python flow.py run --with kubernetes # with METAFLOW_KUBERNETES_JOB_TTL_SECONDS_AFTER_FINISHED=100 in config.json => use ttl of 100
- python flow.py run --with kubernetes:ttl_after_finished=300 => use ttl of 300
- METAFLOW_KUBERNETES_JOB_TTL_SECONDS_AFTER_FINISHED=400 python flow.py run --with kubernetes => use ttl of 400
script with decorator:
from metaflow import FlowSpec, step, kubernetes

class TestTTLFlow(FlowSpec):
    @step
    def start(self):
        self.next(self.end)

    @kubernetes(ttl_after_finished=20)
    @step
    def end(self):
        print("TestTTLFlow is all done.")

if __name__ == "__main__":
    TestTTLFlow()
- python flow.py run => use ttl of 600
- METAFLOW_KUBERNETES_JOB_TTL_SECONDS_AFTER_FINISHED=400 python flow.py run --with kubernetes => use ttl of 400 for the first step, ttl of 600 for the kubernetes step
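For reviewers, the precedence that the runs above appear to exercise can be sketched roughly as follows. This is an illustration only, not the code in this PR: resolve_ttl and its arguments are hypothetical names, and ranking the environment variable above config.json is an assumption, since none of the runs above sets both at once.

```python
import os

# Default observed in the plain `--with kubernetes` run above (7 days).
DEFAULT_TTL = 604800
KEY = "METAFLOW_KUBERNETES_JOB_TTL_SECONDS_AFTER_FINISHED"


def resolve_ttl(decorator_value=None, environ=None, config=None):
    """Hypothetical helper: pick the TTL from the most specific source.

    Assumed order: decorator / --with attribute, then environment
    variable, then config file, then the default.
    """
    environ = os.environ if environ is None else environ
    config = {} if config is None else config
    if decorator_value is not None:
        return int(decorator_value)
    if KEY in environ:
        return int(environ[KEY])
    if KEY in config:
        return int(config[KEY])
    return DEFAULT_TTL


if __name__ == "__main__":
    # Mirrors the four runs of the script without the decorator.
    print(resolve_ttl(environ={}, config={}))
    print(resolve_ttl(environ={}, config={KEY: "100"}))
    print(resolve_ttl(decorator_value=300, environ={}, config={}))
    print(resolve_ttl(environ={KEY: "400"}, config={}))
```

Passing environ and config explicitly keeps the sketch testable without touching the real environment.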
Testing[329] @ e3a9cebdbbde424e9e34ad666b132d5673733221 had 6 FAILUREs.
@romain-intel I suspect the @nflx-mf-bot is executing internal tests. Can I see the output anywhere?
Hey @derfred, thanks for looking into this. There are definitely use cases for limiting how long a K8s job is kept in the cluster after its completion.
Options related to Kubernetes can be expressed in a few different ways in Metaflow:
- Env variable
- Config file
- Decorators
- CLI option
Can you make sure that all of the above are covered in the PR? For reference, take a look at this relatively recent PR that added support for K8s tolerations.
Minor comments:
- Given that the K8s config option is called ttl_seconds_after_finished, would KUBERNETES_JOB_LIFETIME be confusing? Why not simply call the config option KUBERNETES_JOB_TTL_SECONDS_AFTER_FINISHED?
Also, it would help the review immensely if you included the testing that was done with the code changes (see the same PR above as an example).
@derfred: you can ignore the failures. Yes, they are internal tests, but the failures in this case are ignorable (I am removing that test).
@derfred - Is there any reason for the TTL to be only scoped to Kubernetes Jobs and not pods?
@shrinandj I've added the additional configuration options and detailed the local testing I have done. Could you have another look at it?
@savingoyal In my testing, the pods are cleaned up automatically by the Kubernetes garbage collector when the job is deleted.
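For context, here is a minimal sketch of where the TTL sits in a Job manifest, written as a plain Python dict. ttlSecondsAfterFinished is a field of the Job spec in the batch/v1 API; the Pod spec has no equivalent field, and pods created by a Job are removed along with it through their owner references. The job_manifest function and its arguments are hypothetical names for illustration and are not how Metaflow builds its jobs.

```python
def job_manifest(name, image, ttl_seconds):
    """Hypothetical helper: build a minimal batch/v1 Job manifest."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Seconds to keep the finished Job before the TTL controller
            # deletes it; its pods are deleted with it.
            "ttlSecondsAfterFinished": ttl_seconds,
            "template": {
                "spec": {
                    "containers": [{"name": name, "image": image}],
                    "restartPolicy": "Never",
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = job_manifest("ttl-demo", "python:3.11", 300)
    print(manifest["spec"]["ttlSecondsAfterFinished"])
```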
Any update on this? Would love to configure the job lifetime!