This PR makes the lifetime of Kubernetes jobs after they have completed (their TTL) configurable.

Testing Done

script without kubernetes decorator:

from metaflow import FlowSpec, step, kubernetes

class TestTTLFlow(FlowSpec):
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        print("TestTTLFlow is all done.")

if __name__ == "__main__":
    TestTTLFlow()

python flow.py run --with kubernetes => use ttl of 604800
python flow.py run --with kubernetes # with METAFLOW_KUBERNETES_JOB_TTL_SECONDS_AFTER_FINISHED=100 in config.json => use ttl of 100
python flow.py run --with kubernetes:ttl_after_finished=300 => use ttl of 300
METAFLOW_KUBERNETES_JOB_TTL_SECONDS_AFTER_FINISHED=400 python flow.py run --with kubernetes => use ttl of 400

script with decorator:

from metaflow import FlowSpec, step, kubernetes

class TestTTLFlow(FlowSpec):
    @step
    def start(self):
        self.next(self.end)

    @kubernetes(ttl_after_finished=600)
    @step
    def end(self):
        print("TestTTLFlow is all done.")

if __name__ == "__main__":
    TestTTLFlow()

python flow.py run => use ttl of 600
METAFLOW_KUBERNETES_JOB_TTL_SECONDS_AFTER_FINISHED=400 python flow.py run --with kubernetes => use ttl of 400 for the start step and ttl of 600 for the end step (the one with the kubernetes decorator)
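
Taken together, the runs above suggest a precedence order: a value set on the decorator (or via --with kubernetes:ttl_after_finished=...) wins, then the environment variable, then config.json, then the default of 604800. Below is a minimal sketch of that resolution, not the actual code in this PR. The helper resolve_ttl and its arguments are hypothetical, and the relative order of the environment variable and config.json is an assumption, since none of the runs above sets both at once.

```python
import os

# Default TTL observed in the first test run (7 days, in seconds).
DEFAULT_TTL_SECONDS = 604800

# Name of the setting as used in the test commands above.
SETTING = "METAFLOW_KUBERNETES_JOB_TTL_SECONDS_AFTER_FINISHED"


def resolve_ttl(decorator_value=None, env=None, config=None):
    """Hypothetical helper: pick the TTL from the most specific source.

    decorator_value: value of ttl_after_finished on the step, or None.
    env: mapping of environment variables (defaults to os.environ).
    config: mapping loaded from config.json (defaults to empty).
    """
    env = os.environ if env is None else env
    config = {} if config is None else config

    # 1. An explicit decorator / --with attribute wins.
    if decorator_value is not None:
        return int(decorator_value)
    # 2. Then the environment variable (assumed to beat config.json).
    if SETTING in env:
        return int(env[SETTING])
    # 3. Then the config file.
    if SETTING in config:
        return int(config[SETTING])
    # 4. Otherwise fall back to the default.
    return DEFAULT_TTL_SECONDS
```

For example, resolve_ttl(decorator_value=600, env={SETTING: "400"}) picks the decorator value, while resolve_ttl(env={SETTING: "400"}) picks the environment variable, matching the last run above.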

Feb 12 '23 08:02 derfred

Testing[329] @ e3a9cebdbbde424e9e34ad666b132d5673733221

Feb 13 '23 07:02 nflx-mf-bot

Testing[329] @ e3a9cebdbbde424e9e34ad666b132d5673733221 had 6 FAILUREs.

Feb 13 '23 10:02 nflx-mf-bot

@romain-intel I suspect the @nflx-mf-bot is executing internal tests. Can I see the output anywhere?

Feb 13 '23 11:02 derfred

Hey @derfred, thanks for looking into this. There are definitely use cases for limiting how long a K8s job is kept in the cluster after its completion.

Options related to Kubernetes can be expressed in a few different ways in Metaflow:

Env variable
Config file
Decorators
CLI option

Can you make sure that all of the above are covered in the PR? For reference, take a look at this relatively recent PR that added support for K8s tolerations.

Minor comments:

Given that the K8s config option is called ttl_seconds_after_finished, wouldn't KUBERNETES_JOB_LIFETIME be confusing? Why not simply call the config option KUBERNETES_JOB_TTL_SECONDS_AFTER_FINISHED?

Also, it would help the review immensely if you included the testing that was done with the code changes (see the same PR above as an example).

Feb 13 '23 17:02 shrinandj

@derfred: you can ignore the failures. Yes, they are internal tests, but the failures in this case are ignorable (I am removing that test).

Feb 13 '23 17:02 romain-intel

@derfred - Is there any reason for the TTL to be only scoped to Kubernetes Jobs and not pods?

Feb 13 '23 18:02 savingoyal

@shrinandj I've added the additional configuration options and detailed the local testing I have done. Could you have another look at it?

Feb 28 '23 13:02 derfred

@savingoyal In my testing, the pods are cleaned up automatically by the Kubernetes garbage collector when the job is deleted.
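
For context, a minimal sketch of where the TTL ends up on the Kubernetes side: the batch/v1 Job spec has a ttlSecondsAfterFinished field, and once it expires the Job is deleted together with its dependent pods. The helper build_job_manifest, the job name, and the image below are hypothetical and only illustrate the shape of the manifest. This is not the code Metaflow uses to create jobs.

```python
def build_job_manifest(name, image, command, ttl_seconds):
    """Return a minimal batch/v1 Job manifest as a plain dict (illustrative only)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # The Job becomes eligible for deletion this many seconds
            # after it finishes (either Complete or Failed).
            "ttlSecondsAfterFinished": int(ttl_seconds),
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": list(command)}
                    ],
                }
            },
        },
    }


# Hypothetical example values.
manifest = build_job_manifest(
    "ttl-demo", "python:3.11", ["python", "-c", "print('done')"], 300
)
```

Because the pods created for a Job are owned by that Job, no separate TTL on the pods is needed in this sketch.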

Feb 28 '23 13:02 derfred

Any update on this? Would love to configure the job lifetime!

Aug 31 '23 10:08 tslott

Add config option for lifetime of Kubernetes jobs
