Add config option for lifetime of Kubernetes jobs

Open derfred opened this issue 2 years ago • 9 comments

This PR makes it configurable how long Kubernetes jobs are kept after they have completed.

Testing Done

script without kubernetes decorator:

from metaflow import FlowSpec, step

class TestTTLFlow(FlowSpec):
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        print("TestTTLFlow is all done.")

if __name__ == "__main__":
    TestTTLFlow()

  • python flow.py run --with kubernetes => use ttl of 604800
  • python flow.py run --with kubernetes # with METAFLOW_KUBERNETES_JOB_TTL_SECONDS_AFTER_FINISHED=100 in config.json => use ttl of 100
  • python flow.py run --with kubernetes:ttl_after_finished=300 => use ttl of 300
  • METAFLOW_KUBERNETES_JOB_TTL_SECONDS_AFTER_FINISHED=400 python flow.py run --with kubernetes => use ttl of 400

script with decorator:

from metaflow import FlowSpec, step, kubernetes

class TestTTLFlow(FlowSpec):
    @step
    def start(self):
        self.next(self.end)

    @kubernetes(ttl_after_finished=600)
    @step
    def end(self):
        print("TestTTLFlow is all done.")

if __name__ == "__main__":
    TestTTLFlow()

  • python flow.py run => use ttl of 600
  • METAFLOW_KUBERNETES_JOB_TTL_SECONDS_AFTER_FINISHED=400 python flow.py run --with kubernetes => use ttl of 400 for the first step and ttl of 600 for the step with the kubernetes decorator
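
The precedence these runs suggest can be sketched as follows. This is a minimal illustration, not Metaflow's implementation: `resolve_ttl` and its arguments are hypothetical names, the default of 604800 is taken from the first run above, and the assumption that an environment variable overrides config.json is not exercised by any of the runs listed here.

```python
import os

DEFAULT_TTL = 604800  # default observed in the first run above (7 days)
ENV_NAME = "METAFLOW_KUBERNETES_JOB_TTL_SECONDS_AFTER_FINISHED"


def resolve_ttl(step_attribute=None, config_file=None, environ=None):
    """Return the TTL a step would use (hypothetical helper).

    step_attribute: value from @kubernetes(ttl_after_finished=...) or
                    --with kubernetes:ttl_after_finished=...
    config_file:    dict parsed from config.json
    environ:        environment mapping, defaults to os.environ
    """
    environ = os.environ if environ is None else environ
    config_file = config_file or {}
    # 1. A value set on the step itself (decorator or --with attribute) wins.
    if step_attribute is not None:
        return int(step_attribute)
    # 2. Otherwise the environment variable (assumed to beat the config file).
    if ENV_NAME in environ:
        return int(environ[ENV_NAME])
    # 3. Otherwise the config file entry.
    if ENV_NAME in config_file:
        return int(config_file[ENV_NAME])
    # 4. Otherwise the built-in default.
    return DEFAULT_TTL
```

For example, `resolve_ttl(step_attribute=300)` gives 300 whatever the environment contains, while `resolve_ttl(environ={ENV_NAME: "400"})` gives 400.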

derfred avatar Feb 12 '23 08:02 derfred

Testing[329] @ e3a9cebdbbde424e9e34ad666b132d5673733221

nflx-mf-bot avatar Feb 13 '23 07:02 nflx-mf-bot

Testing[329] @ e3a9cebdbbde424e9e34ad666b132d5673733221 had 6 FAILUREs.

nflx-mf-bot avatar Feb 13 '23 10:02 nflx-mf-bot

@romain-intel I suspect the @nflx-mf-bot is executing internal tests. Can I see the output anywhere?

derfred avatar Feb 13 '23 11:02 derfred

Hey @derfred, thanks for looking into this. There are definitely use-cases for limiting how long the K8s job is kept in the cluster after its completion.

Options related to Kubernetes can be expressed in a few different ways in Metaflow:

  • Env variable
  • Config file
  • Decorators
  • CLI option

Can you make sure that all of the above are covered in the PR? For reference, take a look at this relatively recent PR that added support for K8s tolerations.

Minor comments:

  • Given that the K8s config option is called ttl_seconds_after_finished, wouldn't KUBERNETES_JOB_LIFETIME be confusing? Why not simply call the config option KUBERNETES_JOB_TTL_SECONDS_AFTER_FINISHED?

Also, it would help the review immensely if you included the testing that was done with the code changes (see the same PR above as an example).
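
On the naming point: the Kubernetes Job API exposes this setting as `spec.ttlSecondsAfterFinished` (`ttl_seconds_after_finished` in the Python client). Below is a minimal sketch of where the value would land in a Job manifest, written as a plain dictionary. `build_job_manifest`, the job name, and the image are hypothetical, and this is not how Metaflow builds its jobs.

```python
import json


def build_job_manifest(name, image, ttl_seconds_after_finished):
    """Build a batch/v1 Job manifest as a dict (illustrative helper)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # The Job becomes eligible for automatic deletion this many
            # seconds after it finishes.
            "ttlSecondsAfterFinished": int(ttl_seconds_after_finished),
            "template": {
                "spec": {
                    "containers": [{"name": name, "image": image}],
                    "restartPolicy": "Never",
                }
            },
        },
    }


# Hypothetical values, chosen only to show the shape of the manifest.
manifest = build_job_manifest("example-step", "python:3.11", 300)
print(json.dumps(manifest, indent=2))
```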

shrinandj avatar Feb 13 '23 17:02 shrinandj

@derfred: you can ignore the failures. Yes, they are internal tests, but the failures in this case can be ignored (I am removing that test).

romain-intel avatar Feb 13 '23 17:02 romain-intel

@derfred - Is there any reason for the TTL to be scoped only to Kubernetes Jobs and not pods?

savingoyal avatar Feb 13 '23 18:02 savingoyal

@shrinandj I've added the additional configuration options and detailed the local testing I have done. Could you have another look at it?

derfred avatar Feb 28 '23 13:02 derfred

@savingoyal In my testing, the pods are cleaned up automatically by the Kubernetes garbage collector when the job is deleted.
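
This is consistent with how Kubernetes garbage collection is documented to work: pods created by a Job carry an owner reference to it, and deleting the owner cascades to its dependents by default. Here is a toy model of that cascade, with no cluster involved. The `Obj` class, `delete_with_dependents`, and the object names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Obj:
    uid: str
    kind: str
    owner_uid: Optional[str] = None  # stands in for metadata.ownerReferences


def delete_with_dependents(store: Dict[str, Obj], uid: str) -> None:
    """Remove an object and, recursively, everything that names it as owner."""
    store.pop(uid, None)
    dependents = [o.uid for o in store.values() if o.owner_uid == uid]
    for dep_uid in dependents:
        delete_with_dependents(store, dep_uid)


store = {
    "job-1": Obj("job-1", "Job"),
    "pod-a": Obj("pod-a", "Pod", owner_uid="job-1"),
    "pod-b": Obj("pod-b", "Pod", owner_uid="job-2"),  # owned by a different job
}
delete_with_dependents(store, "job-1")
print(sorted(store))  # only the pod owned by the other job remains
```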

derfred avatar Feb 28 '23 13:02 derfred

Any update on this? Would love to configure the job lifetime!

tslott avatar Aug 31 '23 10:08 tslott