Jobs with many parallel tasks cause GKE cluster failures
When testing a moderately parallelized job on GKE (24 exome samples, 48 files), the cluster API becomes unresponsive and triggers failures in the workflow. The failures are not caused by the computations themselves.
This is easily reproducible with a simple but highly parallel workflow that generates 175 parallel containers: https://gist.github.com/dleehr/afdcde15aef9d727fd5226beddef126d
The above sleep-echo workflow doesn't seem to cause problems on a development cluster (Docker Desktop for Mac), so I'm surprised it can overwhelm a GKE cluster.
Some ideas:
- Local k8s was 1.14; GKE starts at 1.12. Try upgrading the cluster.
- Upgrade the Python k8s client to 9.0 (see #60)
- Compare to OpenShift on local infrastructure
- Revisit threading logic. The threading logic uses cwltool `--parallel`. While running, `docker stats` reported over 2000 PIDs for the calrissian container that orchestrates the workflow. This PID count includes processes and kernel threads. It's not clear that this is a problem, but it may be a symptom.
- Revisit the design that launches a cluster API connection to watch/follow logs on every pod. Maintaining 175 parallel watcher connections to the cluster may not be a solid foundation (see the first sketch after this list).
- Add retry logic to the cluster API calls, e.g. with tenacity, as in https://github.com/Duke-GCB/DukeDSClient/pull/261 (see the second sketch below).
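
To make the watcher concern concrete, here is a minimal sketch of the per-pod log-follow pattern using the official Python kubernetes client. The thread-per-pod structure and the `follow_pod_logs` helper are illustrative assumptions, not calrissian's actual code:

```python
import threading

from kubernetes import client, config, watch


def follow_pod_logs(core_v1, pod_name, namespace="default"):
    """Stream logs for a single pod over a dedicated API connection."""
    w = watch.Watch()
    # Each stream() call holds one HTTP connection open against the API server
    # for as long as the pod keeps producing logs.
    for line in w.stream(core_v1.read_namespaced_pod_log,
                         name=pod_name,
                         namespace=namespace,
                         follow=True):
        print(f"[{pod_name}] {line}")


if __name__ == "__main__":
    config.load_kube_config()  # or load_incluster_config() when running in a pod
    core_v1 = client.CoreV1Api()

    # With 175 parallel steps this becomes 175 threads, each pinning its own
    # watch connection to the cluster API -- the load pattern suspected above.
    pod_names = [f"step-{i}" for i in range(175)]  # hypothetical pod names
    threads = [threading.Thread(target=follow_pod_logs, args=(core_v1, name))
               for name in pod_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```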
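
And a minimal sketch of what tenacity-style retries around a cluster API call could look like. The retry parameters and the `read_pod_status` helper are assumptions for illustration, not an agreed design:

```python
from kubernetes.client.rest import ApiException
from tenacity import (retry, retry_if_exception_type,
                      stop_after_attempt, wait_exponential)


@retry(retry=retry_if_exception_type(ApiException),
       wait=wait_exponential(multiplier=1, max=30),  # back off up to 30s
       stop=stop_after_attempt(5),                   # give up after 5 tries
       reraise=True)
def read_pod_status(core_v1, pod_name, namespace="default"):
    """Read a pod's status, retrying transient API server errors."""
    return core_v1.read_namespaced_pod_status(name=pod_name, namespace=namespace)
```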