Jobs with many parallel tasks cause GKE cluster failures
When testing a moderately parallelized job on GKE (24 exome samples, 48 files), the cluster API becomes unresponsive and triggers failures in the workflow. The failures are not caused by the computations themselves.
This is easily reproducible with a simple but highly parallel workflow that generates 175 parallel containers: https://gist.github.com/dleehr/afdcde15aef9d727fd5226beddef126d
The above sleep-echo workflow doesn't seem to cause problems on a development cluster (Docker Desktop for Mac), so I'm surprised it can overwhelm a GKE cluster.
Some ideas:
- Local k8s was 1.14; GKE starts at 1.12. Try upgrading the cluster.
- Upgrade the Python k8s client to 9.0 (see #60)
- Compare to OpenShift on local infrastructure
- Revisit threading logic. The threading logic uses cwltool `--parallel`. While running, `docker stats` reported over 2000 PIDs for the calrissian container that orchestrates the workflow. This PID count includes processes and kernel threads. It's not clear that this is a problem, but it may be a symptom.
- Revisit the design that launches a cluster API connection to watch/follow logs on every pod. Maintaining 175 parallel watcher connections to the cluster may not be a solid foundation (see the first sketch after this list).
- Add retry logic to the cluster API calls, e.g. with tenacity, as in https://github.com/Duke-GCB/DukeDSClient/pull/261 (see the second sketch below).
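
To make the watcher concern concrete, here is a minimal sketch of the per-pod log-follow pattern using the official Python kubernetes client. The thread-per-pod structure and the `follow_pod_logs` helper are illustrative assumptions, not calrissian's actual code:

```python
import threading

from kubernetes import client, config, watch


def follow_pod_logs(core_v1, pod_name, namespace="default"):
    """Stream logs for a single pod over a dedicated API connection."""
    w = watch.Watch()
    # Each stream() call holds one HTTP connection open against the API server
    # for as long as the pod keeps producing logs.
    for line in w.stream(core_v1.read_namespaced_pod_log,
                         name=pod_name,
                         namespace=namespace,
                         follow=True):
        print(f"[{pod_name}] {line}")


if __name__ == "__main__":
    config.load_kube_config()  # or load_incluster_config() when running in a pod
    core_v1 = client.CoreV1Api()

    # With 175 parallel steps this becomes 175 threads, each pinning its own
    # watch connection to the cluster API -- the load pattern suspected above.
    pod_names = [f"step-{i}" for i in range(175)]  # hypothetical pod names
    threads = [threading.Thread(target=follow_pod_logs, args=(core_v1, name))
               for name in pod_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```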
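
And a minimal sketch of what tenacity-style retries around a cluster API call could look like. The retry parameters and the `read_pod_status` helper are assumptions for illustration, not an agreed design:

```python
from kubernetes.client.rest import ApiException
from tenacity import (retry, retry_if_exception_type,
                      stop_after_attempt, wait_exponential)


@retry(retry=retry_if_exception_type(ApiException),
       wait=wait_exponential(multiplier=1, max=30),  # back off up to 30s
       stop=stop_after_attempt(5),                   # give up after 5 tries
       reraise=True)
def read_pod_status(core_v1, pod_name, namespace="default"):
    """Read a pod's status, retrying transient API server errors."""
    return core_v1.read_namespaced_pod_status(name=pod_name, namespace=namespace)
```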