Feat: Improve UX of pytorch-elastic plugin by configuring reasonable defaults
Tracking issue
Closes https://github.com/flyteorg/flyte/issues/5339
Why are the changes needed?
As outlined in the tracking issue, when using the pytorch elastic plugin, users almost always have to configure the following on their own:
- A pod template with a volume and volume mount that increase the shared memory, because the container's default shared memory segment size is very often too small when torch multiprocessing is used (e.g. for multi-process data loaders).
- A higher join timeout, because some worker pods often start faster than others, causing the rendezvous of the workers to fail with a timeout.
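
For illustration, the manual shared-memory workaround from the first bullet amounts to roughly the following pod spec fragment. This is a minimal sketch built from plain Python dicts rather than the Kubernetes client models; the helper name `shared_memory_pod_spec`, the volume name `dshm` and the container name `primary` are hypothetical and not taken from the plugin.

```python
from typing import Any, Dict


def shared_memory_pod_spec(
    container_name: str = "primary",  # hypothetical container name
    volume_name: str = "dshm",  # hypothetical volume name
) -> Dict[str, Any]:
    """Return a pod spec fragment that mounts a memory-backed emptyDir at /dev/shm."""
    return {
        # An emptyDir with medium "Memory" is backed by tmpfs instead of node disk.
        "volumes": [{"name": volume_name, "emptyDir": {"medium": "Memory"}}],
        "containers": [
            {
                "name": container_name,
                # Mounting the volume at /dev/shm replaces the small default segment.
                "volumeMounts": [{"name": volume_name, "mountPath": "/dev/shm"}],
            }
        ],
    }


spec = shared_memory_pod_spec()
```

The volume and the volume mount have to share the same name, which is why both are derived from a single argument here.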
What changes were proposed in this pull request?
Configure reasonable defaults to improve the UX for users:
- Add a flag to `task_config=Elastic()` and `PyTorch()` which allows adding such a shared memory volume to the pod template. The flag defaults to true.
- Configure reasonable join timeouts of 15 minutes.
15 minutes was chosen as an estimate for the time difference between the startup of a pod which is immediately assigned to a running node which has the image pulled (a few seconds) and a pod which requires a node to be scaled up and the image to be pulled.
If users require a larger timeout, they can increase the values, but should probably use a gang scheduler instead, as described here.
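
One way to picture how a default join timeout can coexist with user overrides is sketched below. It is not the plugin's actual code: the helper `resolve_rdzv_configs` is hypothetical, and the sketch assumes the rendezvous setting is keyed `join_timeout` and expressed in seconds.

```python
from typing import Any, Dict, Optional

# 15 minutes in seconds (assumption: the rendezvous timeout is given in seconds).
DEFAULT_JOIN_TIMEOUT_S = 15 * 60


def resolve_rdzv_configs(user_configs: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge user-supplied rendezvous settings over the default join timeout."""
    merged: Dict[str, Any] = {"join_timeout": DEFAULT_JOIN_TIMEOUT_S}
    # User values win, so a larger timeout can still be requested explicitly.
    merged.update(user_configs or {})
    return merged


defaults = resolve_rdzv_configs()
overridden = resolve_rdzv_configs({"join_timeout": 1800})
```

With this shape, users who do nothing get the 15 minute default, while users who pass their own value keep full control.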
How was this patch tested?
- Added unit tests
- Ran pytorch and elastic pytorch tasks in a cluster and ensured that the volume/volume mount is added
Sentence for the release notes:
@wild-endeavor
The distributed pytorch and distributed elastic-pytorch tasks in `flytekitplugins-kfpytorch` by default increase the shared memory limit by mounting an `emptyDir` volume with medium `Memory` to `/dev/shm`, as this is almost always required when working with torch multiprocessing (e.g. multi-processed data loader workers or local worker group in distributed training). To disable this, pass `increase_shared_mem=False` to `task_config=PyTorch/Elastic`. `Elastic` tasks now also set a default join timeout of 15 minutes to prevent timeouts when some worker pods require a node scale-up. This setting can be modified via `task_config=Elastic(rdzv_configs={...})`.
Check all the applicable boxes
- [ ] I updated the documentation accordingly.
- [ ] All new and existing tests passed.
- [x] All commits are signed-off.