Feat: Improve UX of pytorch-elastic plugin by configuring reasonable defaults

Open fg91 opened this issue 1 year ago • 2 comments

Tracking issue

Closes https://github.com/flyteorg/flyte/issues/5339

Why are the changes needed?

As outlined in the tracking issue, when using the pytorch elastic plugin, users almost always have to configure the following on their own:

A pod template with a volume and volume mount that increase the shared memory, because the container's default shared memory segment is very often too small when torch multiprocessing is used (e.g. for multi-process data loaders).
A higher join timeout, because some worker pods often start faster than others, which causes the rendezvous of the workers to fail with a timeout.
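
For illustration, the first item usually amounts to adding a memory-backed emptyDir volume and mounting it at /dev/shm. The sketch below shows this on a pod spec represented as a plain dictionary. The helper name `add_shared_memory_volume`, the volume name `dshm`, and the container name `primary` are assumptions made for this example and are not taken from the plugin.

```python
import copy
from typing import Any, Dict


def add_shared_memory_volume(pod_spec: Dict[str, Any], container_name: str) -> Dict[str, Any]:
    """Return a copy of pod_spec with a memory-backed emptyDir mounted at /dev/shm.

    Hypothetical helper for illustration only; it operates on a plain dict
    that mirrors the Kubernetes PodSpec fields `volumes` and `containers`.
    """
    spec = copy.deepcopy(pod_spec)
    volume_name = "dshm"  # assumed name, any valid volume name works

    # emptyDir with medium "Memory" is backed by tmpfs instead of node disk.
    spec.setdefault("volumes", []).append(
        {"name": volume_name, "emptyDir": {"medium": "Memory"}}
    )

    # Mount the volume over /dev/shm in the selected container only.
    for container in spec.get("containers", []):
        if container.get("name") == container_name:
            container.setdefault("volumeMounts", []).append(
                {"name": volume_name, "mountPath": "/dev/shm"}
            )
    return spec


patched = add_shared_memory_volume({"containers": [{"name": "primary"}]}, "primary")
```

The input spec is copied rather than modified, so a template shared between tasks is not changed as a side effect.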

What changes were proposed in this pull request?

Configure reasonable defaults to improve the UX for users:

Add a flag to task_config=Elastic() and task_config=PyTorch() that adds such a shared memory volume to the pod template. The flag defaults to true.
Configure reasonable join timeouts of 15 minutes.

15 minutes was chosen as an estimate of the difference in startup time between two kinds of pods: one that is immediately assigned to a running node which already has the image pulled (a few seconds), and one that requires a node to be scaled up and the image to be pulled.

If users require a larger timeout, they can of course increase the values, but they should probably use a gang scheduler instead, as described here.
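
As a minimal sketch of how a 15-minute default could coexist with user-supplied values, the snippet below merges overrides on top of a default dictionary. The function name `resolve_rdzv_configs` and the merge behaviour are assumptions for illustration and may differ from what the plugin does. `join_timeout` is the rendezvous configuration key, in seconds, that torch elastic's dynamic rendezvous reads.

```python
from typing import Any, Dict, Optional

# 15 minutes, expressed in seconds.
DEFAULT_JOIN_TIMEOUT_SECONDS = 15 * 60


def resolve_rdzv_configs(user_configs: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge user rendezvous settings over the defaults (hypothetical helper)."""
    configs: Dict[str, Any] = {"join_timeout": DEFAULT_JOIN_TIMEOUT_SECONDS}
    # User-supplied keys win; keys the user omits keep their defaults.
    configs.update(user_configs or {})
    return configs


defaults_only = resolve_rdzv_configs()
longer_wait = resolve_rdzv_configs({"join_timeout": 30 * 60})
```

With a merge like this, a user who only wants a longer join timeout does not have to restate any other rendezvous setting.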

How was this patch tested?

Added unit tests
Ran pytorch and elastic pytorch tasks in a cluster and ensured that the volume/volume mount is added

Sentence for the release notes:

@wild-endeavor

The distributed pytorch and distributed elastic-pytorch tasks in flytekitplugins-kfpytorch now increase the shared memory limit by default. They do so by mounting an emptyDir volume with medium Memory to /dev/shm, as this is almost always required when working with torch multiprocessing (e.g. multi-processed data loader workers or a local worker group in distributed training). To disable this, pass increase_shared_mem=False to task_config=PyTorch/Elastic. Elastic tasks now also set a default join timeout of 15 minutes to prevent timeouts when some worker pods require a node scale-up. This setting can be modified via task_config=Elastic(rdzv_configs={...}).

Check all the applicable boxes

[ ] I updated the documentation accordingly.
[ ] All new and existing tests passed.
[x] All commits are signed-off.

Jul 01 '24 20:07 fg91