DeepSpeed
Add Cache to Comm Group
This adds a global cache for creating new comm groups. Rather than returning a unique object on every call, requests for an identical group (same backend, same ranks) share a single object. The motivation is that each group creation can reserve a small amount of memory that is not reclaimed when the group object itself is freed. This leaks memory in a long-running application like MII, where models may be destroyed and created many times.
Areas for feedback:
- Does this have any side effect for training workloads? Are there any assumptions that the group objects need to be unique?
- Would this make more sense as optional opt-in behavior, or built as an additional wrapper, rather than integrated directly into `new_group`?