Add Cache to Comm Group

Open cmikeh2 opened this issue 1 year ago • 0 comments

This adds a global cache for creating new comm groups. Rather than returning a unique object on every call, requests for an identical group (same backend, same ranks) share a single object. The motivation is that each group creation can reserve a small amount of memory that is not reclaimed when the group object itself is freed. This results in a memory leak in a long-running application like MII, where models may be destroyed and created many times.

Areas for feedback:

  1. Does this have any side effects for training workloads? Are there any assumptions that the group objects need to be unique?
  2. Does it make more sense to offer this as optional, opt-in behavior, or to build it as an additional wrapper, rather than integrating it directly into new_group?

cmikeh2 • Dec 20 '23 22:12