# deepspeed-v1 prep notes
Please edit this Issue to collect notes for deepspeed v1 work
## TODO items
- Start adding `TODOV1` comments where needed in preparation for changes, e.g. so far used in changing the logger defaults.

  Example:
  https://github.com/deepspeedai/DeepSpeed/blob/066d912052b5eaf6094a2e57d20a163ba6517db8/deepspeed/launcher/launch.py#L105-L108
## Backward compatibility breaking
This is an opportunity to redesign some APIs and change defaults to better ones. Of course, we should try to minimize any breakage.

We should also consider a back-compat module which, where important, will try to restore the old functionality where possible, to ease the transition.
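For illustration, here is a minimal sketch of what such a shim could look like; the `deprecated` decorator and the wrapped API name are hypothetical, not existing DeepSpeed code:

```python
# Hypothetical back-compat shim: old v0.x entry points keep working, but emit
# a DeprecationWarning and forward to the new v1 implementation.
import functools
import warnings


def deprecated(replacement: str):
    """Mark an old API and point users at its v1 replacement."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__qualname__} is deprecated in deepspeed v1; "
                f"use {replacement} instead",
                DeprecationWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Usage: keep the old name importable, delegating to the new API.
# @deprecated("deepspeed.initialize")
# def initialize_legacy(*args, **kwargs):
#     return deepspeed.initialize(*args, **kwargs)
```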
## Logging subsystem
- change the default logging levels to `logging.WARNING` (see the `TODOV1` tags in the code)
- we need to untangle 3 different loggers and streamline them into, ideally, one. Currently we have:
  - the command-line logger for the launcher
  - the `ds_config` logger for the DS engine
  - the builder logger

  We should probably clean up `deepspeed.utils.__init__`, as it shouldn't load builder code when we are just importing the logger. It's probably best to leave the logger out of `deepspeed.utils.__init__` and use a direct `from deepspeed.utils.logging import logger` instead.
- we want per-module log levels, so that, for example, if `wall_clock_breakdown: true` is set, the stats are printed regardless of the log level. Currently they aren't (I'm adding a workaround to use `print` instead). See the sketch after this list.
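A minimal sketch of per-module log levels, assuming we standardize on a single stdlib `logging` hierarchy rooted at one `deepspeed` logger; the logger names below are illustrative, not the current layout:

```python
import logging
import sys


def get_logger(name: str = "deepspeed") -> logging.Logger:
    # Lazily configure a single handler on the root "deepspeed" logger;
    # child loggers propagate their records to it.
    root = logging.getLogger("deepspeed")
    if not root.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter("[%(name)s] %(levelname)s: %(message)s"))
        root.addHandler(handler)
        root.setLevel(logging.WARNING)  # the proposed new default
    return logging.getLogger(name)


# Per-module levels come for free from the logger hierarchy:
get_logger("deepspeed.launcher").setLevel(logging.INFO)

# A feature like wall_clock_breakdown can force its own child logger to INFO,
# so timing stats are emitted regardless of the global default:
timers = get_logger("deepspeed.timers")
timers.setLevel(logging.INFO)
timers.info("fwd: 12.30ms | bwd: 24.60ms")  # prints even though the root is WARNING
```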
## Collectives / Comms
- Process group management: switch to `DeviceMesh` to modernize DeepSpeed (see the sketch after this list)
- where possible, drop the custom functional collective API and replace it with the optimized `torch.distributed` API (which didn't exist when DeepSpeed was created)
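A minimal sketch of the direction, assuming `torch.distributed.device_mesh` replaces the hand-rolled group bookkeeping; the 2D `dp`/`tp` layout is just an example:

```python
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh


def build_mesh(dp: int, tp: int):
    # Requires torch.distributed to be initialized (e.g. via torchrun) with
    # world_size == dp * tp; the flat world is split into a named 2D mesh.
    return init_device_mesh("cuda", (dp, tp), mesh_dim_names=("dp", "tp"))


# Each named sub-mesh exposes a regular ProcessGroup, so the optimized
# torch.distributed collectives can be used directly:
# mesh = build_mesh(dp=4, tp=2)
# dist.all_reduce(grad, group=mesh["dp"].get_group())
```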
Related discussions:
- https://github.com/deepspeedai/DeepSpeed/pull/7526#discussion_r2312645622
@stas00 We can take a closer look at refreshing process group management in the next couple of months. May I know if you have any detailed expectations for the refreshed APIs and internals?

Here are my initial thoughts:
- At the API level, obsolete `mpu` and `mesh_param` of `deepspeed.initialize`. It looks better to me to infer the device mesh topology from the user-provided config (like what is done today for sequence + data parallel), as in the hypothetical sketch after this list. With more potential combinations of different parallelisms, accepting a `mesh_param` with an assumed structure becomes error-prone.
- Use a `DeviceMesh` as `PipelineModule._grid`.
- Refine (or possibly abandon?) `deepspeed.utils.groups` and replace invocations of its APIs with `DeviceMesh` ones.
- As for the supported combinations, here's what's in my mind:
  - data + sequence
  - data + pipeline + model (or, to be accurate, tensor)
  - How about torch-HSDP-style hybrid data parallel?
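A hypothetical sketch of inferring the mesh topology from the user config instead of accepting a pre-built `mesh_param`; the config keys and the dim ordering are assumptions, not an agreed design:

```python
from torch.distributed.device_mesh import init_device_mesh


def mesh_from_config(config: dict, world_size: int):
    # Assumed config keys, for illustration only.
    tp = config.get("tensor_parallel_size", 1)
    pp = config.get("pipeline_parallel_size", 1)
    assert world_size % (tp * pp) == 0, "world_size must be divisible by tp * pp"
    dp = world_size // (tp * pp)
    # dp varies slowest and tp fastest, keeping tensor-parallel ranks adjacent
    # (typically on the same node).
    return init_device_mesh("cuda", (dp, pp, tp), mesh_dim_names=("dp", "pp", "tp"))
```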
@eternalNight, would it be better to discuss each of these sub-plans in a separate Issue and keep this one focused mainly on the agreed-upon tentative plans? We can then link to each discussion Issue from the OP for background.