
deepspeed-v1 prep notes

stas00 opened this issue 3 months ago · 2 comments

Please edit this Issue to collect notes for the DeepSpeed v1 work.

TODO items

Start adding TODOV1 comments where needed in preparation for the changes; so far this has been done for changing the logger defaults.

Example:

https://github.com/deepspeedai/DeepSpeed/blob/066d912052b5eaf6094a2e57d20a163ba6517db8/deepspeed/launcher/launch.py#L105-L108

Backward compatibility breaking

This is an opportunity to redesign some APIs and change defaults for the better. We should, of course, try to minimize any breakage.

We should also consider a back-compat module which, where important, will try to restore the old functionality where possible to ease the transition.
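One possible shape for such a back-compat module is a thin wrapper that keeps old entry points alive while emitting a DeprecationWarning. All names below are hypothetical; this is just a sketch of the pattern, not an actual DeepSpeed API:

```python
import functools
import warnings


def deprecated_alias(new_fn, old_name):
    """Wrap a v1 function so a pre-v1 entry point keeps working but warns."""
    @functools.wraps(new_fn)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated in DeepSpeed v1; "
            f"use {new_fn.__name__} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_fn(*args, **kwargs)
    return wrapper


# Hypothetical example: a v1 function and its legacy alias living in
# a compat module so old user code keeps running.
def get_world_size_v1():
    return 1  # stand-in for the real v1 implementation


get_world_size = deprecated_alias(get_world_size_v1, "get_world_size")
```

Old call sites then keep working for a release or two while telling users what to migrate to.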

Logging subsystem

  1. change the default logging levels to logging.WARNING; see the TODOV1 tags in the code

  2. we need to untangle 3 different loggers and streamline them into ideally just one. Currently we have:

    • Command-line for launcher
    • ds_config for DS engine
    • builder

Probably we should clean up deepspeed.utils.__init__, as it shouldn't load builder code when we are just importing the logger. We should probably leave the logger out of deepspeed.utils.__init__ and use a direct from deepspeed.utils.logging import logger instead.

  3. we want per-module log levels, so that, for example, if wall_clock_breakdown: true is set, stats are printed regardless of the log level; currently they aren't (I'm adding a workaround to use print instead).
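Per-module log levels fall out naturally from Python's hierarchical loggers. A minimal sketch, assuming a v1 convention of one child logger per subsystem under a common "deepspeed" root (the logger names here are assumptions, not the current DeepSpeed layout):

```python
import logging

# Quiet default for everything under the "deepspeed" root.
root = logging.getLogger("deepspeed")
root.setLevel(logging.WARNING)

# Hypothetical child logger for the wall-clock timers subsystem.
timer_log = logging.getLogger("deepspeed.timers")


def enable_wall_clock_breakdown():
    # A config flag like wall_clock_breakdown: true could simply lower
    # this one module's level instead of resorting to print().
    timer_log.setLevel(logging.INFO)


enable_wall_clock_breakdown()
```

After this, timer_log.info(...) records stats while the rest of the codebase stays at WARNING.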

Collectives / Comms

  • Process group management: switch to device mesh to modernize DeepSpeed
  • Where possible, drop the custom functional collective API and replace it with the optimized torch.distributed API (which didn't exist when DeepSpeed was created)
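To illustrate what a device mesh buys us, here is a dependency-free sketch of the coordinate math behind it: a flat rank is factored into one coordinate per named parallelism dimension. In practice torch.distributed.device_mesh.init_device_mesh does this (and also creates the process groups); the dimension names below are illustrative:

```python
def mesh_coords(rank, dims, names):
    """Map a global rank to per-dimension coordinates, last dim fastest."""
    coords = {}
    for name, size in zip(reversed(names), reversed(dims)):
        coords[name] = rank % size
        rank //= size
    return coords


# 2 pipeline stages x 4 data-parallel ranks = 8 processes.
print(mesh_coords(5, (2, 4), ("pp", "dp")))  # -> {'dp': 1, 'pp': 1}
```

With a DeviceMesh, each named dimension also carries a ready-made process group, which is exactly what deepspeed.utils.groups assembles by hand today.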

Related discussions:

  • https://github.com/deepspeedai/DeepSpeed/pull/7526#discussion_r2312645622

stas00 avatar Sep 02 '25 20:09 stas00

@stas00 We can take a closer look at refreshing process group management in the next couple of months. May I know if you have any detailed expectations for the refreshed APIs and internals?

Here are my initial thoughts:

  • At the API level, deprecate the mpu and mesh_param arguments of deepspeed.initialize. It looks better to me to infer the device mesh topology from the user-provided config (like what is done today for sequence + data parallel). With more potential combinations of different parallelisms, accepting a mesh_param with an assumed structure becomes error-prone.
  • Use a DeviceMesh as PipelineModule._grid.
  • Refine (or possibly abandon?) deepspeed.utils.groups and replace invocations to its APIs with DeviceMesh ones.
  • As for the supported combinations, here's what I have in mind:
    • data + sequence
    • data + pipeline + model (or, to be accurate, tensor)
    • How about torch-HSDP-style hybrid data parallel?
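A rough sketch of what "infer the mesh topology from the config" could look like: the explicitly configured parallelism degrees fix some mesh dimensions, and data parallelism fills the remainder of the world size. The config field names here are hypothetical, chosen only for the example:

```python
def infer_mesh_shape(world_size, config):
    """Infer (dims, names) for a device mesh from parallelism degrees
    in a DS-style config dict; field names here are hypothetical."""
    pp = config.get("pipeline_parallel_size", 1)
    tp = config.get("tensor_parallel_size", 1)
    sp = config.get("sequence_parallel_size", 1)
    denom = pp * tp * sp
    if world_size % denom != 0:
        raise ValueError(f"world_size {world_size} not divisible by {denom}")
    dp = world_size // denom  # data parallel fills the remainder
    dims, names = [], []
    for size, name in ((pp, "pp"), (dp, "dp"), (sp, "sp"), (tp, "tp")):
        if size > 1:  # omit trivial dimensions from the mesh
            dims.append(size)
            names.append(name)
    return tuple(dims), tuple(names)


print(infer_mesh_shape(16, {"pipeline_parallel_size": 2,
                            "tensor_parallel_size": 4}))
# -> ((2, 2, 4), ('pp', 'dp', 'tp'))
```

The dimension ordering (which axis varies fastest) would still need to be pinned down and documented, since it determines which ranks end up in the same subgroup.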

eternalNight avatar Oct 27 '25 04:10 eternalNight

@eternalNight, would it be better to discuss each of these sub-plans in a separate Issue and keep this one focused mainly on the agreed-upon tentative plans? We can then link to each discussion Issue from the OP for background.

stas00 avatar Oct 27 '25 17:10 stas00