# deepspeed-v1 prep notes
Please edit this Issue to collect notes for deepspeed v1 work
## TODO items
- Start adding `TODOV1` comments where needed in preparation for changes, e.g. so far used in changing the logger defaults.

  Example:
  https://github.com/deepspeedai/DeepSpeed/blob/066d912052b5eaf6094a2e57d20a163ba6517db8/deepspeed/launcher/launch.py#L105-L108
## Backward compatibility breaking
This is an opportunity to redesign some APIs and change defaults to better ones. Of course, we should try to minimize any breakage.

We should also consider a back-compat module which, where important, will try to restore the old functionality where possible, to ease the transition.
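For illustration, here is a minimal sketch of what such a shim could look like; the `deprecated` decorator and the wrapped API name are hypothetical, not existing DeepSpeed code:

```python
# Hypothetical back-compat shim: old v0.x entry points keep working, but emit
# a DeprecationWarning and forward to the new v1 implementation.
import functools
import warnings


def deprecated(replacement: str):
    """Mark an old API and point users at its v1 replacement."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__qualname__} is deprecated in deepspeed v1; "
                f"use {replacement} instead",
                DeprecationWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Usage: keep the old name importable, delegating to the new API.
# @deprecated("deepspeed.initialize")
# def initialize_legacy(*args, **kwargs):
#     return deepspeed.initialize(*args, **kwargs)
```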
## Logging subsystem
- change the default logging levels to `logging.WARNING` (see the `TODOV1` tags in the code)
- we need to untangle 3 different loggers and streamline them into, ideally, one. Currently we have:
  - the command-line logger for the launcher
  - the `ds_config` logger for the DS engine
  - the builder logger

  We should probably clean up `deepspeed.utils.__init__`, as it shouldn't load builder code when we are just importing the logger. It's probably best to leave the logger out of `deepspeed.utils.__init__` and use a direct `from deepspeed.utils.logging import logger` instead.
- we want per-module log levels, so that, for example, if `wall_clock_breakdown: true` is set, the stats are printed regardless of the log level. Currently they aren't (I'm adding a workaround to use `print` instead). See the sketch after this list.
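A minimal sketch of per-module log levels, assuming we standardize on a single stdlib `logging` hierarchy rooted at one `deepspeed` logger; the logger names below are illustrative, not the current layout:

```python
import logging
import sys


def get_logger(name: str = "deepspeed") -> logging.Logger:
    # Lazily configure a single handler on the root "deepspeed" logger;
    # child loggers propagate their records to it.
    root = logging.getLogger("deepspeed")
    if not root.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter("[%(name)s] %(levelname)s: %(message)s"))
        root.addHandler(handler)
        root.setLevel(logging.WARNING)  # the proposed new default
    return logging.getLogger(name)


# Per-module levels come for free from the logger hierarchy:
get_logger("deepspeed.launcher").setLevel(logging.INFO)

# A feature like wall_clock_breakdown can force its own child logger to INFO,
# so timing stats are emitted regardless of the global default:
timers = get_logger("deepspeed.timers")
timers.setLevel(logging.INFO)
timers.info("fwd: 12.30ms | bwd: 24.60ms")  # prints even though the root is WARNING
```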
## Collectives / Comms
- Process group management: switch to `DeviceMesh` to modernize DeepSpeed (see the sketch after this list)
- where possible, drop the custom functional collective API and replace it with the optimized `torch.distributed` API (which didn't exist when DeepSpeed was created)
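A minimal sketch of the direction, assuming `torch.distributed.device_mesh` replaces the hand-rolled group bookkeeping; the 2D `dp`/`tp` layout is just an example:

```python
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh


def build_mesh(dp: int, tp: int):
    # Requires torch.distributed to be initialized (e.g. via torchrun) with
    # world_size == dp * tp; the flat world is split into a named 2D mesh.
    return init_device_mesh("cuda", (dp, tp), mesh_dim_names=("dp", "tp"))


# Each named sub-mesh exposes a regular ProcessGroup, so the optimized
# torch.distributed collectives can be used directly:
# mesh = build_mesh(dp=4, tp=2)
# dist.all_reduce(grad, group=mesh["dp"].get_group())
```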
Related discussions:
- https://github.com/deepspeedai/DeepSpeed/pull/7526#discussion_r2312645622
@stas00 We can take a closer look at refreshing process group management in the next couple of months. May I know if you have any detailed expectations for the refreshed APIs and internals?

Here are my initial thoughts:
- At the API level, obsolete `mpu` and `mesh_param` of `deepspeed.initialize`. It looks better to me to infer the device mesh topology from the user-provided config (like what is done today for sequence + data parallel), as in the hypothetical sketch after this list. With more potential combinations of different parallelisms, accepting a `mesh_param` with an assumed structure becomes error-prone.
- Use a `DeviceMesh` as `PipelineModule._grid`.
- Refine (or possibly abandon?) `deepspeed.utils.groups` and replace invocations of its APIs with `DeviceMesh` ones.
- As for the supported combinations, here's what's in my mind:
  - data + sequence
  - data + pipeline + model (or, to be accurate, tensor)
  - How about torch-HSDP-style hybrid data parallel?
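A hypothetical sketch of inferring the mesh topology from the user config instead of accepting a pre-built `mesh_param`; the config keys and the dim ordering are assumptions, not an agreed design:

```python
from torch.distributed.device_mesh import init_device_mesh


def mesh_from_config(config: dict, world_size: int):
    # Assumed config keys, for illustration only.
    tp = config.get("tensor_parallel_size", 1)
    pp = config.get("pipeline_parallel_size", 1)
    assert world_size % (tp * pp) == 0, "world_size must be divisible by tp * pp"
    dp = world_size // (tp * pp)
    # dp varies slowest and tp fastest, keeping tensor-parallel ranks adjacent
    # (typically on the same node).
    return init_device_mesh("cuda", (dp, pp, tp), mesh_dim_names=("dp", "pp", "tp"))
```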
@eternalNight, would it be better to discuss each of these sub-plans in a separate Issue and keep this one focused mainly on the agreed-upon tentative plans? We can then link to each discussion Issue from the OP for background.