Alexander Zhipa
Making it possible to create a previously non-docked floating DockNode; removing DockNode's internal BorderPane min width/height augmentation, as it results in visual artifacts for floating nodes (looks like it was...
**Is your feature request related to a problem? Please describe.** A [paper](https://arxiv.org/abs/2202.09368) was published on potentially better token-expert routing for MoE that leaves fewer experts under-trained. **Describe the solution you'd...
I hit this cryptic error while running tests on a device with a single GPU. DeepSpeed: `master` PyTorch: `1.12.1` NCCL: `2.10.3` ### Current Behavior Steps to reproduce: 1. `pytest tests/unit/checkpoint/test_moe_checkpoint.py...
When `load_optimizer_states=False` is used for MoE `load_checkpoint` - do not attempt to load the optimizer state files. This currently fails as DeepSpeed still attempts to load those, even though they...
A simple check to make sure we compare `partition_id`s for the same `process_group`. Fix #3521
**Describe the bug** The decision to coalesce adjacent tensors is currently based only on `partition_id` [here](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage_1_and_2.py#L923), which is simply the target's rank in `range(dist.get_world_size(group=process_group))`. This...
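The problem described above can be illustrated with a minimal sketch. This is not DeepSpeed code: the `Target` class, the group labels, and both helper functions are hypothetical names introduced only for illustration. It assumes two tensors reduced in different process groups can end up with the same `partition_id`, since that id is just a rank within each group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Hypothetical stand-in for a reduction target (not a DeepSpeed type)."""
    process_group: str  # illustrative label for the group used for the reduction
    partition_id: int   # rank of the target within that group


def can_coalesce_by_partition_only(a: Target, b: Target) -> bool:
    # Mirrors the behaviour described in the issue: only partition_id is compared.
    return a.partition_id == b.partition_id


def can_coalesce(a: Target, b: Target) -> bool:
    # Sketch of the stricter check: partition ids are only comparable
    # when both targets belong to the same process group.
    return a.process_group == b.process_group and a.partition_id == b.partition_id


# Two targets with the same rank but in different (assumed) groups.
dense = Target(process_group="data_parallel", partition_id=0)
expert = Target(process_group="expert_data_parallel", partition_id=0)

naive_result = can_coalesce_by_partition_only(dense, expert)
strict_result = can_coalesce(dense, expert)
```

With the partition-only comparison the two targets look identical, while the group-aware comparison keeps them apart.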
Recreating previous pull request: https://github.com/microsoft/DeepSpeed/pull/3522 Fixed #3521
Most of the changes in this PR are opinionated, with the main goal of starting a discussion. I'm happy to address the feedback and update the PR and documentation accordingly. Here's...
### Description & Motivation We want to call `log_artifact` and publish tags (e.g. with `MLFlowLogger`) only _after_ the checkpoint has been saved successfully. `AsyncCheckpointIO` does not seem to have a...
*Issue #, if available:* Related to https://github.com/aws/sagemaker-training-toolkit/pull/205 *Description of changes:* `torch_distributed` uses `torchrun` which supports running python modules via `-m `. Currently SageMaker limits torch_distributed to scripts only. This change...