Zhen Zhang

Results 7 issues of Zhen Zhang

**Is your feature request related to a problem? Please describe.** Instead of partitioning the model across all devices as ZeRO-3 does, using subgroup sharding and hierarchical communication can improve the...

enhancement

This PR drafts a tentative implementation of MiCS (https://arxiv.org/abs/2205.00119), which was first discussed in #2801. To try out the implementation, you can test it with a toy model at https://github.com/microsoft/DeepSpeed/blob/fdb8706a5f0b7564fe92d20cbeea460c2b569983/tests/small_model_debugging/test_mics_config.py...

- Only the first partition group will save the model checkpoints
- Need to avoid calling `dist.barrier` on the `WORLD` group
- Including support for loading the partitioned model...

In response to the request in https://github.com/microsoft/DeepSpeed/pull/2964#issuecomment-1832161865, I added three more unit tests related to MiCS. There are two known issues:
- Testing on Torch 2.1.0 triggers `_IllegalWorker` in coalesced...

Hi there, thanks for sharing the implementation of bamboo! We were able to build the Docker environment after fixing the key as follows, but we are not sure how to run the...

Hi, I am building the master branch. I have installed `tachyon` using `install_tachyon.sh`. While building `velox-modelserver` with the command `mvn package`, it raises a lot of errors related to one class, ClientStore,...

Hi @geoffxy, I would like to run skyline in dev mode. When running `./dev-setup.sh`, I got the following errors:

```
ERROR: Command errored out with exit status 1:
 command: /home/ubuntu/skyline/cli/env/bin/python -u -c...
```

bug