Ma, Guokai
Hi @loadams This PR has some new changes that are being worked on for merging into master, and I have updated the PR description. Can you help reopen this PR in draft mode? Thanks!...
> @abhilash1910, thanks for this PR. I think this PR needs some work that leverages PR #3633 for the following reasons. > > 1. As you observed, strings like `torch.cpu.DoubleTensor`...
> Yes, dtype is better. Some additional changes in _reduce_non_expert_gradients and _reduce_expert_gradients will be needed accordingly.
Hi @abhilash1910 can you check and fix the following error? https://github.com/microsoft/DeepSpeed/actions/runs/6952457019/job/18941598524?pr=3842#step:8:4568 ``` File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 126, in split_half_float_double_sparse assert t.dtype in supported_types, f"attempting to reduce an unsupported grad type: {t.dtype}"...
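To illustrate the dtype-based grouping being discussed in these comments, here is a minimal sketch. This is not DeepSpeed's actual `split_half_float_double_sparse` implementation; the `FakeTensor` class is a hypothetical stand-in (carrying only a `dtype` tag) so the snippet runs without torch, and the assertion mirrors the error message seen in the CI log above.

```python
# Sketch: bucket gradient tensors by dtype instead of comparing type
# strings like "torch.cpu.DoubleTensor". With real torch tensors,
# `t.dtype` plays the same role as the string tag used here.
from collections import defaultdict


class FakeTensor:
    """Minimal stand-in for a tensor, carrying only a dtype tag."""

    def __init__(self, dtype):
        self.dtype = dtype


def split_by_dtype(tensors, supported_dtypes):
    """Group tensors into buckets keyed by dtype.

    Raises AssertionError for any tensor whose dtype is not supported,
    mirroring the check in the CI failure above.
    """
    buckets = defaultdict(list)
    for t in tensors:
        assert t.dtype in supported_dtypes, (
            f"attempting to reduce an unsupported grad type: {t.dtype}")
        buckets[t.dtype].append(t)
    return buckets


grads = [FakeTensor("float16"), FakeTensor("float32"), FakeTensor("float16")]
buckets = split_by_dtype(grads, {"float16", "float32", "float64"})
```

Grouping on `dtype` sidesteps the accelerator-specific type-string problem (`torch.cpu.DoubleTensor` vs `torch.cuda.DoubleTensor`), since a dtype is device-independent.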
Hi @abhilash1910 can you clarify whether the current CI failures are related to your PR or are just a test issue? Thanks!
Hi @abhilash1910 some suggestions: 1. provide more details (hw, sw, log ...) of your local run, so there might be a hint of the difference. 2. try to modify the test as...
@tjruwase @jeffra could you assign a reviewer for this PR? This PR fixes OPT checkpoint sharded loading with AutoTP and improves OPT+AutoTP usability; it is needed when running OPT models on...
@RezaYazdaniAminabadi can you review this PR? This PR fixes OPT sharded loading for AutoTP. Previously only OPT-125m had sharded checkpoint loading; with this fix, OPT >350m will have sharded checkpoint...
@RezaYazdaniAminabadi Hi, a quick check on whether this PR is still under consideration. We have verified this PR for the CPU accelerator and would like to know whether it could be merged into...
Does it make sense to also update [docs/_tutorials/automatic-tensor-parallelism.md](https://github.com/microsoft/DeepSpeed/blob/master/docs/_tutorials/automatic-tensor-parallelism.md) to include this model in the supported list?