Ma, Guokai
Better core binding in torch.backends.xeon.run_cpu when launched from torchrun with --nproc-per-node
This PR fixes the behavior of `torch.backends.xeon.run_cpu` when it is launched from `torchrun` with the `--nproc-per-node` parameter. As a CPU launcher, `run_cpu` binds cores to each instance it launches using `numactl`, and...
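A rough sketch of the underlying idea, not the actual `run_cpu` implementation: `torchrun` exports `LOCAL_RANK` and `LOCAL_WORLD_SIZE` to each process, so an instance can compute a disjoint core range for itself instead of binding to all cores.

```python
import os

# Illustrative only: derive a per-instance core range from the environment
# variables torchrun sets for every local worker.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
local_world_size = int(os.environ.get("LOCAL_WORLD_SIZE", "1"))
num_cores = os.cpu_count() or 1

cores_per_instance = num_cores // local_world_size
start = local_rank * cores_per_instance
end = start + cores_per_instance - 1

# A launcher like run_cpu would then hand this range to numactl, roughly:
#   numactl -C <start>-<end> python <workload> ...
print(f"instance {local_rank}/{local_world_size} -> cores {start}-{end}")
```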
I hit this when running evaluation with a small number of parallel games: `Ran 0 batches with an average size of -nan`. Is it possible that a ModelBatcher might...
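For reference, that `-nan` is just what averaging over zero batches looks like under IEEE float semantics; a toy reproduction (not minigo code):

```python
import numpy as np

# Averaging zero items divides 0.0 by 0.0, which is NaN in IEEE arithmetic
# (numpy mirrors the C++ float behavior behind the logged message).
num_batches = np.float64(0.0)
total_size = np.float64(0.0)

with np.errstate(invalid="ignore"):
    average_size = total_size / num_batches

print(f"Ran {int(num_batches)} batches with an average size of {average_size}")
# -> Ran 0 batches with an average size of nan
```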
Currently it is a magic number (0.55) in https://github.com/tensorflow/minigo/blob/master/ml_perf/eval_models.py#L47
@abhilash1910 Is XPU support for qlora still working? I tried to run it on a Linux Arc 770 system at home but got the following error: $ python qlora.py --model_name_or_path facebook/opt-350m...
This PR adds a `--client-only` flag to the MII benchmark, allowing the benchmark to skip `start_server` and `stop_server` when running with a backend such as vLLM. This flag provides the flexibility to start...
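A simplified sketch of how such a flag can gate the server lifecycle; the helper bodies and the `--endpoint` option are placeholders, not the actual benchmark code:

```python
import argparse

def start_server(args):
    print("starting serving backend ...")   # placeholder for the real launch logic

def stop_server(args):
    print("stopping serving backend ...")   # placeholder for the real teardown logic

def send_requests(args):
    print(f"driving benchmark load against {args.endpoint}")  # placeholder

parser = argparse.ArgumentParser()
parser.add_argument("--client-only", action="store_true",
                    help="skip start_server/stop_server and reuse an already running backend")
parser.add_argument("--endpoint", default="http://localhost:8000")  # hypothetical option
args = parser.parse_args()

# With --client-only the benchmark leaves server management to the user,
# which is what an externally launched backend such as vLLM needs.
if not args.client_only:
    start_server(args)
try:
    send_requests(args)
finally:
    if not args.client_only:
        stop_server(args)
```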
This PR adds a new client that can test the performance of an LLM serving endpoint that conforms to the OpenAI API. This gives the flexibility to start a server separately and benchmark that server...
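A minimal sketch of timing one request against an OpenAI-compatible endpoint; the URL, model name, and prompt below are placeholders:

```python
import time
import requests

BASE_URL = "http://localhost:8000/v1"    # wherever the separately started server listens
payload = {
    "model": "facebook/opt-350m",        # placeholder model name
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,
}

start = time.perf_counter()
resp = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=120)
latency = time.perf_counter() - start
resp.raise_for_status()

usage = resp.json().get("usage", {})
print(f"latency: {latency:.3f}s, completion tokens: {usage.get('completion_tokens')}")
```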
This issue acts as a tracker for Intel customer-support-related PRs. The purpose is to get an understanding of what each PR does and how important they are compared...
I'm wondering if we can take the ZenFlow finetuning example and extend it into a test bed for different DeepSpeed technologies. The ZenFlow finetuning example: https://github.com/deepspeedai/DeepSpeedExamples/tree/master/training/DeepSpeed-ZenFlow/finetuning The reason is...
**Describe the bug** `DeepSpeedZeroOptimizer_Stage3` and `SuperOffloadOptimizer_Stage3` share the same parameter list, which can easily cause divergence. **Details** In https://github.com/deepspeedai/DeepSpeed/blob/b7cd78f096016ae67a11ef6292eba28e0452b4e7/deepspeed/runtime/engine.py#L1846, the `DeepSpeedZeroOptimizer_Stage3` and `SuperOffloadOptimizer_Stage3` initializers share the same parameter list. This...
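A toy illustration (not DeepSpeed code) of why handing the same list object to two optimizer initializers is hazardous: any in-place mutation by one is visible to the other, so their state can silently diverge from what either expects.

```python
class ToyOptimizer:
    def __init__(self, param_groups):
        # Storing the caller's list directly keeps the alias alive.
        self.param_groups = param_groups

    def add_group(self, group):
        self.param_groups.append(group)

shared = [{"params": ["w0", "w1"]}]
opt_a = ToyOptimizer(shared)
opt_b = ToyOptimizer(shared)

opt_a.add_group({"params": ["w2"]})
print(len(opt_b.param_groups))  # 2 -- opt_b changed through opt_a's mutation

# A defensive fix is to copy at the constructor boundary, e.g.:
#   self.param_groups = list(param_groups)
```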