DeepSpeedExamples

Example models using DeepSpeed

Results: 274 DeepSpeedExamples issues (sorted by recently updated)

************************ [start] Initializing Reward Model [start] ************************
[2023-11-29 14:57:02,054] [INFO] [partition_parameters.py:347:__exit__] finished initializing model - num_params = 1306, num_elems = 39.25B
>Creating model from_config took 0.365234375 seconds
>Creating model from_config took...

Question: In the SFT training phase in dschat, I found that the function `create_dataset_split` in data_utils.py pads the samples to the maximum length. So why not dynamically pad to the...
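
For comparison, dynamic padding is usually done in a collate function that pads each batch only to its longest sample. Below is a minimal sketch, assuming the dataset yields per-sample `input_ids` lists; the function and variable names are illustrative and not the actual DeepSpeed-Chat code:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def dynamic_padding_collate(batch, pad_token_id):
    # Pad each batch only to the length of its longest sample,
    # instead of a fixed max_seq_len for the whole dataset.
    input_ids = [torch.tensor(s["input_ids"], dtype=torch.long) for s in batch]
    lengths = [len(ids) for ids in input_ids]
    padded = pad_sequence(input_ids, batch_first=True, padding_value=pad_token_id)
    attention_mask = torch.zeros_like(padded)
    for i, n in enumerate(lengths):
        attention_mask[i, :n] = 1  # mark real tokens, leave padding as 0
    return {"input_ids": padded, "attention_mask": attention_mask}

# Illustrative usage; `train_dataset` and `pad_id` are placeholders:
# from torch.utils.data import DataLoader
# loader = DataLoader(train_dataset, batch_size=8,
#                     collate_fn=lambda b: dynamic_padding_collate(b, pad_id))
```

The trade-off is that batch shapes then vary from step to step, which can reduce the benefit of optimizations that rely on fixed shapes.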

1. **model**: llama-2-7b-hf
2. **execute command**: `python rw_eval.py --model_name_or_path /data/llama-2-hf/llama-2-7b-hf/`
3. **GPU**: A6000 (48G)
4. **result**
   - **first result** ![image](https://github.com/microsoft/DeepSpeedExamples/assets/18341845/4bb006cf-7999-4870-99e0-ca39f8420042)
   - **second result** ![image](https://github.com/microsoft/DeepSpeedExamples/assets/18341845/5f3d7cf8-1627-4e78-9f73-5a65ac86831c)
5. Question: Why does the rw_eval.py script...

Hi, I successfully ran the ['cifar10_deepspeed.py'](https://www.deepspeed.ai/tutorials/cifar-10/) example on a single node (2x NVIDIA 3090). Now I want to run the same program on multiple nodes (2 nodes, each with two 3090s). I...
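
For reference, moving from one node to several with the `deepspeed` launcher is typically a matter of supplying a hostfile rather than changing the script. A sketch under the assumption of two nodes with two GPUs each and passwordless SSH between them; the hostnames and the script arguments are placeholders and depend on the example version:

```
# hostfile: one line per node, slots = GPUs on that node (hostnames are placeholders)
worker-1 slots=2
worker-2 slots=2

# launch from one node; the launcher connects to the others over SSH
deepspeed --hostfile=hostfile cifar10_deepspeed.py --deepspeed
```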

Enable overlap of backward computation and gradient all-reduce. This produces a 1.05x end-to-end speedup in SFT training with my settings. See also https://github.com/microsoft/DeepSpeed/pull/4887.
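
For context, DeepSpeed's ZeRO configuration exposes an `overlap_comm` flag that overlaps gradient reduction with the backward pass. A minimal sketch of enabling it in a config dict; every value other than `overlap_comm` is an illustrative placeholder, not a setting taken from this PR:

```python
import deepspeed  # assumes deepspeed is installed

# Illustrative ZeRO config; `overlap_comm` is the relevant flag here.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,  # overlap gradient all-reduce with backward compute
    },
    "bf16": {"enabled": True},
}

# `model` and `optimizer` are placeholders for the actual SFT training objects:
# engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
```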

Hi: After training the RLHF model (actor: pythia-6.9b, reward model: pythia-410M), I evaluated the saved checkpoint with https://github.com/EleutherAI/lm-evaluation-harness. However, it seems that some weights are missing. Here is the log: Some weights of GPTNeoXForCausalLM...

Hi team, I want to use async_pipeline and found that mii.async_pipeline is not exposed by the deepspeed-mii package. Can you add it here: https://github.com/microsoft/DeepSpeed-MII/blob/main/mii/__init__.py#L6? Thanks
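
If the maintainers expose it, the change would presumably be a one-line export. A minimal sketch of what that could look like, assuming `async_pipeline` is defined in `mii.api` next to `pipeline`; the source module is an assumption, not confirmed from the MII source:

```python
# mii/__init__.py (sketch of the requested export; the source module is an assumption)
from .api import pipeline, async_pipeline
```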

Currently, DeepSpeed-Chat saves tokenized tensors directly to disk, which consumes hundreds of GB of storage. Each string is converted to **max_seq_len**-long **attention_mask and input_ids** tensors, stored as int32...
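
For a sense of scale, padding every sample to a fixed length multiplies the footprint. A rough, illustrative calculation; the sample count and sequence length below are placeholders, not DeepSpeed-Chat defaults:

```python
# Rough, illustrative storage estimate for fixed-length int32 tensors.
num_samples = 10_000_000   # placeholder dataset size
max_seq_len = 512          # placeholder sequence length
bytes_per_token = 4        # int32
tensors_per_sample = 2     # input_ids + attention_mask

total_bytes = num_samples * max_seq_len * bytes_per_token * tensors_per_sample
print(f"{total_bytes / 1e9:.1f} GB")  # ~41.0 GB for these placeholder numbers
```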

My GPU machines do not have openmpi or any other launcher installed. I want to use the original torch.distributed to train on multiple nodes, but the error is always like this: ``` [2023-04-27...
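
DeepSpeed can also initialize from the environment variables that the stock torch launcher (torchrun) sets, without openmpi or pdsh. A minimal sketch, assuming the training script is started with torchrun on each node; the launch arguments shown in the comment are placeholders:

```python
import os
import deepspeed
import torch

# Launched with the stock torch launcher on each node, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=2 --node_rank=<0|1> \
#            --master_addr=<node0-ip> --master_port=29500 train.py
# torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT,
# which deepspeed.init_distributed() picks up, so no mpirun/pdsh is needed.
deepspeed.init_distributed(dist_backend="nccl")

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
```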

deepspeed chat
system

Hi, when I use ZeRO-3 to train a model, I get `Invalidate trace cache @ step 0: expected module 0, but got module 6`. Does anyone know the reason?