DeepSpeedExamples
Example models using DeepSpeed
Hello, thanks for your wonderful work. I have a question about the training stage. I am using GPT2 (117M) to fine-tune on my own datasets; the GPT2 model size is about...
- I use offload, gradient_checkpointing, and zero_stage 3, and still get an OOM result
- I tested it on 8*A100 80G GPUs and see about 55G of GPU memory consumption via "nvidia-smi"
- my...
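For context, a minimal sketch of the setup this reporter describes: a ZeRO stage-3 DeepSpeed config with CPU offload for parameters and optimizer state, plus Hugging Face gradient checkpointing. The config keys are standard DeepSpeed options; the batch sizes and learning rate are illustrative assumptions, not the reporter's values.

```python
import deepspeed
from transformers import AutoModelForCausalLM

# Standard DeepSpeed config keys; the numbers are illustrative assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # shrink this first when chasing OOM
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.gradient_checkpointing_enable()      # trade recompute for activation memory

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```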
**May I ask whether the maintainers have considered setting up a communication group?** I have temporarily set up a communication group where everyone can talk together. I hope that the maintainers...
I can't find "--deepspeed" in [main.py](https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/main.py), but it does appear in the shell script (training/step2_reward_model_finetuning/training_scripts/single_gpu/run_350m.sh):
```
deepspeed --num_gpus 1 main.py --model_name_or_path facebook/opt-350m \
   --num_padding_at_beginning 1 --gradient_accumulation_steps 2 --zero_stage $ZERO_STAGE \
   --only_optimize_lora \
   --deepspeed --output_dir $OUTPUT...
```
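A likely explanation, sketched below: DeepSpeed training scripts typically do not declare `--deepspeed` themselves; they call `deepspeed.add_config_arguments(parser)`, which injects `--deepspeed` and `--deepspeed_config` into the parser. This is a sketch of that pattern, not a copy of the repo's main.py.

```python
import argparse
import deepspeed

parser = argparse.ArgumentParser(description="step 2 reward model finetuning")
parser.add_argument("--model_name_or_path", type=str, required=True)
# ... the script's own arguments ...

# DeepSpeed injects --deepspeed and --deepspeed_config here, which is why
# grepping main.py for an explicit add_argument("--deepspeed") finds nothing.
parser = deepspeed.add_config_arguments(parser)
args = parser.parse_args()
```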
I was running the Stage 2 reward model training with the multi-node setup and experienced the following error:
```
... truncated
0%| | 0/2 [00:00
```
Hi, I have encountered the **TypeError: LlamaModel.forward() got an unexpected keyword argument 'head_mask'** error when training the LLaMA-7B model in step 2 reward model training. I was wondering if the...
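A hedged workaround sketch (not the repo's actual fix): filter out keyword arguments the wrapped model's `forward()` does not accept, so a stray `head_mask` coming from an OPT-oriented code path is dropped before it reaches `LlamaModel`. The helper name is hypothetical.

```python
import inspect

def call_model_safely(model, **kwargs):
    """Hypothetical helper: drop kwargs that model.forward() does not declare."""
    accepted = inspect.signature(model.forward).parameters
    filtered = {k: v for k, v in kwargs.items() if k in accepted}
    return model(**filtered)

# LlamaModel.forward() has no `head_mask` parameter, so it would be dropped:
# outputs = call_model_safely(llama_model, input_ids=input_ids, head_mask=None)
```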
Attempting to reproduce the 1.6B step 1 SFT result using the default single_node script configuration resulted in slow training on 4 V100 32G GPUs. It took 6 hours to complete,...
I ran the 1.6-billion-parameter demo; it took 1:46:27 for the first step and 2:12:56 for the second step, which is much slower than the reference numbers below. Actor: OPT-1.3B, Reward: OPT-350M | 2900 Sec | ...
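To narrow down where reports like the two above lose time, a minimal per-step timing sketch, assuming `model` is an initialized DeepSpeed engine and `batch` is a prepared input dict that yields a loss:

```python
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
loss = model(**batch).loss   # assumes an HF-style model that returns .loss
model.backward(loss)         # DeepSpeed engine API
model.step()
end.record()

torch.cuda.synchronize()     # wait for the recorded events to complete
print(f"step time: {start.elapsed_time(end) / 1000:.2f} s")
```

If a single step already takes minutes on a V100, the usual suspects include CPU offload traffic over PCIe, fp32 fallback (V100 has no bf16 support), or a larger effective batch size than intended.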
[BUG] Step 1 RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
```
bash training_scripts/single_node/run_1.3b_lora.sh
Traceback (most recent call last):
  File "main.py", line 328, in <module>
    main()
  File "main.py", line 301, in main
    model.backward(loss)
  File "/home/kemove/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
...
```
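CUBLAS_STATUS_ALLOC_FAILED from `cublasCreate` usually means cuBLAS could not allocate GPU memory for its handle, i.e. the device was already nearly full when the failing call ran. A hedged diagnostic sketch to check headroom just before the step that crashes:

```python
import torch

# Free/total memory on the current device, in bytes.
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free: {free_bytes / 2**30:.1f} GiB of {total_bytes / 2**30:.1f} GiB")

# If free memory is near zero here, the usual levers (illustrative, not the
# script's confirmed fix) are a smaller per-device batch size or sequence
# length in run_1.3b_lora.sh, a higher ZeRO stage, or enabling offload.
```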