
Example models using DeepSpeed

323 DeepSpeedExamples issues

Hello, thanks for your wonderful work. I have a question about the training stage. I am fine-tuning GPT-2 (117M) on my own datasets; the GPT-2 model size is about...
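To put that model size in perspective, here is a back-of-the-envelope estimate (my own numbers, not from the issue, assuming the roughly 16-bytes-per-parameter rule for mixed-precision Adam described in the ZeRO paper, and ignoring activation memory):

```python
# Rough memory estimate for fine-tuning GPT-2 (117M parameters) with
# mixed-precision Adam. Assumes ~16 bytes of model states per parameter:
# 2 (fp16 weights) + 2 (fp16 grads) + 12 (fp32 master weights, Adam
# momentum and variance). Activations are extra and not counted here.

def model_state_bytes(num_params: int, bytes_per_param: int = 16) -> int:
    """Total bytes of model states (weights + grads + optimizer)."""
    return num_params * bytes_per_param

params = 117_000_000
total_gb = model_state_bytes(params) / 1024**3
print(f"fp16 weights alone: {params * 2 / 1024**3:.2f} GiB")
print(f"model states (weights + grads + Adam): {total_gb:.2f} GiB")
```

So the checkpoint on disk is small, but the full training state is roughly an order of magnitude larger than the fp16 weights alone.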

- I use offload, gradient_checkpointing, and zero_stage 3, and still get an OOM result - I tested on 8×A100 80G and see about 55G of GPU memory consumption via "nvidia-smi" - my...
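For reference, a minimal ZeRO stage-3 config sketch with the settings this report mentions (field names follow the DeepSpeed config JSON schema; the batch size and threshold values here are illustrative assumptions, not the reporter's actual config):

```json
{
  "train_micro_batch_size_per_gpu": 4,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_param":     { "device": "cpu", "pin_memory": true },
    "offload_optimizer": { "device": "cpu", "pin_memory": true }
  },
  "activation_checkpointing": { "partition_activations": true }
}
```

Note that even with both offload sections enabled, temporary gathered parameters and activations still live on the GPU, so OOM under ZeRO-3 usually points at batch size, sequence length, or activation memory rather than the partitioned model states.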

**May I ask whether the official team has considered setting up a communication group?** I temporarily set up a communication group so that everyone can communicate together. I hope that the official...

question
deespeed chat

I can't find "--deepspeed" in [main.py](https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/main.py), but it does appear in the shell script (training/step2_reward_model_finetuning/training_scripts/single_gpu/run_350m.sh):

```
deepspeed --num_gpus 1 main.py \
   --model_name_or_path facebook/opt-350m \
   --num_padding_at_beginning 1 \
   --gradient_accumulation_steps 2 \
   --zero_stage $ZERO_STAGE \
   --only_optimize_lora \
   --deepspeed \
   --output_dir $OUTPUT...
```
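For context (my understanding, worth verifying against the DeepSpeed source): such launcher flags are often not declared in a script's own `add_argument()` calls but injected by a library helper such as `deepspeed.add_config_arguments(parser)`. A stdlib-only sketch of that pattern, with the helper name and flags mirrored for illustration:

```python
import argparse

def add_config_arguments(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Mimics how a library helper (e.g. deepspeed.add_config_arguments)
    registers flags on a script's parser, so the flag never appears
    literally in the script's own add_argument() calls."""
    group = parser.add_argument_group("launcher")
    group.add_argument("--deepspeed", action="store_true",
                       help="enable the DeepSpeed engine")
    group.add_argument("--deepspeed_config", type=str, default=None,
                       help="path to a DeepSpeed JSON config")
    return parser

parser = argparse.ArgumentParser(description="main.py sketch")
parser.add_argument("--model_name_or_path", type=str)
parser = add_config_arguments(parser)  # this is where --deepspeed comes from

args = parser.parse_args(["--deepspeed",
                          "--model_name_or_path", "facebook/opt-350m"])
print(args.deepspeed)
```

So grepping main.py for the literal string will come up empty even though the flag parses fine.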

I was running the Stage 2 reward model training with the multinode setup and I experienced the following error ``` ... truncated 0%| | 0/2 [00:00

bug
deespeed chat

Hi, I encountered the **TypeError: LlamaModel.forward() got an unexpected keyword argument 'head_mask'** error when training the LLaMA-7B model in Step 2 reward model training. I was wondering if the...

deespeed chat
llama

Attempting to reproduce the 1.6B Step 1 SFT results using the default single_node script configuration resulted in slow training on 4 V100 32G GPUs: it took 6 hours to complete,...

deespeed chat

I ran the 1.6-billion-parameter demo; it took 1:46:27 for the first step and 2:12:56 for the second step, which is much slower than the reference below. Actor: OPT-1.3B Reward: OPT-350M | 2900 Sec |...

deespeed chat

Running bash training_scripts/single_node/run_1.3b_lora.sh fails with:

```
Traceback (most recent call last):
  File "main.py", line 328, in <module>
    main()
  File "main.py", line 301, in main
    model.backward(loss)
  File "/home/kemove/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)...
```

bug
deespeed chat