DeepSpeedExamples
Example models using DeepSpeed
Hello, thanks for your wonderful work. I have a question about the training stage. I am using GPT2 (117M) to fine-tune on my own datasets; the GPT2 model size is about...
- I use offload, gradient_checkpointing, and zero_stage 3, and still get an OOM result
- I tested it on 8*A100 80G GPUs and see about 55G of GPU memory consumption via "nvidia-smi"
- my...
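For context, a minimal sketch of the setup this reporter describes: a ZeRO stage-3 DeepSpeed config with CPU offload for parameters and optimizer state, plus Hugging Face gradient checkpointing. The config keys are standard DeepSpeed options; the batch sizes and learning rate are illustrative assumptions, not the reporter's values.

```python
import deepspeed
from transformers import AutoModelForCausalLM

# Standard DeepSpeed config keys; the numbers are illustrative assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # shrink this first when chasing OOM
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.gradient_checkpointing_enable()      # trade recompute for activation memory

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```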
**May I ask whether the maintainers have considered setting up a communication group?** I have temporarily set up a communication group where everyone can talk together. I hope that the maintainers...
I can't find "--deepspeed" in [main.py](https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/main.py), but it does appear in the shell script (training/step2_reward_model_finetuning/training_scripts/single_gpu/run_350m.sh):
```
deepspeed --num_gpus 1 main.py --model_name_or_path facebook/opt-350m \
   --num_padding_at_beginning 1 --gradient_accumulation_steps 2 --zero_stage $ZERO_STAGE \
   --only_optimize_lora \
   --deepspeed --output_dir $OUTPUT...
```
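A likely explanation, sketched below: DeepSpeed training scripts typically do not declare `--deepspeed` themselves; they call `deepspeed.add_config_arguments(parser)`, which injects `--deepspeed` and `--deepspeed_config` into the parser. This is a sketch of that pattern, not a copy of the repo's main.py.

```python
import argparse
import deepspeed

parser = argparse.ArgumentParser(description="step 2 reward model finetuning")
parser.add_argument("--model_name_or_path", type=str, required=True)
# ... the script's own arguments ...

# DeepSpeed injects --deepspeed and --deepspeed_config here, which is why
# grepping main.py for an explicit add_argument("--deepspeed") finds nothing.
parser = deepspeed.add_config_arguments(parser)
args = parser.parse_args()
```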
I was running the Stage 2 reward model training with the multi-node setup and experienced the following error:
```
... truncated
0%| | 0/2 [00:00
```
Hi, I have encountered the **TypeError: LlamaModel.forward() got an unexpected keyword argument 'head_mask'** error when training the LLaMA-7B model in step 2 reward model training. I was wondering if the...
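A hedged workaround sketch (not the repo's actual fix): filter out keyword arguments the wrapped model's `forward()` does not accept, so a stray `head_mask` coming from an OPT-oriented code path is dropped before it reaches `LlamaModel`. The helper name is hypothetical.

```python
import inspect

def call_model_safely(model, **kwargs):
    """Hypothetical helper: drop kwargs that model.forward() does not declare."""
    accepted = inspect.signature(model.forward).parameters
    filtered = {k: v for k, v in kwargs.items() if k in accepted}
    return model(**filtered)

# LlamaModel.forward() has no `head_mask` parameter, so it would be dropped:
# outputs = call_model_safely(llama_model, input_ids=input_ids, head_mask=None)
```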
Attempting to reproduce the 1.6B step 1 SFT result using the default single_node script configuration resulted in slow training on 4 V100 32G GPUs. It took 6 hours to complete,...
I ran the 1.6-billion-parameter demo; it took 1:46:27 for the first step and 2:12:56 for the second step, which is much slower than the reference numbers below. Actor: OPT-1.3B, Reward: OPT-350M | 2900 Sec | ...
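To narrow down where reports like the two above lose time, a minimal per-step timing sketch, assuming `model` is an initialized DeepSpeed engine and `batch` is a prepared input dict that yields a loss:

```python
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
loss = model(**batch).loss   # assumes an HF-style model that returns .loss
model.backward(loss)         # DeepSpeed engine API
model.step()
end.record()

torch.cuda.synchronize()     # wait for the recorded events to complete
print(f"step time: {start.elapsed_time(end) / 1000:.2f} s")
```

If a single step already takes minutes on a V100, the usual suspects include CPU offload traffic over PCIe, fp32 fallback (V100 has no bf16 support), or a larger effective batch size than intended.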
[BUG] Step 1 RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
```
bash training_scripts/single_node/run_1.3b_lora.sh
Traceback (most recent call last):
  File "main.py", line 328, in <module>
    main()
  File "main.py", line 301, in main
    model.backward(loss)
  File "/home/kemove/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
...
```
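CUBLAS_STATUS_ALLOC_FAILED from `cublasCreate` usually means cuBLAS could not allocate GPU memory for its handle, i.e. the device was already nearly full when the failing call ran. A hedged diagnostic sketch to check headroom just before the step that crashes:

```python
import torch

# Free/total memory on the current device, in bytes.
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free: {free_bytes / 2**30:.1f} GiB of {total_bytes / 2**30:.1f} GiB")

# If free memory is near zero here, the usual levers (illustrative, not the
# script's confirmed fix) are a smaller per-device batch size or sequence
# length in run_1.3b_lora.sh, a higher ZeRO stage, or enabling offload.
```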