DeepSpeedExamples
Example models using DeepSpeed
When I run the demo (step3_rlhf_finetuning/training_scripts/opt/single_node/run_1.3b.sh) without any change, the reward does not increase. Is this normal? I would appreciate it if anyone can provide a normal reward...
When training the PPO model, I turned on gradient_checkpointing_enable. If you want to calculate the ptx loss, the actor will forward twice. In your code, these two losses are executed...
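For context, the two-forward-pass pattern being asked about looks roughly like the sketch below. This is a minimal sketch, assuming the actor has already been wrapped by `deepspeed.initialize` and gradient checkpointing is enabled on the underlying Hugging Face model; `compute_policy_loss` and the `ptx_coef` default are illustrative placeholders, not the exact identifiers used in DeepSpeed-Chat.

```python
def train_actor_step(actor_engine, exp_batch, ptx_batch, ptx_coef=1.0):
    # Forward/backward pass 1: PPO policy loss on the generated experience.
    out = actor_engine(exp_batch["input_ids"],
                       attention_mask=exp_batch["attention_mask"])
    policy_loss = compute_policy_loss(out.logits, exp_batch)  # placeholder
    actor_engine.backward(policy_loss)

    # Forward/backward pass 2: pretraining (ptx) loss on supervised data.
    # With gradient checkpointing on, activations are recomputed during each
    # of these backward passes, which is where the extra cost comes from.
    ptx_out = actor_engine(ptx_batch["input_ids"],
                           attention_mask=ptx_batch["attention_mask"],
                           labels=ptx_batch["labels"])
    actor_engine.backward(ptx_coef * ptx_out.loss)

    actor_engine.step()
    return policy_loss.detach(), ptx_out.loss.detach()
```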
Hi, I read the DeepSpeed docs and have the following confusion: (1) What is the difference between these methods when running inference on LLMs? a. deepspeed.initialize and then write code to generate...
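For reference, the two entry points differ roughly as sketched below. This is a minimal sketch assuming a Hugging Face causal LM on a single GPU; the model name and argument values are illustrative (newer DeepSpeed releases spell the parallelism argument `tensor_parallel={"tp_size": ...}` instead of `mp_size`).

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b",
                                             torch_dtype=torch.float16)

# (a) deepspeed.initialize builds a *training* engine (ZeRO partitioning,
#     optimizer state, etc.); you can still call generate on engine.module,
#     but no optimized inference kernels are injected.
# engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)

# (b) deepspeed.init_inference replaces supported modules with fused inference
#     kernels and can shard the model across GPUs with tensor parallelism.
ds_model = deepspeed.init_inference(model,
                                    mp_size=1,
                                    dtype=torch.float16,
                                    replace_with_kernel_inject=True)

inputs = tok("DeepSpeed is", return_tensors="pt").to("cuda")
print(tok.decode(ds_model.module.generate(**inputs, max_new_tokens=32)[0]))
```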
Fixed two issues: * Padding tokens should be ignored in training; their labels should be set to `-100` so that `CrossEntropyLoss` ignores them. * Append the correct `eos_token` to the response text....
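A minimal sketch of what the two fixes amount to, assuming a Hugging Face tokenizer; the prompt/response text and sequence length below are illustrative, not the PR's actual code.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

prompt = "Human: What is the capital of France? Assistant:"
response = " Paris."

# Fix 2: terminate the response with the tokenizer's real eos_token so the
# model learns when to stop generating.
text = prompt + response + tok.eos_token

enc = tok(text, max_length=32, padding="max_length",
          truncation=True, return_tensors="pt")

# Fix 1: padded positions get label -100, which CrossEntropyLoss ignores.
labels = enc["input_ids"].clone()
labels[enc["attention_mask"] == 0] = -100
```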
I am trying to run the `training/HelloDeepSpeed` example in a fresh Python virtualenv but get the error below. I installed the dependencies using https://github.com/microsoft/DeepSpeedExamples/blob/master/training/HelloDeepSpeed/requirements.txt ``` Traceback (most recent call last): File "/media/home/hemant/src/DeepSpeedExamples/training/HelloDeepSpeed/train_bert.py",...
I benchmarked MII with the run_example.sh script located at "DeepSpeedExamples/benchmarks/inference/mii" in the repository, but it stalled as follows:  Then after a few minutes it...
**Problem:** I have a previously trained model state dict file, e.g., a reward model saved as `PATH/pytorch_model.bin`. When I try to reload it for further training with the ZeRO-3 optimizer, an error...
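One commonly suggested workaround, sketched below under the assumption that the checkpoint is a plain (non-ZeRO) state dict: load it into the model on CPU *before* calling `deepspeed.initialize`, so that ZeRO-3 partitions the already restored weights. The model name, path, and config values here are illustrative.

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {"stage": 3},
    "fp16": {"enabled": True},
}

# Restore the flat checkpoint first; after deepspeed.initialize the parameters
# are partitioned across ranks and a plain load_state_dict no longer matches
# the sharded layout.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
state_dict = torch.load("PATH/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict, strict=False)  # strict=False if head names differ

engine, _, _, _ = deepspeed.initialize(model=model,
                                       model_parameters=model.parameters(),
                                       config=ds_config)
```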
Might be a bug in the hybrid engine: in Step 3, the generation sequence is wrong when the hybrid engine is enabled.
When using the hybrid engine, the output sequence is always 'a a a a ', while if I disable the hybrid engine the output sequence is correct. Here is my log...
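For anyone trying to reproduce or work around this: the hybrid engine is controlled through the `hybrid_engine` block of the DeepSpeed config (in DeepSpeed-Chat it is populated from a command-line flag such as `--enable_hybrid_engine`). Below is a minimal sketch of that block with illustrative values; flipping `"enabled"` to `False` falls back to the regular generation path, which is a quick way to confirm whether the garbled output comes from the hybrid engine.

```python
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {"stage": 3},
    "hybrid_engine": {
        "enabled": True,              # set False to bypass the hybrid engine
        "max_out_tokens": 512,
        "inference_tp_size": 1,
        "release_inference_cache": False,
        "pin_parameters": True,
        "tp_gather_partition_size": 8,
    },
}
```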
My training environment is a Docker image pulled from `deepspeed/deepspeed:v072_torch112_cu117`, and I run it with `docker run -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --network train-net --name fuyx-work -v...
**Repo link**: [Stable-Diffusion inference](https://github.com/microsoft/DeepSpeedExamples/tree/master/inference/huggingface/stable-diffusion) **Command used to run**: ```deepspeed --num_gpus 1 test-stable-diffusion.py``` **Env**: ``` RTX 3090 deepspeed 0.12.6 torch 1.13.1 diffusers 0.26.1 triton 2.0.0.dev20221202 ``` **Traceback information**:  It...
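For context, the example boils down to roughly the pattern below. This is a minimal sketch assuming a diffusers `StableDiffusionPipeline`; the model id and `init_inference` arguments are illustrative placeholders, not the exact contents of test-stable-diffusion.py, and supported kernel injection may depend on the torch/diffusers/triton versions listed above.

```python
import torch
import deepspeed
from diffusers import StableDiffusionPipeline

# Load the pipeline in fp16 on the GPU that `deepspeed --num_gpus 1` assigns.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
).to("cuda")

# init_inference injects DeepSpeed's fused kernels into the pipeline's
# submodules (UNet, VAE, text encoder) when kernel injection is enabled.
pipe = deepspeed.init_inference(pipe,
                                dtype=torch.float16,
                                replace_with_kernel_inject=True)

image = pipe("a photo of an astronaut riding a horse on mars").images[0]
image.save("astronaut.png")
```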