DeepSpeedExamples
Example models using DeepSpeed
I was training a GPT-Neo (2.8B) model using the step1 script on 4 A10G GPUs. I used the default parameters in the example script, but zero_stage=2 is consuming more GPU...
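If zero_stage=2 exhausts memory on four A10G GPUs, one common mitigation is ZeRO stage 3 with CPU offload. The fragment below is a minimal sketch using standard DeepSpeed config keys; the batch sizes and fp16 setting are illustrative assumptions, not the step1 script's defaults.

```python
# Sketch of a ZeRO-3 + CPU offload config fragment (standard DeepSpeed keys).
# Batch sizes and precision below are illustrative assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # partition params, grads, and optimizer states
        "offload_param": {"device": "cpu"},      # keep idle parameters on CPU
        "offload_optimizer": {"device": "cpu"},  # keep optimizer states on CPU
        "stage3_param_persistence_threshold": 10000,
    },
}
```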
I want to save an intermediate ckpt during training after a specific number of steps, but I keep hitting a job-hang issue; how can I get it fixed? Torch 1.14 + CUDA 12.0, Transformer Engine...
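A minimal sketch of periodic checkpointing with a DeepSpeed engine is shown below. The toy model, SAVE_INTERVAL, and OUTPUT_DIR are illustrative names, not arguments of the training scripts. The key point for hangs is that save_checkpoint is a collective call, so every rank must reach it.

```python
# Sketch: save a checkpoint every SAVE_INTERVAL steps with a DeepSpeed engine.
# The toy model, SAVE_INTERVAL, and OUTPUT_DIR are illustrative assumptions.
import torch
import deepspeed

model = torch.nn.Linear(16, 1)  # stand-in for the real model
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config={
        "train_micro_batch_size_per_gpu": 4,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    },
)

SAVE_INTERVAL = 500
OUTPUT_DIR = "./ckpts"

for step in range(2000):
    x = torch.randn(4, 16, device=engine.device)
    loss = engine(x).pow(2).mean()
    engine.backward(loss)
    engine.step()
    # save_checkpoint is a collective call: every rank must execute it,
    # so do not guard it with `if rank == 0` -- that is a classic hang.
    if (step + 1) % SAVE_INTERVAL == 0:
        engine.save_checkpoint(OUTPUT_DIR, tag=f"step_{step + 1}")
```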
When I run step2 using 'bash training_scripts/single_node/run_350m.sh', I get this error: ```[2023-04-16 21:36:09,031] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]} [2023-04-16 21:36:09,031] [INFO] [launch.py:235:main] nnodes=1,...
I want to train bloom_350m on a Chinese dataset, so I run run_350m.sh and change model_name_or_path. But the loss is NaN; how should I solve this? Could the "num_padding_at_beginning" argument be causing it?
https://github.com/microsoft/DeepSpeedExamples/blob/7eac9f699442fbc3f96b2dbdb2432d3847406a47/applications/DeepSpeed-Chat/training/utils/data/data_utils.py#L126 For stage 1 SFT, the labels do not use IGNORE_INDEX for the prompt tokens; is this correct?
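For reference, the sketch below shows what masking the prompt with IGNORE_INDEX would look like, which is the alternative the question asks about, not what the linked data_utils.py line currently does. With labels set to -100 on the prompt positions, CrossEntropyLoss trains only on the response tokens. The helper name and token ids are illustrative.

```python
# Sketch: mask prompt tokens in the SFT labels so loss is computed only on
# the response. build_labels and the example ids are illustrative assumptions.
import torch

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """input_ids: (seq_len,) token ids for prompt + response."""
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX  # no loss on the prompt positions
    return labels

# Example: 5 prompt tokens followed by 3 response tokens.
ids = torch.tensor([101, 7592, 2088, 2003, 102, 3000, 4000, 102])
print(build_labels(ids, prompt_len=5))
# tensor([-100, -100, -100, -100, -100, 3000, 4000,  102])
```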
(gh_deepspeed) ub2004@ub2004-B85M-A0:~/llm_dev/DeepSpeedExamples/training/data_efficiency/gpt_finetuning$ python -m torch.distributed.launch --nproc_per_node=1 --master_port 12346 run_clm_no_trainer.py --random_ltd --dataset_name ptb_text_only --dataset_config_name penn_treebank --model_name_or_path gpt2 --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --local_rank 2 --num_train_epochs 2 --deepspeed_config config/ds_config_gpt_base_random_ltd.json --deepspeed --seed 1234 --num_warmup_steps...
Just a simple question: the chat interface looks really nice. I wonder which libraries are used for it? I found no clues in the README.md or in the code. So...
Fix #337