DeepSpeedExamples
Example models using DeepSpeed
I was training a GPT-Neo (2.8B) model using the step1 script on 4 A10G GPUs. I used the default parameters in the example script, but zero_stage=2 is consuming more GPU...
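If zero_stage=2 exhausts memory on four A10G GPUs, one common mitigation is ZeRO stage 3 with CPU offload. The fragment below is a minimal sketch using standard DeepSpeed config keys; the batch sizes and fp16 setting are illustrative assumptions, not the step1 script's defaults.

```python
# Sketch of a ZeRO-3 + CPU offload config fragment (standard DeepSpeed keys).
# Batch sizes and precision below are illustrative assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # partition params, grads, and optimizer states
        "offload_param": {"device": "cpu"},      # keep idle parameters on CPU
        "offload_optimizer": {"device": "cpu"},  # keep optimizer states on CPU
        "stage3_param_persistence_threshold": 10000,
    },
}
```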
I want to save an intermediate ckpt during training after a specific number of steps, but I keep hitting a job-hang issue; how can I get it fixed? Torch 1.14 + CUDA 12.0, Transformer Engine...
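A minimal sketch of periodic checkpointing with a DeepSpeed engine is shown below. The toy model, SAVE_INTERVAL, and OUTPUT_DIR are illustrative names, not arguments of the training scripts. The key point for hangs is that save_checkpoint is a collective call, so every rank must reach it.

```python
# Sketch: save a checkpoint every SAVE_INTERVAL steps with a DeepSpeed engine.
# The toy model, SAVE_INTERVAL, and OUTPUT_DIR are illustrative assumptions.
import torch
import deepspeed

model = torch.nn.Linear(16, 1)  # stand-in for the real model
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config={
        "train_micro_batch_size_per_gpu": 4,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    },
)

SAVE_INTERVAL = 500
OUTPUT_DIR = "./ckpts"

for step in range(2000):
    x = torch.randn(4, 16, device=engine.device)
    loss = engine(x).pow(2).mean()
    engine.backward(loss)
    engine.step()
    # save_checkpoint is a collective call: every rank must execute it,
    # so do not guard it with `if rank == 0` -- that is a classic hang.
    if (step + 1) % SAVE_INTERVAL == 0:
        engine.save_checkpoint(OUTPUT_DIR, tag=f"step_{step + 1}")
```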
When I run step2 using 'bash training_scripts/single_node/run_350m.sh', I get this error: ```[2023-04-16 21:36:09,031] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]} [2023-04-16 21:36:09,031] [INFO] [launch.py:235:main] nnodes=1,...
I want to train bloom_350m on a Chinese dataset, so I run run_350m.sh and change model_name_or_path. But the loss is NaN; how should I solve this? Could the "num_padding_at_beginning" argument be causing it?
https://github.com/microsoft/DeepSpeedExamples/blob/7eac9f699442fbc3f96b2dbdb2432d3847406a47/applications/DeepSpeed-Chat/training/utils/data/data_utils.py#L126 For stage 1 SFT, the labels do not use IGNORE_INDEX for the prompt tokens; is this correct?
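For reference, the sketch below shows what masking the prompt with IGNORE_INDEX would look like, which is the alternative the question asks about, not what the linked data_utils.py line currently does. With labels set to -100 on the prompt positions, CrossEntropyLoss trains only on the response tokens. The helper name and token ids are illustrative.

```python
# Sketch: mask prompt tokens in the SFT labels so loss is computed only on
# the response. build_labels and the example ids are illustrative assumptions.
import torch

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """input_ids: (seq_len,) token ids for prompt + response."""
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX  # no loss on the prompt positions
    return labels

# Example: 5 prompt tokens followed by 3 response tokens.
ids = torch.tensor([101, 7592, 2088, 2003, 102, 3000, 4000, 102])
print(build_labels(ids, prompt_len=5))
# tensor([-100, -100, -100, -100, -100, 3000, 4000,  102])
```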
(gh_deepspeed) ub2004@ub2004-B85M-A0:~/llm_dev/DeepSpeedExamples/training/data_efficiency/gpt_finetuning$ python -m torch.distributed.launch --nproc_per_node=1 --master_port 12346 run_clm_no_trainer.py --random_ltd --dataset_name ptb_text_only --dataset_config_name penn_treebank --model_name_or_path gpt2 --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --local_rank 2 --num_train_epochs 2 --deepspeed_config config/ds_config_gpt_base_random_ltd.json --deepspeed --seed 1234 --num_warmup_steps...
Just a simple question: the chat interface looks really nice. I wonder which libraries are used for it? I found no clues in the README.md or in the code. So...
Fix #337