
How to diagnose the cause of frozen training.

Open Santosh-Gupta opened this issue 4 years ago • 6 comments

I was running a training job when it seemed to freeze in the middle of an epoch. I don't see any other issues filed for this, so it could be related to my code or my system, but I'm trying to figure out how to pinpoint the cause.

This is what the output looked like before it froze

  2%|██▍ | 114000/6629075 [22:10:33<1283:00:44, 1.41it/s]
[timer.py:154:stop] 0/12
 | average loss: 6.063703125
 | average loss: 6.073753125
 | average loss: 6.07337109375
  2%|██▍ | 114788/6629075 [22:19:44<1320:01:46, 1.37it/s]

After it was stuck like this for a few days, I pressed Ctrl-C and got this

Killing subprocess 26589
Killing subprocess 26590
Killing subprocess 26591
Main process received SIGINT, exiting
Traceback (most recent call last):
  File "/opt/conda/bin/deepspeed", line 6, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/runner.py", line 360, in main
    result.wait()
  File "/opt/conda/lib/python3.8/subprocess.py", line 1079, in wait
    return self._wait(timeout=timeout)
  File "/opt/conda/lib/python3.8/subprocess.py", line 1804, in _wait
    (pid, sts) = self._try_wait(0)
  File "/opt/conda/lib/python3.8/subprocess.py", line 1762, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt

My guess is that maybe there was an issue with the data pipeline.

Santosh-Gupta avatar Jul 06 '21 15:07 Santosh-Gupta

@Santosh-Gupta, can you share more details on this, such as the model and command line?

tjruwase avatar Jul 06 '21 16:07 tjruwase

It's a transformer model with some custom heads. Maybe there's some loss issue I'm overlooking?

The command line is just `bash deepspeed/deepspeed_ss.sh 4`

And the file is

NGPU_PER_NODE=$1
MODEL_FILE=$2
SQUAD_DIR=$3
OUTPUT_DIR=$4
LR=${5:-0.00003}
SEED=${6:-12345}
MASTER_PORT=${7:-29500}
DROPOUT=${8:-0.1}
echo "lr is ${LR}"
echo "seed is $SEED"
echo "master port is $MASTER_PORT"
echo "dropout is ${DROPOUT}"

# Force deepspeed to run with only local node
NUM_NODES=1
HOSTFILE=/dev/null

NGPU=$((NGPU_PER_NODE*NUM_NODES))
EFFECTIVE_BATCH_SIZE=24
MAX_GPU_BATCH_SIZE=3
PER_GPU_BATCH_SIZE=$((EFFECTIVE_BATCH_SIZE/NGPU))
if [[ $PER_GPU_BATCH_SIZE -lt $MAX_GPU_BATCH_SIZE ]]; then
       GRAD_ACCUM_STEPS=1
else
       GRAD_ACCUM_STEPS=$((PER_GPU_BATCH_SIZE/MAX_GPU_BATCH_SIZE))
fi
JOB_NAME="deepspeed_${NGPU}GPUs_${EFFECTIVE_BATCH_SIZE}batch_size"
config_json=deepspeed_config.json
run_cmd="deepspeed --num_nodes ${NUM_NODES} --num_gpus ${NGPU_PER_NODE} \
       --master_port=${MASTER_PORT} \
       --hostfile ${HOSTFILE} \
       kepler/deepspeed_train.py 
       "
       #--preln \
echo ${run_cmd}
eval ${run_cmd}
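
One thing I'm thinking of adding to this script for the next run is some extra NCCL logging, so that a stuck or failed collective actually shows up in the output instead of just hanging silently. Rough sketch only; these are the standard NCCL/PyTorch environment variables as I understand them, not something the script sets today:

export NCCL_DEBUG=INFO               # NCCL logs communicator setup and any errors
export NCCL_ASYNC_ERROR_HANDLING=1   # let PyTorch abort on a failed/timed-out collective rather than hang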

Some other helpful info: this is what the training part of deepspeed_train.py looks like

def train(args):
    # deepspeed.init_distributed(dist_backend='nccl')
    torch.distributed.init_process_group(backend="nccl")
    config = get_model_config()

    model, optimizer, optimizer_grouped_parameters, scheduler = load_model(config)

    deepspeed_config = {
        "train_micro_batch_size_per_gpu": batch_size, 
        "gradient_accumulation_steps": 64, 
        "optimizer": {
            "type": "Adam",
            "params": {
                "lr": 3e-5,
            }
        },
        "gradient_clipping": 1.0,
        "fp16": {
            "enabled": True
        },
        "zero_optimization": {
            "stage": 3
        },
        "steps_per_print": 2000,
        "output_file": "/home/santosh/output_files/"
    }

    model_engine, optimizer, ds_training_dataloader, lr_scheduler = deepspeed.initialize(model=model,
                                                        model_parameters=optimizer_grouped_parameters,
                                                        lr_scheduler=scheduler, #automatically executed at model_engine.step()
                                                        training_data=AugmentedDataset(
                                                            DiskDataset()
                                                            ),
                                                        collate_fn=collate_ds, 
                                                        config_params=deepspeed_config)

    def gpu_map(inn):
        return inn.to(model_engine.local_rank)

    loss_report_it=1_000

    log = pd.DataFrame(columns=['loss', 'lr', 'step', 'epoch'])
    pd_index=0 # append gives an error unless there's an index

    for ii in range(num_epochs):
        print('Epoch number: ', ii)
        sum_loss = 0
        avg_loss = 0
        for iii, model_batch in enumerate(tqdm(ds_training_dataloader)):
            if iii%loss_report_it==0:
                print(' | average loss: '+str(avg_loss))
            
            model_batch = valmap(gpu_map, model_batch) 
            loss = model_engine(model_batch)
            model_engine.backward(loss)
            model_engine.step()
            
            sum_loss = sum_loss + loss.item()
            
            if iii%loss_report_it==0:
                avg_loss = sum_loss/loss_report_it
                sum_loss = 0

            if iii%100_000==0 and iii>10:
                model_engine.save_checkpoint('/home/secondTest/')

            if (iii > 0) and (iii % config.log_steps == 0):
                log = log.append(pd.DataFrame(data={
                    'lr': model_engine.get_lr()[0],
                    'loss': avg_loss,
                    'step': iii,
                    'epoch': ii
                }, index=[pd_index]), ignore_index=True)
                
                log.to_csv(os.path.join('/home/secondTest', 
                                        'log/log.tsv'), 
                            sep='\t', index=False)
                pd_index+=1
                running_loss_log = 0

    model_engine.save_checkpoint('/home/secondTest/')
    print('Training Finished')
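
Next time it hangs, my plan is to first check whether the GPUs are sitting idle (which would point at the data pipeline or a rank blocked on a collective) or pinned at 100% (which would point at a stuck kernel). Something along these lines; rough sketch, assuming a single node and that the workers show up as python processes:

# per-GPU utilization and memory while the job is frozen
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv

# workers stuck in uninterruptible I/O (state D) would implicate the disk reads in DiskDataset
ps -o pid,stat,wchan,cmd -C python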

Any ideas? Let me know if any other info would be useful.

Santosh-Gupta avatar Jul 07 '21 06:07 Santosh-Gupta

Are you able to reproduce the hang? If so, we have had good luck debugging this sort of thing with py-spy:

https://pypi.org/project/py-spy/

py-spy dump --pid [process-id]

This will get you the Python stack trace of a process. Then you’ll just want to check all the running training processes.
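
For example, something along these lines (just a sketch, assuming the workers show up under deepspeed_train.py in ps) will dump every rank in one go; the rank whose stack looks different from the others is usually the one that is stuck:

# dump the Python stack of every training worker
for pid in $(pgrep -f deepspeed_train.py); do
    echo "=== PID ${pid} ==="
    py-spy dump --pid "${pid}"
done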

jeffra avatar Jul 11 '21 22:07 jeffra

> Are you able to reproduce the hang? If so, we have had good luck debugging this sort of thing with py-spy:
>
> https://pypi.org/project/py-spy/
>
> py-spy dump --pid [process-id]
>
> This will get you the Python stack trace of a process. Then you’ll just want to check all the running training processes.

Not yet, it took about 2 weeks of training for the freeze to happen.

py-spy looks like a powerful debugging tool, thanks for sending it our way

Santosh-Gupta avatar Jul 13 '21 07:07 Santosh-Gupta

Having the same issue. Did you solve it?

sbmaruf avatar Nov 10 '21 05:11 sbmaruf

Having the same issue. Did you solve it?

weiyx16 avatar Sep 13 '22 01:09 weiyx16

Closing this issue as it is fairly stale - if anyone hits this, wants to debug it, and can post a fresh stack trace, please open a new issue and link this one, thanks!

loadams avatar Sep 20 '23 20:09 loadams