DeepSpeed
How to diagnose the cause of frozen training.
I was running some training when it seemed to freeze in the middle of an epoch. I don't see any other reported issues like this, so it could be related to my code or my system, but I am trying to figure out how to pinpoint the problem.
This is what the output looked like before it froze
 2%|██▍ | 114000/6629075 [22:10:33<1283:00:44, 1.41it/s] | average loss: 6.063703125
[timer.py:154:stop] 0/12
 2%|██▍ | 114000/6629075 [22:10:33<1283:07:53, 1.41it/s] | average loss: 6.073753125
 | average loss: 6.07337109375
 2%|██▍ | 114788/6629075 [22:19:44<1320:01:46, 1.37it/s]
After being stuck on this for a few days, I did Ctrl-C and got this:
Killing subprocess 26589
Killing subprocess 26590
Killing subprocess 26591
Main process received SIGINT, exiting
Traceback (most recent call last):
File "/opt/conda/bin/deepspeed", line 6, in <module>
main()
File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/runner.py", line 360, in main
result.wait()
File "/opt/conda/lib/python3.8/subprocess.py", line 1079, in wait
return self._wait(timeout=timeout)
File "/opt/conda/lib/python3.8/subprocess.py", line 1804, in _wait
(pid, sts) = self._try_wait(0)
File "/opt/conda/lib/python3.8/subprocess.py", line 1762, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
My guess is that there was an issue with the data pipeline.
@Santosh-Gupta, can you share more details on this, such as the model and command line?
It's a transformer model with some custom heads. Maybe there's some loss issue I'm overlooking?
The command line is just `bash deepspeed/deepspeed_ss.sh 4`
And the file is:
NGPU_PER_NODE=$1
MODEL_FILE=$2
SQUAD_DIR=$3
OUTPUT_DIR=$4
LR=${5:-0.00003}
SEED=${6:-12345}
MASTER_PORT=${7:-29500}
DROPOUT=${8:-0.1}
echo "lr is ${LR}"
echo "seed is $SEED"
echo "master port is $MASTER_PORT"
echo "dropout is ${DROPOUT}"
# Force deepspeed to run with only local node
NUM_NODES=1
HOSTFILE=/dev/null
NGPU=$((NGPU_PER_NODE*NUM_NODES))
EFFECTIVE_BATCH_SIZE=24
MAX_GPU_BATCH_SIZE=3
# Derive the per-GPU batch size and gradient accumulation (worked example after the script)
PER_GPU_BATCH_SIZE=$((EFFECTIVE_BATCH_SIZE/NGPU))
if [[ $PER_GPU_BATCH_SIZE -lt $MAX_GPU_BATCH_SIZE ]]; then
GRAD_ACCUM_STEPS=1
else
GRAD_ACCUM_STEPS=$((PER_GPU_BATCH_SIZE/MAX_GPU_BATCH_SIZE))
fi
JOB_NAME="deepspeed_${NGPU}GPUs_${EFFECTIVE_BATCH_SIZE}batch_size"
config_json=deepspeed_config.json
run_cmd="deepspeed --num_nodes ${NUM_NODES} --num_gpus ${NGPU_PER_NODE} \
--master_port=${MASTER_PORT} \
--hostfile ${HOSTFILE} \
kepler/deepspeed_train.py
"
#--preln \
echo ${run_cmd}
eval ${run_cmd}
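For reference, the batch-size arithmetic above works out like this for the 4-GPU invocation (my own worked numbers, not actual script output):

# NGPU_PER_NODE=4, NUM_NODES=1
# NGPU               = 4 * 1 = 4
# PER_GPU_BATCH_SIZE = EFFECTIVE_BATCH_SIZE / NGPU = 24 / 4 = 6
# 6 is not less than MAX_GPU_BATCH_SIZE (3), so:
# GRAD_ACCUM_STEPS   = PER_GPU_BATCH_SIZE / MAX_GPU_BATCH_SIZE = 6 / 3 = 2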
Some other helpful info: this is what the training part of deepspeed_train.py looks like:
def train(args):
    # deepspeed.init_distributed(dist_backend='nccl')
    torch.distributed.init_process_group(backend="nccl")
    config = get_model_config()
    model, optimizer, optimizer_grouped_parameters, scheduler = load_model(config)
    deepspeed_config = {
        "train_micro_batch_size_per_gpu": batch_size,
        "gradient_accumulation_steps": 64,
        "optimizer": {
            "type": "Adam",
            "params": {
                "lr": 3e-5,
            }
        },
        "gradient_clipping": 1.0,
        "fp16": {
            "enabled": True
        },
        "zero_optimization": {
            "stage": 3
        },
        "steps_per_print": 2000,
        "output_file": "/home/santosh/output_files/"
    }
    model_engine, optimizer, ds_training_dataloader, lr_scheduler = deepspeed.initialize(
        model=model,
        model_parameters=optimizer_grouped_parameters,
        lr_scheduler=scheduler,  # automatically executed at model_engine.step()
        training_data=AugmentedDataset(DiskDataset()),
        collate_fn=collate_ds,
        config_params=deepspeed_config)

    def gpu_map(inn):
        # Move a tensor to this rank's GPU
        return inn.to(model_engine.local_rank)

    loss_report_it = 1_000
    log = pd.DataFrame(columns=['loss', 'lr', 'step', 'epoch'])
    pd_index = 0  # append gives an error unless there's an index
    for ii in range(num_epochs):
        print('Epoch number: ', ii)
        sum_loss = 0
        avg_loss = 0
        for iii, model_batch in enumerate(tqdm(ds_training_dataloader)):
            if iii % loss_report_it == 0:
                print(' | average loss: ' + str(avg_loss))
            model_batch = valmap(gpu_map, model_batch)
            loss = model_engine(model_batch)
            model_engine.backward(loss)
            model_engine.step()
            sum_loss = sum_loss + loss.item()
            if iii % loss_report_it == 0:
                avg_loss = sum_loss / loss_report_it
                sum_loss = 0
            # Periodic checkpoint
            if iii % 100_000 == 0 and iii > 10:
                model_engine.save_checkpoint('/home/secondTest/')
            # Periodic loss/lr logging to a TSV file
            if (iii > 0) and (iii % config.log_steps == 0):
                log = log.append(pd.DataFrame(data={
                    'lr': model_engine.get_lr()[0],
                    'loss': avg_loss,
                    'step': iii,
                    'epoch': ii
                }, index=[pd_index]), ignore_index=True)
                log.to_csv(os.path.join('/home/secondTest', 'log/log.tsv'),
                           sep='\t', index=False)
                pd_index += 1
                running_loss_log = 0
    model_engine.save_checkpoint('/home/secondTest/')
    print('Training Finished')
Any ideas? Let me know if any other info would be useful.
Are you able to reproduce the hang? If so, we have had good luck debugging this sort of thing with py-spy:
https://pypi.org/project/py-spy/
py-spy dump --pid [process-id]
This will get you the Python stack trace of a process. Then you’ll just want to check all the running training processes.
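For example, something along these lines will dump the stack of every worker in one go (a rough sketch; the pgrep pattern is just a guess at how your processes are named, so adjust it to match your job):

# Rough sketch: dump the Python stack of each training worker with py-spy.
# The pattern 'deepspeed_train.py' is an assumption; match whatever shows up
# in `ps aux` for your run. Depending on permissions you may need sudo.
for pid in $(pgrep -f deepspeed_train.py); do
    echo "=== process $pid ==="
    py-spy dump --pid "$pid"
done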
Not yet; it took about 2 weeks of training before the freeze happened.
py-spy looks like a powerful debugging tool; thanks for sending it our way.
Having the same issue. Did you solve it?
Closing this issue as it is fairly stale - if anyone hits this and wants to debug and can post a fresh stacktrace, please open a new issue and link this one, thanks!