NCCL timeout during DDP training using BioNeMo default configs
I am using the default configs, code, and data to train a model within the BioNeMo framework. The timeout occurs in the middle of training.
Epoch 0: 6%|██ | 32040/500150 [6:28:43<94:39:17, 1.37it/s, loss=2.6, v_num=95nc, reduced_train_loss=2.590, global_step=3.2e+4, consumed_samples=2.56e+7][E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624886 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800741 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800733 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800769 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800847 milliseconds before timing out.
^C
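For reference, the Timeout(ms)=1800000 in the log is NCCL's default 30-minute collective timeout. As a stopgap the timeout can be raised in a plain PyTorch DDP script along the lines of the sketch below (illustrative only, not the BioNeMo/NeMo entry point, which sets up the process group itself); this only buys time for a slow step rather than fixing whatever stalls the all-reduce.

```python
# Minimal sketch: raising the NCCL collective timeout in a bare PyTorch DDP
# script launched with torchrun (which sets RANK, LOCAL_RANK and WORLD_SIZE).
# Illustrative only; not the BioNeMo/NeMo entry point.
from datetime import timedelta
import os

import torch
import torch.distributed as dist


def init_with_long_timeout() -> None:
    # Default ProcessGroupNCCL timeout is 30 minutes (the 1800000 ms above);
    # raise it so a slow step does not trip the watchdog.
    dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))


if __name__ == "__main__":
    init_with_long_timeout()
    # Single all-reduce as a smoke test that collectives complete at all.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce ok, value={t.item()}")
    dist.destroy_process_group()
```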
The config is:
name: esm2nv
do_training: True # set to false if data preprocessing steps must be completed
do_testing: False # set to true to run evaluation on test data after training, requires test_dataset section
restore_from_path: null # used when starting from a .nemo file
trainer:
devices: 8 # number of GPUs or CPUs
num_nodes: 1
accelerator: gpu #gpu or cpu
precision: 16 #16 or 32
logger: False # logger is provided by NeMo exp_manager
enable_checkpointing: False # checkpointing is done by NeMo exp_manager
replace_sampler_ddp: False # use NeMo Megatron samplers
max_epochs: null # use max_steps instead with NeMo Megatron model
log_every_n_steps: 10 # number of iterations between logging
val_check_interval: 15e4
limit_val_batches: 50 # number of batches in validation step, use fraction for fraction of data, 0 to disable
limit_test_batches: 500 # number of batches in test step, use fraction for fraction of data, 0 to disable
accumulate_grad_batches: 1
gradient_clip_val: 1.0
benchmark: False
max_steps: 500000
exp_manager:
name: ${name}
exp_dir: ${oc.env:BIONEMO_HOME}/results/nemo_experiments/${.name}/${.wandb_logger_kwargs.name}
explicit_log_dir: ${.exp_dir}
create_wandb_logger: True
create_tensorboard_logger: True
wandb_logger_kwargs:
project: ${name}_pretraining
name: ${name}_pretraining
group: ${name}
job_type: Localhost_nodes_${trainer.num_nodes}_gpus_${trainer.devices}
notes: "date: ${now:%y%m%d-%H%M%S}"
tags:
- ${name}
offline: False # set to True if there are issues uploading to WandB during training
resume_if_exists: True # automatically resume if checkpoint exists
resume_ignore_no_checkpoint: True # leave as True, will start new training if resume_if_exists is True but no checkpoint exists
create_checkpoint_callback: True # leave as True, use exp_manager for checkpoints
checkpoint_callback_params:
monitor: val_loss
save_top_k: 10 # number of checkpoints to save
mode: min # use min or max of monitored metric to select best checkpoints
always_save_nemo: False # saves nemo file during validation, not implemented for model parallel
filename: 'megatron_bert--{val_loss:.2f}-{step}-{consumed_samples}'
model_parallel_size: ${multiply:${model.tensor_model_parallel_size}, ${model.pipeline_model_parallel_size}}
model:
precision: 16
# ESM2-specific parameters
micro_batch_size: 100
seq_length: 1024
num_layers: 6
hidden_size: 320
ffn_hidden_size: ${multiply:${model.hidden_size}, 4} # Transformer FFN hidden size. Usually 4 * hidden_size.
num_attention_heads: 20
megatron_legacy: false
position_embedding_type: rope # ESM2 uses rotary positional embeddings (RoPE) to extrapolate to longer sequences unseen during training
hidden_dropout: 0 # ESM2 removes dropout from hidden layers and attention
embedding_use_attention_mask: True # ESM2 uses attention masking on the embeddings
embedding_token_dropout: True # ESM2 rescales embeddings based on masked tokens
mask_token_id: ${.tokenizer.mask_id} # Needed for token dropout rescaling
attention_dropout: 0.0 # ESM2 does not use attention dropout
normalize_attention_scores: False # ESM2 does not use normalized attention scores
tensor_model_parallel_size: 1 # model parallelism
pipeline_model_parallel_size: 1 # model parallelism. If enabled, you need to set data.dynamic_padding to False as pipeline parallelism requires fixed-length padding.
bias_gelu_fusion: False
# NOTE: these are compatibility features
use_esm_attention: True # Use specific attention modifications for ESM2
esm_gelu: True # ESM2 uses a custom gelu in the MLP layer
use_pt_layernorm: False # Use pytorch implementation of layernorm instead of fused nemo layernorm. Important for equivalency of results with ESM2.
use_pt_mlp_out: False # Use pytorch implementation of attention output mlp instead of the nemo version. Important for equivalency of results with ESM2.
# Not specified in ESM2 models:
# model architecture
max_position_embeddings: ${.seq_length}
encoder_seq_length: ${.seq_length}
optim:
name: fused_adam # fused optimizers used by Megatron model
lr: 4e-4
weight_decay: 0.01
betas:
- 0.9
- 0.98
sched:
name: CosineAnnealing
warmup_steps: 2000
constant_steps: 50000
min_lr: 4e-5
init_method_std: 0.02 # Standard deviation of the zero-mean normal distribution used for weight initialization.
kv_channels: null # Projection weights dimension in multi-head attention. Set to hidden_size // num_attention_heads if null
apply_query_key_layer_scaling: True # scale Q * K^T by 1 / layer-number.
layernorm_epsilon: 1e-5
make_vocab_size_divisible_by: 128 # Pad the vocab size to be divisible by this value for computation efficiency.
pre_process: True # add embedding
post_process: True # add pooler
bert_binary_head: False # BERT binary head
resume_from_checkpoint: null # manually set the checkpoint file to load from
# NOTE: is this one of the new fields?
masked_softmax_fusion: True # Use a kernel that fuses the attention softmax with its mask.
tokenizer:
# Use ESM2 tokenizers from HF
library: 'huggingface'
type: 'BertWordPieceLowerCase'
model_name: "???"
mask_id: 32
model: null
vocab_file: null
merge_file: null
# precision
native_amp_init_scale: 4294967296 # 2 ** 32
native_amp_growth_interval: 1000
fp32_residual_connection: False # Move residual connections to fp32
fp16_lm_cross_entropy: False # Move the cross entropy unreduced loss calculation for lm head to fp16
# miscellaneous
seed: 1234
use_cpu_initialization: False # Init weights on the CPU (slow for large model)
onnx_safe: False # Use work-arounds for known problems with Torch ONNX exporter.
# not implemented in NeMo yet
activations_checkpoint_method: null # 'uniform', 'block'
activations_checkpoint_num_layers: 1
data:
ngc_registry_target: uniref50_2022_05
ngc_registry_version: v23.06
data_prefix: "" # must be null or ""
num_workers: 8
dataloader_type: single # cyclic
reset_position_ids: False # Reset position ids after end-of-document token
reset_attention_mask: False # Reset attention mask after end-of-document token
eod_mask_loss: False # Mask loss for the end of document tokens
masked_lm_prob: 0.15 # Probability of replacing a token with mask.
short_seq_prob: 0.1 # Probability of producing a short sequence.
skip_lines: 0
drop_last: False
pin_memory: False
dynamic_padding: False # If True, each batch is padded to the maximum sequence length within that batch.
# Set it to False when model.pipeline_model_parallel_size > 1, as pipeline parallelism requires fixed-length padding.
# Below is the configuration for UF90 resampling.
force_regen_sample_mapping: false # When true, a new uf90 sample mapping is always generated; otherwise, change the seed to force regeneration.
data_impl: "csv_mmap"
# Supported kwargs (with default values):
# text_mmap (newline_int=10, header_lines=0, workers=None, sort_dataset_paths=True)
# csv_mmap (newline_int=10, header_lines=0, workers=None, sort_dataset_paths=True, data_col=1, data_sep=",")
dataset: # inclusive range of data files to load, e.g. x[000..049], or a single file, e.g. x000
train: x[000..049]
test: x[000..049]
val: x[000..049]
data_impl_kwargs:
csv_mmap:
data_col: 1 # 0-based
header_lines: 1
# Rest of these are inherited
uf50_datapath: ${oc.env:BIONEMO_HOME}/data/uniref50_90_202104_esm2nv_v1.0/uniref50_train_filt.fasta
uf90_datapath: ${oc.env:BIONEMO_HOME}/data/uniref50_90_202104_esm2nv_v1.0/uniref90membersandreps_ur50trainfiltreps.fasta
cluster_mapping_tsv: ${oc.env:BIONEMO_HOME}/data/uniref50_90_202104_esm2nv_v1.0/mapping.tsv
# TODO These need to be updated to values we actually like
dataset_path: ${oc.env:BIONEMO_HOME}/data/uniref50_90_202104_esm2nv_v1.0/uf50 # parent directory for data, contains train / val / test folders. Needs to be writeable for index creation.
# NOTE test_size and val_size must be smaller than the total dataset size.
# you can check this with grep -c '>' <uf50_datapath.fasta>
val_size: 5000
test_size: 1000000
sort_fastas: false # If true, assumes the input files are not sorted, and sorts them before creating the cluster mapping. Unsorted fastas will break the cluster map.
uf90:
uniref90_path: ${oc.env:BIONEMO_HOME}/data/uniref50_90_202104_esm2nv_v1.0/uf90/ # created and populated by preprocessing
dataset:
uf90_csvs: x[000..049] # created and populated by preprocessing, this key is a directory inside uniref90_path
data_impl: 'csv_fields_mmap'
data_impl_kwargs:
csv_fields_mmap:
header_lines: 1
newline_int: 10 # byte-value of newline
workers: ${model.data.num_workers} # number of workers when creating missing index files (null defaults to cpu_num / 2)
sort_dataset_paths: True # if True datasets will be sorted by name
data_sep: ',' # string to split text into columns
data_fields:
sequence: 3
sequence_id: 1
index_mapping_dir: ${model.data.index_mapping_dir}
use_upsampling: True # whether the data should be upsampled to the maximum number of steps in the training
seed: ${model.seed} # Random seed
max_seq_length: ${model.seq_length} # Maximum input sequence length. Longer sequences are truncated
modify_percent: 0.1 # Percentage of characters in a protein sequence to modify. (Modification means replacing with another amino acid or with a mask token)
perturb_percent: 0.5 # Of the modify_percent, what percentage of characters are to be replaced with another amino acid.
index_mapping_dir: ${oc.env:BIONEMO_HOME}/data/uniref50_90_202104_esm2nv_v1.0/
dwnstr_task_validation:
enabled: False
dataset:
class: bionemo.model.core.dwnstr_task_callbacks.PerTokenPredictionCallback
task_type: token-level-classification
infer_target: bionemo.model.protein.esm1nv.infer.ESM1nvInference
max_seq_length: ${model.seq_length}
emb_batch_size: 128
batch_size: 128
num_epochs: 10
shuffle: True
num_workers: 8
task_name: secondary_structure
dataset_path: ${oc.env:BIONEMO_HOME}/data/FLIP/${model.dwnstr_task_validation.dataset.task_name}
dataset:
train: x000
test: x000
sequence_column: "sequence" # name of column with protein sequence in csv file
target_column: [ "3state", "resolved" ] # names of label columns in csv file
target_sizes: [ 3, 2 ] # number of classes in each label
mask_column: [ "resolved", null ] # names of mask columns in csv file, masks must be 0 or 1
random_seed: 1234
optim:
name: adam
lr: 0.0001
betas:
- 0.9
- 0.999
eps: 1e-8
weight_decay: 0.01
sched:
name: WarmupAnnealing
min_lr: 0.00001
last_epoch: -1
warmup_ratio: 0.01
max_steps: 1000
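For context on the throughput shown in the progress bar, the data parallelism and effective global batch size implied by this config work out as below (a small sketch using the standard Megatron-style arithmetic, assuming tensor and pipeline parallelism stay at 1 as set above).

```python
# Data-parallel size and global batch size implied by the config pasted above
# (standard Megatron-style arithmetic; values copied from this issue).
micro_batch_size = 100
devices, num_nodes = 8, 1
tensor_parallel, pipeline_parallel = 1, 1
accumulate_grad_batches = 1

data_parallel_size = (devices * num_nodes) // (tensor_parallel * pipeline_parallel)
global_batch_size = micro_batch_size * data_parallel_size * accumulate_grad_batches
print(data_parallel_size, global_batch_size)  # -> 8, 800 sequences per optimizer step
```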
I'm running into a similar issue training the Conformer-large model in a Docker container with the latest nvcr.io/nvidia/nemo:23.10 image, on p2.16xlarge (V100) instances. What is your training environment like?
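For comparison, something like the snippet below dumps the version details that usually matter for these NCCL hangs (plain PyTorch introspection, nothing framework-specific).

```python
# Quick environment dump to compare setups (plain PyTorch introspection).
import torch

print("torch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)
print("NCCL:", torch.cuda.nccl.version())
print("GPUs:", [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])
```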
This is my Docker container:
nvcr.io/nvidia/clara/bionemo-framework:latest "/workspace/bionemo/…" 4 days ago Up 13 hours bionemo
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
I am facing the same issue in a multi-node, multi-GPU setup without Docker. I am using Slurm to run the job.
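What I plan to try next is enabling NCCL and torch.distributed debug logging to see which rank stalls first. These are standard NCCL/PyTorch environment variables, not BioNeMo-specific; exporting them from the Slurm batch script works just as well as setting them in Python before the process group is created.

```python
# Standard NCCL / torch.distributed debug knobs. They must be set before the
# process group is initialized (equivalently, export them in the sbatch script).
import os

os.environ["NCCL_DEBUG"] = "INFO"                 # per-rank NCCL transport logs
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"      # limit output to init/network subsystems
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # report mismatched or hung collectives
```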