TensorFlowASR

Error when using multi-GPU training: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

Open JunZhan2000 opened this issue 3 years ago • 6 comments

I am trying to train a Chinese Conformer model. When I train with four 2080 Ti GPUs, training fails partway through an epoch with the error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered. The point at which it fails varies from run to run. The problem does not occur when I train with only one GPU. Please help me.

This is my environment:

tensorflow-gpu==2.7
tensorflow-text==2.7
tensorflow-io==0.23

Below is my config.yml:

speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_frame: False

decoder_config:
  vocabulary: /remote-home/jzhan/TensorFlowASR/vocabularies/AISHELL-1/AISHELL-1_10000.subwords
  target_vocab_size: 10000
  max_subword_length: 10
  blank_at_zero: True
  beam_width: 0
  norm_score: True
  corpus_files:
    - /remote-home/jzhan/Datasets/AISHELL-1_test/train/transcripts.tsv

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 2
  prediction_layer_norm: True
  prediction_projection_units: 0
  joint_dim: 320
  prejoint_linear: True
  joint_activation: tanh
  joint_mode: add

learning_config:
  train_dataset_config:
    use_tf: True
    augmentation_config:
      feature_augment:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
    data_paths:
      - /remote-home/jzhan/Datasets/AISHELL-1/train/transcripts.tsv
    tfrecords_dir: /remote-home/jzhan/Datasets/AISHELL-1/train/tfrecords
    shuffle: True
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: train

  eval_dataset_config:
    use_tf: True
    data_paths:
      - /remote-home/jzhan/Datasets/AISHELL-1/test/transcripts.tsv
    tfrecords_dir: /remote-home/jzhan/Datasets/AISHELL-1/test/tfrecords
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: eval

  test_dataset_config:
    use_tf: True
    data_paths:
      - /remote-home/jzhan/Datasets/AISHELL-1/test/transcripts.tsv
    tfrecords_dir: /remote-home/jzhan/Datasets/AISHELL-1/test/tfrecords
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: test

  optimizer_config:
    warmup_steps: 40000
    beta_1: 0.9
    beta_2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 8
    num_epochs: 50
    checkpoint:
      filepath: /remote-home/jzhan/TensorFlowASR/Models/conformer/checkpoints/{epoch:02d}.h5
      save_best_only: False
      save_weights_only: True
      save_freq: epoch
    states_dir: /remote-home/jzhan/TensorFlowASR/Models/conformer/states
    tensorboard:
      log_dir: /remote-home/jzhan/TensorFlowASR/Models/conformer/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: epoch
      profile_batch: 2

JunZhan2000 • Apr 12 '22 14:04

@guokr233 This might be a problem in TensorFlow itself, or the environment may not be set up correctly. Did you use Anaconda3 (or Miniconda)? Also make sure the CUDA driver is installed correctly on your machine. Anaconda ensures that the environment has the libraries and packages needed to run on the GPU. I don't have experience solving CUDA errors, so the only suggestion I can offer is to make sure the environment is set up correctly. Once the environment is ruled out, we can test a newer version of TensorFlow (2.8, for example).
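For anyone checking the environment as suggested here, the following is a minimal sketch of a sanity check, not something taken from the TensorFlowASR repository. The function name `gpu_environment_report` is hypothetical. It only reports what TensorFlow itself sees (version, CUDA/cuDNN build versions, visible GPUs), so it cannot confirm that the driver is healthy, only that the pieces are detected.

```python
import importlib


def gpu_environment_report():
    """Collect basic facts about the TensorFlow/CUDA setup.

    Hypothetical helper for illustration; returns an empty dict
    when TensorFlow is not importable in the current environment.
    """
    try:
        tf = importlib.import_module("tensorflow")
    except ImportError:
        return {}
    # Build info is a dict; the CUDA/cuDNN keys are present only on GPU builds.
    build = tf.sysconfig.get_build_info()
    return {
        "tf_version": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "cuda_version": build.get("cuda_version"),
        "cudnn_version": build.get("cudnn_version"),
        "visible_gpus": [d.name for d in tf.config.list_physical_devices("GPU")],
    }


if __name__ == "__main__":
    for key, value in gpu_environment_report().items():
        print(f"{key}: {value}")
```

On the four-GPU machine described in this issue, `visible_gpus` would be expected to list four devices. Fewer than that would point at the environment rather than at the training code.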

nglehuy • Apr 16 '22 11:04

I created the environment via conda. It's a really weird bug.

JunZhan2000 • Apr 17 '22 02:04

I created the environment through conda and upgraded to tensorflow-gpu 2.8 with CUDA 11.2 and cuDNN 8.4.0, but I still get this error. It seems that I can only train slowly with one GPU.

JunZhan2000 • Apr 17 '22 07:04

@guokr233 Let me recheck TensorFlow's MirroredStrategy to see if there have been any changes. I currently don't have a multi-GPU machine, so it's hard to reproduce the issue. I've moved to TPUs on Colab and it still works fine 😄
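As context for the strategy mentioned here, the sketch below shows how a `tf.distribute.MirroredStrategy` can be built with an explicit all-reduce implementation. This is an assumption-laden illustration, not the TensorFlowASR training code, and `make_strategy` is a hypothetical helper. Switching from NCCL to `HierarchicalCopyAllReduce` is one way to test whether the cross-GPU communication layer is involved in a failure. Nothing in this thread confirms that it resolves the error.

```python
def make_strategy(use_nccl=True):
    """Build a MirroredStrategy with a chosen all-reduce implementation.

    Hypothetical helper for illustration; returns None when
    TensorFlow is not importable in the current environment.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    if use_nccl:
        # NCCL-based all-reduce across GPUs.
        ops = tf.distribute.NcclAllReduce()
    else:
        # Alternative all-reduce that does not go through NCCL.
        ops = tf.distribute.HierarchicalCopyAllReduce()
    return tf.distribute.MirroredStrategy(cross_device_ops=ops)


if __name__ == "__main__":
    strategy = make_strategy(use_nccl=False)
    if strategy is not None:
        print("replicas in sync:", strategy.num_replicas_in_sync)
```

The model and optimizer would then be created inside `strategy.scope()`. If the crash disappears with the non-NCCL path, that narrows the search to the communication layer rather than the model.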

nglehuy • Apr 17 '22 12:04

I've encountered the same problem. I followed the solutions given in this issue, but they didn't work: https://github.com/tensorflow/tensorflow/issues/44281

I also tried this solution, but again it didn't work: https://github.com/tensorflow/tensorflow/issues/40814#issuecomment-663838196

NusratNB • Oct 14 '22 00:10

@guokr233 Can you try this solution: https://github.com/tensorflow/tensorflow/issues/50735#issuecomment-912320850

NusratNB • Oct 14 '22 01:10