Error when using multi-GPU training: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
I am trying to train a Chinese Conformer model. When I train with 4x 2080 Ti GPUs, training fails partway through an epoch with `CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered`, and the point at which it happens is not fixed. This problem does not occur when I train with only one GPU. Please help me.
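For reference, the multi-GPU run boils down to TensorFlow's `MirroredStrategy` (the "mirror strategy" mentioned below). A minimal sketch of that setup — the placeholder model is illustrative, not the actual Conformer training script:

```python
import tensorflow as tf

# Sketch of the multi-GPU path assumed here; TensorFlowASR's training
# script wires this up internally.
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))  # expecting 4x 2080 Ti

# MirroredStrategy replicates the model onto every visible GPU and
# all-reduces gradients across replicas after each step.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Model and optimizer must be created inside the strategy scope.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])  # placeholder model
    model.compile(optimizer="adam", loss="mse")
```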
This is my environment:
tensorflow-gpu==2.7 tensorflow-text==2.7 tensorflow-io==0.23
Below is my config.yml:
```yaml
speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_frame: False

decoder_config:
  vocabulary: /remote-home/jzhan/TensorFlowASR/vocabularies/AISHELL-1/AISHELL-1_10000.subwords
  target_vocab_size: 10000
  max_subword_length: 10
  blank_at_zero: True
  beam_width: 0
  norm_score: True
  corpus_files:
    - /remote-home/jzhan/Datasets/AISHELL-1_test/train/transcripts.tsv

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 2
  prediction_layer_norm: True
  prediction_projection_units: 0
  joint_dim: 320
  prejoint_linear: True
  joint_activation: tanh
  joint_mode: add

learning_config:
  train_dataset_config:
    use_tf: True
    augmentation_config:
      feature_augment:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
    data_paths:
      - /remote-home/jzhan/Datasets/AISHELL-1/train/transcripts.tsv
    tfrecords_dir: /remote-home/jzhan/Datasets/AISHELL-1/train/tfrecords
    shuffle: True
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: train

  eval_dataset_config:
    use_tf: True
    data_paths:
      - /remote-home/jzhan/Datasets/AISHELL-1/test/transcripts.tsv
    tfrecords_dir: /remote-home/jzhan/Datasets/AISHELL-1/test/tfrecords
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: eval

  test_dataset_config:
    use_tf: True
    data_paths:
      - /remote-home/jzhan/Datasets/AISHELL-1/test/transcripts.tsv
    tfrecords_dir: /remote-home/jzhan/Datasets/AISHELL-1/test/tfrecords
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: test

  optimizer_config:
    warmup_steps: 40000
    beta_1: 0.9
    beta_2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 8
    num_epochs: 50
    checkpoint:
      filepath: /remote-home/jzhan/TensorFlowASR/Models/conformer/checkpoints/{epoch:02d}.h5
      save_best_only: False
      save_weights_only: True
      save_freq: epoch
    states_dir: /remote-home/jzhan/TensorFlowASR/Models/conformer/states
    tensorboard:
      log_dir: /remote-home/jzhan/TensorFlowASR/Models/conformer/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: epoch
      profile_batch: 2
```
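Before chasing CUDA issues, it can be worth confirming the YAML itself parses as intended. A minimal sketch using PyYAML (the filename is assumed; adjust to your config location):

```python
import yaml

# Parse the config and echo the top-level sections; a silent indentation
# mistake in YAML can otherwise surface as a confusing runtime error.
with open("config.yml") as f:  # path assumed
    config = yaml.safe_load(f)

print(list(config.keys()))
# expected: ['speech_config', 'decoder_config', 'model_config', 'learning_config']
```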
@guokr233 This might be a problem with TensorFlow itself, or the environment may not be set up correctly. Did you use anaconda3 (or miniconda)? Also make sure the CUDA driver is installed correctly on your machine; anaconda ensures that your environment has the libraries/packages needed to run on GPU correctly. I don't have experience solving CUDA errors, so the only suggestion I can offer is to first make sure the environment is set up right, and once the environment is ruled out, test a newer version of TensorFlow (2.8, for example).
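As a quick sanity check of the environment (a minimal sketch, nothing TensorFlowASR-specific):

```python
import tensorflow as tf

# Print the CUDA/cuDNN versions TensorFlow was built against and the
# GPUs it can actually see; a mismatch here usually means a broken install.
build = tf.sysconfig.get_build_info()
print("Built with CUDA:", build.get("cuda_version"))
print("Built with cuDNN:", build.get("cudnn_version"))
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```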
I created the environment via conda; it's a really weird bug.
I created the environment through conda, and I upgraded to tensorflow-gpu 2.8 with CUDA 11.2 and cuDNN 8.4.0, but I still get this error. It seems I can only train slowly on one GPU.
@guokr233 Let me recheck TensorFlow's MirroredStrategy to see if there have been any changes. I currently don't have multiple GPUs, so it's hard to reproduce the issue. I've moved to TPUs on Colab and it still works fine 😄
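If the cross-replica all-reduce turns out to be the culprit, one thing worth trying (a hedged suggestion, not a confirmed fix) is swapping MirroredStrategy's default NCCL all-reduce for a different cross-device implementation:

```python
import tensorflow as tf

# NCCL is the default all-reduce on multi-GPU Linux; if it misbehaves on
# a 4x 2080 Ti setup, HierarchicalCopyAllReduce or ReductionToOneDevice
# route the gradient reduction through host/device copies instead.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()
)
# Alternative: cross_device_ops=tf.distribute.ReductionToOneDevice()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])  # placeholder model
    model.compile(optimizer="adam", loss="mse")
```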
I've also encountered the same problem. I followed the solutions given in this issue, but it didn't work: https://github.com/tensorflow/tensorflow/issues/44281
I also tried this solution, but again it didn't work: https://github.com/tensorflow/tensorflow/issues/40814#issuecomment-663838196
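For anyone else landing here, a mitigation commonly suggested in TensorFlow issue threads about illegal memory accesses is enabling GPU memory growth (whether this matches the exact advice in the linked comments is an assumption on my part). It must run before any model or tensor touches the GPUs:

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing each whole device
# up front; must be set before the GPUs are initialized.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```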
@guokr233 can you try this solution: https://github.com/tensorflow/tensorflow/issues/50735#issuecomment-912320850