
(Again, but different) AssertionError: assert model_dim % head_count == 0

[Open] James-Decatur opened this issue 3 months ago · 2 comments

Hello,

I'm a graduate student at Indiana University and am trying to run OpenNMT on one of our supercomputers. I keep getting the same error listed here: https://github.com/OpenNMT/OpenNMT-py/issues/952, but I already made the suggested changes. Any idea what the issue could be?

The one change I made was the switch to one GPU (and it runs on Google Colab just fine).

Beforehand, I got an error message saying something along the lines of: 'A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.' I do not remember what I did to correct that problem.
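(For context on that earlier warning: it usually means the installed PyTorch wheel was not compiled with kernels for the GPU's compute architecture, which is sm_80 for an A100. The compatibility rule can be sketched as below; `is_supported` and the sample architecture lists are illustrative, not PyTorch's actual API. In a real environment you would compare `torch.cuda.get_device_capability(0)` against `torch.cuda.get_arch_list()`.)

```python
# Hypothetical helper illustrating the sm_XY compatibility rule behind the
# "CUDA capability sm_80 is not compatible" warning.
def is_supported(device_capability, arch_list):
    """device_capability: (major, minor), e.g. (8, 0) for an A100 (sm_80).
    arch_list: architectures the PyTorch build ships kernels for."""
    tag = f"sm_{device_capability[0]}{device_capability[1]}"
    return any(tag in arch for arch in arch_list)

# An older CUDA 10-era build (illustrative list) lacks sm_80 kernels:
old_build = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"]
# A CUDA 11+ build (illustrative list) includes them:
new_build = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86"]

print(is_supported((8, 0), old_build))  # False -> triggers the warning
print(is_supported((8, 0), new_build))  # True
```

Typically the fix is reinstalling PyTorch with a CUDA 11+ build, which may be what resolved it.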


new_no.yaml

```yaml
# Data configurations
save_data: drive/MyDrive/MT_DATA/
src_vocab: drive/MyDrive/MT_DATA/vocab.src
tgt_vocab: drive/MyDrive/MT_DATA/vocab.tgt
save_model: drive/MyDrive/MT_DATA/
overwrite: True
data:
    corpus_1:
        path_src: drive/MyDrive/MT_DATA/train_set_11_char.txt
        path_tgt: drive/MyDrive/MT_DATA/train_set_2_char.txt
    valid:
        path_src: drive/MyDrive/MT_DATA/dev_set_11_char.txt
        path_tgt: drive/MyDrive/MT_DATA/dev_set_2_char.txt

# Training settings
save_checkpoint_steps: 10000
valid_steps: 10000
train_steps: 200000

# Batching
bucket_size: 262144
world_size: 1  # Since only one GPU is available
gpu_ranks: [0]  # Adjusted for single GPU
num_workers: 2
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 2048
accum_count: [4]
accum_steps: [0]

# Optimization
model_dtype: "fp16"
optim: "adam"
learning_rate: 2
warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model architecture
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 6
dec_layers: 6
heads: 8
hidden_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
```


James-Decatur avatar Mar 13 '24 18:03 James-Decatur

The assertion `model_dim % head_count == 0` is the only check that raises this error, so if it fires, the config that the training run actually loaded has a model dimension that is not divisible by the head count. The values in the file you posted (hidden_size: 512, heads: 8) pass that check, regardless of the machine you run on. So verify that the path you pass to the trainer points at this file and that no other option overrides hidden_size or heads.
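To make the check concrete: multi-head attention splits the model dimension evenly across heads, so the split is only well-defined when the head count divides the model dimension. A minimal sketch of the same logic (the function name is mine, not OpenNMT-py's):

```python
def dims_per_head(model_dim, head_count):
    # Multi-head attention gives each head model_dim // head_count
    # dimensions, so head_count must divide model_dim exactly.
    assert model_dim % head_count == 0, (
        f"model_dim ({model_dim}) must be divisible by head_count ({head_count})"
    )
    return model_dim // head_count

print(dims_per_head(512, 8))  # 64 -- the posted config passes
# dims_per_head(500, 8)       # would raise AssertionError (500 % 8 != 0)
```

So the error can only appear if the run is picking up different values than the ones shown above.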

vince62s avatar Mar 13 '24 18:03 vince62s

Hello Vincent,

Could you elaborate more? I don't quite understand what you are trying to say.

Thank you, Jim

James-Decatur avatar Mar 14 '24 00:03 James-Decatur