
Getting an error when trying to pre-train for three languages

Open · Aniruddha-JU opened this issue 2 years ago · 15 comments

I am using the command below:

python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs hi,kn,bn --mono_src /home/aniruddha/all_data/train.hi,/home/aniruddha/all_data/train.kn,/home/aniruddha/all_data/train.bn --batch_size 8 --batch_size_indicates_lines --shard_files --model_path aibharat/IndicBART/model --port 7878


Using label smoothing of 0.1
Using gradient clipping norm of 1.0
Using softmax temperature of 1.0
Masking ratio: 0.3
Training for: ['hi', 'kn', 'bn']
Shuffling corpus!
Shuffling corpus!
Shuffling corpus!
Saving the model
Loading from checkpoint
Traceback (most recent call last):
  File "pretrain_nmt.py", line 968, in <module>
    run_demo()
  File "pretrain_nmt.py", line 965, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
  File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/aniruddha/yanmtt/pretrain_nmt.py", line 521, in model_create_load_run_save
    lprobs, labels, args.label_smoothing, ignore_index=tok.pad_token_id
  File "/home/aniruddha/yanmtt/common_utils.py", line 147, in label_smoothed_nll_loss
    smooth_loss.masked_fill(pad_mask, 0.0)
RuntimeError: The expanded size of the tensor (316) must match the existing size (315) at non-singleton dimension 1. Target sizes: [8, 316, 1]. Tensor sizes: [8, 315, 1]
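The RuntimeError says that the label-smoothing loss and the padding mask disagree by one position along the sequence dimension. A hypothetical minimal reproduction of that failure mode (this is not the actual yanmtt code; the variable names and shapes are taken only from the error message):

```python
import torch

# Shapes taken from the error message: [8, 315, 1] vs. [8, 316, 1].
smooth_loss = torch.randn(8, 315, 1)                  # per-token loss terms
pad_mask = torch.zeros(8, 316, 1, dtype=torch.bool)   # mask derived from the labels

# masked_fill must broadcast the mask against the tensor; 315 and 316 cannot
# be reconciled along dimension 1, so this raises the same kind of
# RuntimeError (exact wording varies by PyTorch version).
smooth_loss.masked_fill(pad_mask, 0.0)
```

In other words, the labels come out one token longer than the decoder log-probabilities, which is consistent with a tokenization or preprocessing problem rather than a modelling one.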

Aniruddha-JU avatar Aug 29 '22 14:08 Aniruddha-JU

Have you converted the scripts of the non-Devanagari languages to Devanagari?

Look here: https://github.com/AI4Bharat/indic-bart

That's likely the reason.
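For reference, a minimal sketch of the kind of conversion meant here, using the indic_nlp_library package that the IndicBART repo linked above relies on (the helper function is hypothetical; mapping everything to Devanagari via the "hi" code is an assumption based on that README):

```python
# pip install indic-nlp-library
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

def to_devanagari(line: str, lang: str) -> str:
    """Map a line into Devanagari script; Hindi is already Devanagari."""
    if lang == "hi":
        return line
    return UnicodeIndicTransliterator.transliterate(line, lang, "hi")

print(to_devanagari("ಕನ್ನಡ ವಾಕ್ಯ", "kn"))  # Kannada text rendered in Devanagari
```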

prajdabre avatar Aug 29 '22 15:08 prajdabre

Yes, we converted it.

Aniruddha-JU avatar Aug 29 '22 16:08 Aniruddha-JU

Can you give me the detailed log?

prajdabre avatar Aug 29 '22 16:08 prajdabre

Using label smoothing of 0.1
Using gradient clipping norm of 1.0
Using softmax temperature of 1.0
Masking ratio: 0.3
Training for: ['hi', 'kn', 'bn']
Shuffling corpus!
Shuffling corpus!
Shuffling corpus!
Saving the model
Loading from checkpoint
Traceback (most recent call last):
  File "pretrain_nmt.py", line 968, in <module>
    run_demo()
  File "pretrain_nmt.py", line 965, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
  File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/aniruddha/yanmtt/pretrain_nmt.py", line 521, in model_create_load_run_save
    lprobs, labels, args.label_smoothing, ignore_index=tok.pad_token_id
  File "/home/aniruddha/yanmtt/common_utils.py", line 147, in label_smoothed_nll_loss
    smooth_loss.masked_fill(pad_mask, 0.0)
RuntimeError: The expanded size of the tensor (383) must match the existing size (382) at non-singleton dimension 1. Target sizes: [8, 383, 1]. Tensor sizes: [8, 382, 1]

Aniruddha-JU avatar Aug 29 '22 16:08 Aniruddha-JU

I mean the log right from the moment you ran the model. I need to see the tokenizer loading message etc.

prajdabre avatar Aug 29 '22 16:08 prajdabre

IP address is localhost
Monolingual training files are: {'hi': '/home/aniruddha/all_data/train.hi', 'kn': '/home/aniruddha/all_data/train.kn', 'bn': '/home/aniruddha/all_data/train.bn'}
Sharding files into 1 parts
For language: hi the total number of lines are: 159354 and number of lines per shard are: 159354
File for language hi has been sharded.
For language: kn the total number of lines are: 56715 and number of lines per shard are: 56715
File for language kn has been sharded.
For language: bn the total number of lines are: 438796 and number of lines per shard are: 438796
File for language bn has been sharded.
Sharding files into 1 parts
Tokenizer is: PreTrainedTokenizer(name_or_path='ai4bharat/IndicBART', vocab_size=64000, model_max_len=1000000000000000019884624838656, is_fast=False, padding_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '', 'sep_token': '[SEP]', 'pad_token': '', 'cls_token': '[CLS]', 'mask_token': AddedToken("[MASK]", rstrip=False, lstrip=True, single_word=False, normalized=True), 'additional_special_tokens': ['', '', '<2as>', '<2bn>', '<2en>', '<2gu>', '<2hi>', '<2kn>', '<2ml>', '<2mr>', '<2or>', '<2pa>', '<2ta>', '<2te>']})
Running DDP checkpoint example on rank 0.
We will do fp32 training
2022-08-29 21:44:47.535611: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2022-08-29 21:44:47.535653: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Using positional embeddings
Using positional embeddings
Memory consumed after moving model to GPU 0.91 GB
Memory consumed after wrapping model in DDP 2.06 GB
Optimizing ['module.model.shared.weight', 'module.model.encoder.embed_positions.weight', 'module.model.encoder.layers.0.self_attn.k_proj.weight', 'module.model.encoder.layers.0.self_attn.k_proj.bias', 'module.model.encoder.layers.0.self_attn.v_proj.weight', 'module.model.encoder.layers.0.self_attn.v_proj.bias', 'module.model.encoder.layers.0.self_attn.q_proj.weight', 'module.model.encoder.layers.0.self_attn.q_proj.bias', 'module.model.encoder.layers.0.self_attn.out_proj.weight', 'module.model.encoder.layers.0.self_attn.out_proj.bias', 'module.model.encoder.layers.0.self_attn_layer_norm.weight', 'module.model.encoder.layers.0.self_attn_layer_norm.bias', 'module.model.encoder.layers.0.fc1.weight', 'module.model.encoder.layers.0.fc1.bias', 'module.model.encoder.layers.0.fc2.weight', 'module.model.encoder.layers.0.fc2.bias', 'module.model.encoder.layers.0.final_layer_norm.weight', 'module.model.encoder.layers.0.final_layer_norm.bias', 'module.model.encoder.layers.1.self_attn.k_proj.weight', 'module.model.encoder.layers.1.self_attn.k_proj.bias', 'module.model.encoder.layers.1.self_attn.v_proj.weight', 'module.model.encoder.layers.1.self_attn.v_proj.bias', 'module.model.encoder.layers.1.self_attn.q_proj.weight', 'module.model.encoder.layers.1.self_attn.q_proj.bias', 'module.model.encoder.layers.1.self_attn.out_proj.weight', 'module.model.encoder.layers.1.self_attn.out_proj.bias', 'module.model.encoder.layers.1.self_attn_layer_norm.weight', 'module.model.encoder.layers.1.self_attn_layer_norm.bias', 'module.model.encoder.layers.1.fc1.weight', 'module.model.encoder.layers.1.fc1.bias', 'module.model.encoder.layers.1.fc2.weight', 'module.model.encoder.layers.1.fc2.bias', 
'module.model.encoder.layers.1.final_layer_norm.weight', 'module.model.encoder.layers.1.final_layer_norm.bias', 'module.model.encoder.layers.2.self_attn.k_proj.weight', 'module.model.encoder.layers.2.self_attn.k_proj.bias', 'module.model.encoder.layers.2.self_attn.v_proj.weight', 'module.model.encoder.layers.2.self_attn.v_proj.bias', 'module.model.encoder.layers.2.self_attn.q_proj.weight', 'module.model.encoder.layers.2.self_attn.q_proj.bias', 'module.model.encoder.layers.2.self_attn.out_proj.weight', 'module.model.encoder.layers.2.self_attn.out_proj.bias', 'module.model.encoder.layers.2.self_attn_layer_norm.weight', 'module.model.encoder.layers.2.self_attn_layer_norm.bias', 'module.model.encoder.layers.2.fc1.weight', 'module.model.encoder.layers.2.fc1.bias', 'module.model.encoder.layers.2.fc2.weight', 'module.model.encoder.layers.2.fc2.bias', 'module.model.encoder.layers.2.final_layer_norm.weight', 'module.model.encoder.layers.2.final_layer_norm.bias', 'module.model.encoder.layers.3.self_attn.k_proj.weight', 'module.model.encoder.layers.3.self_attn.k_proj.bias', 'module.model.encoder.layers.3.self_attn.v_proj.weight', 'module.model.encoder.layers.3.self_attn.v_proj.bias', 'module.model.encoder.layers.3.self_attn.q_proj.weight', 'module.model.encoder.layers.3.self_attn.q_proj.bias', 'module.model.encoder.layers.3.self_attn.out_proj.weight', 'module.model.encoder.layers.3.self_attn.out_proj.bias', 'module.model.encoder.layers.3.self_attn_layer_norm.weight', 'module.model.encoder.layers.3.self_attn_layer_norm.bias', 'module.model.encoder.layers.3.fc1.weight', 'module.model.encoder.layers.3.fc1.bias', 'module.model.encoder.layers.3.fc2.weight', 'module.model.encoder.layers.3.fc2.bias', 'module.model.encoder.layers.3.final_layer_norm.weight', 'module.model.encoder.layers.3.final_layer_norm.bias', 'module.model.encoder.layers.4.self_attn.k_proj.weight', 'module.model.encoder.layers.4.self_attn.k_proj.bias', 'module.model.encoder.layers.4.self_attn.v_proj.weight', 'module.model.encoder.layers.4.self_attn.v_proj.bias', 'module.model.encoder.layers.4.self_attn.q_proj.weight', 'module.model.encoder.layers.4.self_attn.q_proj.bias', 'module.model.encoder.layers.4.self_attn.out_proj.weight', 'module.model.encoder.layers.4.self_attn.out_proj.bias', 'module.model.encoder.layers.4.self_attn_layer_norm.weight', 'module.model.encoder.layers.4.self_attn_layer_norm.bias', 'module.model.encoder.layers.4.fc1.weight', 'module.model.encoder.layers.4.fc1.bias', 'module.model.encoder.layers.4.fc2.weight', 'module.model.encoder.layers.4.fc2.bias', 'module.model.encoder.layers.4.final_layer_norm.weight', 'module.model.encoder.layers.4.final_layer_norm.bias', 'module.model.encoder.layers.5.self_attn.k_proj.weight', 'module.model.encoder.layers.5.self_attn.k_proj.bias', 'module.model.encoder.layers.5.self_attn.v_proj.weight', 'module.model.encoder.layers.5.self_attn.v_proj.bias', 'module.model.encoder.layers.5.self_attn.q_proj.weight', 'module.model.encoder.layers.5.self_attn.q_proj.bias', 'module.model.encoder.layers.5.self_attn.out_proj.weight', 'module.model.encoder.layers.5.self_attn.out_proj.bias', 'module.model.encoder.layers.5.self_attn_layer_norm.weight', 'module.model.encoder.layers.5.self_attn_layer_norm.bias', 'module.model.encoder.layers.5.fc1.weight', 'module.model.encoder.layers.5.fc1.bias', 'module.model.encoder.layers.5.fc2.weight', 'module.model.encoder.layers.5.fc2.bias', 'module.model.encoder.layers.5.final_layer_norm.weight', 'module.model.encoder.layers.5.final_layer_norm.bias', 
'module.model.encoder.layernorm_embedding.weight', 'module.model.encoder.layernorm_embedding.bias', 'module.model.encoder.layer_norm.weight', 'module.model.encoder.layer_norm.bias', 'module.model.decoder.embed_positions.weight', 'module.model.decoder.layers.0.self_attn.k_proj.weight', 'module.model.decoder.layers.0.self_attn.k_proj.bias', 'module.model.decoder.layers.0.self_attn.v_proj.weight', 'module.model.decoder.layers.0.self_attn.v_proj.bias', 'module.model.decoder.layers.0.self_attn.q_proj.weight', 'module.model.decoder.layers.0.self_attn.q_proj.bias', 'module.model.decoder.layers.0.self_attn.out_proj.weight', 'module.model.decoder.layers.0.self_attn.out_proj.bias', 'module.model.decoder.layers.0.self_attn_layer_norm.weight', 'module.model.decoder.layers.0.self_attn_layer_norm.bias', 'module.model.decoder.layers.0.encoder_attn.k_proj.weight', 'module.model.decoder.layers.0.encoder_attn.k_proj.bias', 'module.model.decoder.layers.0.encoder_attn.v_proj.weight', 'module.model.decoder.layers.0.encoder_attn.v_proj.bias', 'module.model.decoder.layers.0.encoder_attn.q_proj.weight', 'module.model.decoder.layers.0.encoder_attn.q_proj.bias', 'module.model.decoder.layers.0.encoder_attn.out_proj.weight', 'module.model.decoder.layers.0.encoder_attn.out_proj.bias', 'module.model.decoder.layers.0.encoder_attn_layer_norm.weight', 'module.model.decoder.layers.0.encoder_attn_layer_norm.bias', 'module.model.decoder.layers.0.fc1.weight', 'module.model.decoder.layers.0.fc1.bias', 'module.model.decoder.layers.0.fc2.weight', 'module.model.decoder.layers.0.fc2.bias', 'module.model.decoder.layers.0.final_layer_norm.weight', 'module.model.decoder.layers.0.final_layer_norm.bias', 'module.model.decoder.layers.1.self_attn.k_proj.weight', 'module.model.decoder.layers.1.self_attn.k_proj.bias', 'module.model.decoder.layers.1.self_attn.v_proj.weight', 'module.model.decoder.layers.1.self_attn.v_proj.bias', 'module.model.decoder.layers.1.self_attn.q_proj.weight', 'module.model.decoder.layers.1.self_attn.q_proj.bias', 'module.model.decoder.layers.1.self_attn.out_proj.weight', 'module.model.decoder.layers.1.self_attn.out_proj.bias', 'module.model.decoder.layers.1.self_attn_layer_norm.weight', 'module.model.decoder.layers.1.self_attn_layer_norm.bias', 'module.model.decoder.layers.1.encoder_attn.k_proj.weight', 'module.model.decoder.layers.1.encoder_attn.k_proj.bias', 'module.model.decoder.layers.1.encoder_attn.v_proj.weight', 'module.model.decoder.layers.1.encoder_attn.v_proj.bias', 'module.model.decoder.layers.1.encoder_attn.q_proj.weight', 'module.model.decoder.layers.1.encoder_attn.q_proj.bias', 'module.model.decoder.layers.1.encoder_attn.out_proj.weight', 'module.model.decoder.layers.1.encoder_attn.out_proj.bias', 'module.model.decoder.layers.1.encoder_attn_layer_norm.weight', 'module.model.decoder.layers.1.encoder_attn_layer_norm.bias', 'module.model.decoder.layers.1.fc1.weight', 'module.model.decoder.layers.1.fc1.bias', 'module.model.decoder.layers.1.fc2.weight', 'module.model.decoder.layers.1.fc2.bias', 'module.model.decoder.layers.1.final_layer_norm.weight', 'module.model.decoder.layers.1.final_layer_norm.bias', 'module.model.decoder.layers.2.self_attn.k_proj.weight', 'module.model.decoder.layers.2.self_attn.k_proj.bias', 'module.model.decoder.layers.2.self_attn.v_proj.weight', 'module.model.decoder.layers.2.self_attn.v_proj.bias', 'module.model.decoder.layers.2.self_attn.q_proj.weight', 'module.model.decoder.layers.2.self_attn.q_proj.bias', 'module.model.decoder.layers.2.self_attn.out_proj.weight', 
'module.model.decoder.layers.2.self_attn.out_proj.bias', 'module.model.decoder.layers.2.self_attn_layer_norm.weight', 'module.model.decoder.layers.2.self_attn_layer_norm.bias', 'module.model.decoder.layers.2.encoder_attn.k_proj.weight', 'module.model.decoder.layers.2.encoder_attn.k_proj.bias', 'module.model.decoder.layers.2.encoder_attn.v_proj.weight', 'module.model.decoder.layers.2.encoder_attn.v_proj.bias', 'module.model.decoder.layers.2.encoder_attn.q_proj.weight', 'module.model.decoder.layers.2.encoder_attn.q_proj.bias', 'module.model.decoder.layers.2.encoder_attn.out_proj.weight', 'module.model.decoder.layers.2.encoder_attn.out_proj.bias', 'module.model.decoder.layers.2.encoder_attn_layer_norm.weight', 'module.model.decoder.layers.2.encoder_attn_layer_norm.bias', 'module.model.decoder.layers.2.fc1.weight', 'module.model.decoder.layers.2.fc1.bias', 'module.model.decoder.layers.2.fc2.weight', 'module.model.decoder.layers.2.fc2.bias', 'module.model.decoder.layers.2.final_layer_norm.weight', 'module.model.decoder.layers.2.final_layer_norm.bias', 'module.model.decoder.layers.3.self_attn.k_proj.weight', 'module.model.decoder.layers.3.self_attn.k_proj.bias', 'module.model.decoder.layers.3.self_attn.v_proj.weight', 'module.model.decoder.layers.3.self_attn.v_proj.bias', 'module.model.decoder.layers.3.self_attn.q_proj.weight', 'module.model.decoder.layers.3.self_attn.q_proj.bias', 'module.model.decoder.layers.3.self_attn.out_proj.weight', 'module.model.decoder.layers.3.self_attn.out_proj.bias', 'module.model.decoder.layers.3.self_attn_layer_norm.weight', 'module.model.decoder.layers.3.self_attn_layer_norm.bias', 'module.model.decoder.layers.3.encoder_attn.k_proj.weight', 'module.model.decoder.layers.3.encoder_attn.k_proj.bias', 'module.model.decoder.layers.3.encoder_attn.v_proj.weight', 'module.model.decoder.layers.3.encoder_attn.v_proj.bias', 'module.model.decoder.layers.3.encoder_attn.q_proj.weight', 'module.model.decoder.layers.3.encoder_attn.q_proj.bias', 'module.model.decoder.layers.3.encoder_attn.out_proj.weight', 'module.model.decoder.layers.3.encoder_attn.out_proj.bias', 'module.model.decoder.layers.3.encoder_attn_layer_norm.weight', 'module.model.decoder.layers.3.encoder_attn_layer_norm.bias', 'module.model.decoder.layers.3.fc1.weight', 'module.model.decoder.layers.3.fc1.bias', 'module.model.decoder.layers.3.fc2.weight', 'module.model.decoder.layers.3.fc2.bias', 'module.model.decoder.layers.3.final_layer_norm.weight', 'module.model.decoder.layers.3.final_layer_norm.bias', 'module.model.decoder.layers.4.self_attn.k_proj.weight', 'module.model.decoder.layers.4.self_attn.k_proj.bias', 'module.model.decoder.layers.4.self_attn.v_proj.weight', 'module.model.decoder.layers.4.self_attn.v_proj.bias', 'module.model.decoder.layers.4.self_attn.q_proj.weight', 'module.model.decoder.layers.4.self_attn.q_proj.bias', 'module.model.decoder.layers.4.self_attn.out_proj.weight', 'module.model.decoder.layers.4.self_attn.out_proj.bias', 'module.model.decoder.layers.4.self_attn_layer_norm.weight', 'module.model.decoder.layers.4.self_attn_layer_norm.bias', 'module.model.decoder.layers.4.encoder_attn.k_proj.weight', 'module.model.decoder.layers.4.encoder_attn.k_proj.bias', 'module.model.decoder.layers.4.encoder_attn.v_proj.weight', 'module.model.decoder.layers.4.encoder_attn.v_proj.bias', 'module.model.decoder.layers.4.encoder_attn.q_proj.weight', 'module.model.decoder.layers.4.encoder_attn.q_proj.bias', 'module.model.decoder.layers.4.encoder_attn.out_proj.weight', 
'module.model.decoder.layers.4.encoder_attn.out_proj.bias', 'module.model.decoder.layers.4.encoder_attn_layer_norm.weight', 'module.model.decoder.layers.4.encoder_attn_layer_norm.bias', 'module.model.decoder.layers.4.fc1.weight', 'module.model.decoder.layers.4.fc1.bias', 'module.model.decoder.layers.4.fc2.weight', 'module.model.decoder.layers.4.fc2.bias', 'module.model.decoder.layers.4.final_layer_norm.weight', 'module.model.decoder.layers.4.final_layer_norm.bias', 'module.model.decoder.layers.5.self_attn.k_proj.weight', 'module.model.decoder.layers.5.self_attn.k_proj.bias', 'module.model.decoder.layers.5.self_attn.v_proj.weight', 'module.model.decoder.layers.5.self_attn.v_proj.bias', 'module.model.decoder.layers.5.self_attn.q_proj.weight', 'module.model.decoder.layers.5.self_attn.q_proj.bias', 'module.model.decoder.layers.5.self_attn.out_proj.weight', 'module.model.decoder.layers.5.self_attn.out_proj.bias', 'module.model.decoder.layers.5.self_attn_layer_norm.weight', 'module.model.decoder.layers.5.self_attn_layer_norm.bias', 'module.model.decoder.layers.5.encoder_attn.k_proj.weight', 'module.model.decoder.layers.5.encoder_attn.k_proj.bias', 'module.model.decoder.layers.5.encoder_attn.v_proj.weight', 'module.model.decoder.layers.5.encoder_attn.v_proj.bias', 'module.model.decoder.layers.5.encoder_attn.q_proj.weight', 'module.model.decoder.layers.5.encoder_attn.q_proj.bias', 'module.model.decoder.layers.5.encoder_attn.out_proj.weight', 'module.model.decoder.layers.5.encoder_attn.out_proj.bias', 'module.model.decoder.layers.5.encoder_attn_layer_norm.weight', 'module.model.decoder.layers.5.encoder_attn_layer_norm.bias', 'module.model.decoder.layers.5.fc1.weight', 'module.model.decoder.layers.5.fc1.bias', 'module.model.decoder.layers.5.fc2.weight', 'module.model.decoder.layers.5.fc2.bias', 'module.model.decoder.layers.5.final_layer_norm.weight', 'module.model.decoder.layers.5.final_layer_norm.bias', 'module.model.decoder.layernorm_embedding.weight', 'module.model.decoder.layernorm_embedding.bias', 'module.model.decoder.layer_norm.weight', 'module.model.decoder.layer_norm.bias']
Number of model parameters: 244017152
Total number of params to be optimized are: 244017152
Percentage of parameters to be optimized: 100.0
/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:247: UserWarning: To get the last learning rate computed by the scheduler, please use get_last_lr().
  warnings.warn("To get the last learning rate computed by the scheduler, "
/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Initial LR is: 1.25e-07
Training from official pretrained model
/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:216: UserWarning: Please also save or load the state of the optimizer when saving or loading the scheduler.
  warnings.warn(SAVE_STATE_WARNING, UserWarning)
/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:234: UserWarning: Please also save or load the state of the optimizer when saving or loading the scheduler.
  warnings.warn(SAVE_STATE_WARNING, UserWarning)
Using label smoothing of 0.1
Using gradient clipping norm of 1.0
Using softmax temperature of 1.0
Masking ratio: 0.3
Training for: ['hi', 'kn', 'bn']
Shuffling corpus!
Shuffling corpus!
Shuffling corpus!
Saving the model
Loading from checkpoint
Traceback (most recent call last):
  File "pretrain_nmt.py", line 968, in <module>
    run_demo()
  File "pretrain_nmt.py", line 965, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
  File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/aniruddha/yanmtt/pretrain_nmt.py", line 521, in model_create_load_run_save

Aniruddha-JU avatar Aug 29 '22 16:08 Aniruddha-JU

Your problem is likely here:

PreTrainedTokenizer(name_or_path='ai4bharat/IndicBART', vocab_size=64000, model_max_len=1000000000000000019884624838656, is_fast=False, padding_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '', 'sep_token': '[SEP]', 'pad_token': '', 'cls_token': '[CLS]', 'mask_token': AddedToken("[MASK]", rstrip=False, lstrip=True, single_word=False, normalized=True), 'additional_special_tokens': ['', '', '<2as>', '<2bn>', '<2en>', '<2gu>', '<2hi>', '<2kn>', '<2ml>', '<2mr>', '<2or>', '<2pa>', '<2ta>', '<2te>']})
Running DDP checkpoint example on rank 0.
We will do fp32 training
2022-08-29 21:44:47.535611: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2022-08-29 21:44:47.535653: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

Firstly, I am not sure why the tokens <s> and </s> are missing from the tokenizer. Right now they seem to be empty. I am 99% sure that the code changes I made last week were correct. I will check again just to be sure, but even without that, there seems to be some issue with your CUDA installation.

prajdabre avatar Aug 29 '22 16:08 prajdabre

Actually, we are using a DGX A100 server, and CUDA was installed by NVIDIA itself.

Aniruddha-JU avatar Aug 29 '22 16:08 Aniruddha-JU

Oh wait, the tokens are not missing. The confusion arose because you copy-pasted the log without wrapping it in `` (code formatting).

For example: <s> and </s> will be displayed weirdly, with a dash (strikethrough) over the word and, because the renderer interprets <s> and </s> as HTML strikethrough tags. But: `<s>` and `</s>` is displayed correctly.

In the future, please post logs wrapped in `` code formatting.
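One way to double-check that the tokens really exist is to load the tokenizer and look the ids up yourself. The loading kwargs below follow the IndicBART model card and should be treated as assumptions:

```python
from transformers import AlbertTokenizer

tok = AlbertTokenizer.from_pretrained(
    "ai4bharat/IndicBART", do_lower_case=False, use_fast=False, keep_accents=True
)

# Genuine special tokens map to real ids; a truly missing token would
# fall back to the <unk> id instead.
print(tok.convert_tokens_to_ids(["<s>", "</s>", "<2hi>", "<2kn>", "<2bn>"]))
print(tok.unk_token_id)
```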

As for the DGX, I don't think that is the problem. However, I have never worked with a DGX so I can't be sure, and since I don't have one I can't debug DGX-specific issues.

Nevertheless, the fact is that the following error needs to be solved:

2022-08-29 21:44:47.535611: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2022-08-29 21:44:47.535653: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

Until that is resolved, I can't think of any other solution. I recommend googling "Could not load dynamic library 'libcudart.so.10.1'" and solving that issue first.
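That particular warning comes from TensorFlow, but since yanmtt trains with PyTorch (the tracebacks above are all torch.multiprocessing), a quick first sanity check is whether PyTorch itself can see the GPU. These are generic PyTorch calls, not part of yanmtt:

```python
import torch

print(torch.__version__)
print(torch.cuda.is_available())   # False would point to a broken CUDA setup
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```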

prajdabre avatar Aug 29 '22 16:08 prajdabre

But when I pass only one language, it works.

Aniruddha-JU avatar Aug 29 '22 16:08 Aniruddha-JU

The problem could be your data as well.

Can you try running with the individual languages to identify the problematic one?
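Concretely, that would mean re-running the original command with a single language at a time, e.g. for Hindi only:

python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs hi --mono_src /home/aniruddha/all_data/train.hi --batch_size 8 --batch_size_indicates_lines --shard_files --model_path aibharat/IndicBART/model --port 7878

and likewise with kn/train.kn and bn/train.bn.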

prajdabre avatar Aug 29 '22 16:08 prajdabre

कोई भी एजेंसी-होल्डर अधिक पैसे लेने के उद्देश्य से पाठ्य पुस्तकों पर जिल्दबन्दी नहीं कर सकता । पंजाब स्कूल शिक्षा बोर्ड द्वारा मुद्रित तथा प्रकाशित पाठ्य पुस्तकों कोतकों) की छपाई, प्रकाशन, स्टॉक करना, जमाखोरी या बिक्री आदि करना भारतीय दंड प्रणाली के अन्तर्गत गैरकानूनी जुर्म है । सचिव, पंजाब स्कूल शिक्षा बोर्ड, विद्यागर-160062 द्वारा प्रकाशित तथा मैस पंजाब किताब घर, जालन्धर द्वारा मुद्रित ।


Aniruddha-JU avatar Aug 29 '22 17:08 Aniruddha-JU

Above is a sample of the Hindi data.

Aniruddha-JU avatar Aug 29 '22 17:08 Aniruddha-JU

I ran it on another server and the same problem happens. The CUDA problem is not showing now.

Aniruddha-JU avatar Aug 29 '22 17:08 Aniruddha-JU

Hi,

I figured out the issue.

The problem is that when official models are used, you need to pass the language indicator tokens directly to the script.

So where you previously passed --langs hi,kn,bn, you now need to pass --langs "<2hi>,<2kn>,<2bn>".
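With the paths from the original command, the full invocation would therefore become:

python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs "<2hi>,<2kn>,<2bn>" --mono_src /home/aniruddha/all_data/train.hi,/home/aniruddha/all_data/train.kn,/home/aniruddha/all_data/train.bn --batch_size 8 --batch_size_indicates_lines --shard_files --model_path aibharat/IndicBART/model --port 7878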

Try it and let me know.

prajdabre avatar Aug 30 '22 07:08 prajdabre