LMOps
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
I got this error while fine-tuning:
File "/mnt/oss-data/xxx/minillm/transformers/src/transformers/generation/utils.py", line 3000, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either inf
, nan
or element < 0
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either inf
, nan
or element < 0
[I ProcessGroupNCCL.cpp:844] [Rank 2] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 1] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 3] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 2] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 3] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 1] NCCL watchdog thread terminated normally
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 12080 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 12081 closing signal SIGTERM
...
Has anyone seen the same issue?
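For context while debugging: the message means the `probs` tensor handed to `torch.multinomial` in `utils.py` already contains `nan`/`inf`, so the corruption happens upstream in the logits (the model runs in bfloat16). A small guard like the one below (purely hypothetical, not part of MiniLLM or transformers) reports where the bad values first appear instead of crashing inside `multinomial`:

```python
import torch

# Hypothetical debugging shim, not part of MiniLLM: validate the probability
# tensor before sampling so the offending (batch, vocab) positions are reported.
def checked_multinomial(probs: torch.Tensor, num_samples: int = 1) -> torch.Tensor:
    bad = ~torch.isfinite(probs) | (probs < 0)
    if bad.any():
        first = bad.nonzero()[:10].tolist()  # first few offending indices
        raise RuntimeError(
            f"invalid probabilities at {first}: "
            f"min={probs.min().item()}, max={probs.max().item()}"
        )
    return torch.multinomial(probs, num_samples=num_samples).squeeze(1)
```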
model config:

{
  "_name_or_path": "/xxx",
  "architectures": ["LlamaForCausalLM"],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 13824,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 40,
  "num_hidden_layers": 40,
  "num_key_value_heads": 40,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 55296
}
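One detail that may matter: vocab_size is 55296, i.e. an extended Llama-2 vocabulary. If the tokenizer and the resized embedding matrix disagree, out-of-range token ids can corrupt the logits and eventually surface as this error. This is only a guess, but it is cheap to rule out (paths below are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical sanity check: no token id may index past the embedding table
# after the vocabulary was extended to 55296 entries.
tok = AutoTokenizer.from_pretrained("/xxx")                     # placeholder path
model = AutoModelForCausalLM.from_pretrained("/xxx", torch_dtype="auto")
rows = model.get_input_embeddings().weight.shape[0]
assert rows >= len(tok), f"tokenizer has {len(tok)} tokens but embedding has {rows} rows"
```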
cmd:

export NCCL_DEBUG=""
export WANDB_DISABLED=True
export TF_CPP_MIN_LOG_LEVEL=0
export TORCH_CPP_LOG_LEVEL=0
torchrun --nproc_per_node 4 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 2012 \
  /mnt/oss-data/xxx/minillm/finetune.py \
  --base-path /mnt/oss-data/xxx/minillm \
  --model-path /mnt/oss-data/xxx/minillm/xxx/ \
  --ckpt-name xxx \
  --n-gpu 4 \
  --model-type llama2 \
  --gradient-checkpointing \
  --model-parallel --model-parallel-size 4 \
  --data-dir /mnt/oss-data/xxx/minillm/processed_data/dolly/full/llama2/ \
  --num-workers 0 --dev-num 500 \
  --lr 0.00001 --batch-size 4 --eval-batch-size 8 --gradient-accumulation-steps 2 \
  --warmup-iters 0 --lr-decay-style cosine --weight-decay 1e-2 --clip-grad 1.0 \
  --epochs 10 --max-length 512 --max-prompt-length 256 \
  --do-train --do-valid --eval-gen \
  --save-interval -1 --eval-interval -1 --log-interval 4 --mid-log-num 1 \
  --save /mnt/oss-data/xxx/minillm/results/llama2/train/sft \
  --seed 10 --seed-order 10 \
  --deepspeed --deepspeed_config /mnt/oss-data/xxx/minillm/configs/deepspeed/ds_config_zero2_offload.json \
  --type lm \
  --do-sample --top-k 1 --top-p 0.9 --temperature 1.0
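Side note on the sampling flags: with --top-k 1, --do-sample is effectively greedy decoding, yet a single non-finite logit still crashes torch.multinomial. As a stopgap outside MiniLLM's model-parallel path, stock transformers can mask nan/inf logits during generation via remove_invalid_values (which enables InfNanRemoveLogitsProcessor); a minimal sketch with placeholder paths:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of the stock transformers stopgap: remove_invalid_values=True enables
# InfNanRemoveLogitsProcessor, which masks nan/inf logits before sampling.
tok = AutoTokenizer.from_pretrained("/xxx")                      # placeholder path
model = AutoModelForCausalLM.from_pretrained("/xxx", torch_dtype=torch.bfloat16).cuda()
inputs = tok("Below is an instruction.", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    do_sample=True, top_k=1, top_p=0.9, temperature=1.0,  # same flags as the cmd above
    max_new_tokens=128,
    remove_invalid_values=True,  # mask non-finite logits instead of crashing
)
print(tok.decode(out[0], skip_special_tokens=True))
```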