llama-recipes
No output folder
System Info
Collecting environment information...
PyTorch version: 2.2.0.dev20230912+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.27.2
Libc version: glibc-2.31
Python version: 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-1040-azure-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.3.109
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100 80GB PCIe
GPU 1: NVIDIA A100 80GB PCIe
GPU 2: NVIDIA A100 80GB PCIe
GPU 3: NVIDIA A100 80GB PCIe
Nvidia driver version: 470.182.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 48 bits physical, 48 bits virtual CPU(s): 96 On-line CPU(s) list: 0-95 Thread(s) per core: 1 Core(s) per socket: 48 Socket(s): 2 NUMA node(s): 4 Vendor ID: AuthenticAMD CPU family: 25 Model: 1 Model name: AMD EPYC 7V13 64-Core Processor Stepping: 1 CPU MHz: 2445.438 BogoMIPS: 4890.87 Hypervisor vendor: Microsoft Virtualization type: full L1d cache: 3 MiB L1i cache: 3 MiB L2 cache: 48 MiB L3 cache: 384 MiB NUMA node0 CPU(s): 0-23 NUMA node1 CPU(s): 24-47 NUMA node2 CPU(s): 48-71 NUMA node3 CPU(s): 72-95 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec store bypass: Vulnerable Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm
Versions of relevant libraries:
[pip3] flake8==6.0.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.21.6
[pip3] pytorch-transformers==1.0.0
[pip3] pytorch-triton==2.1.0+6e4932cda8
[pip3] torch==2.2.0.dev20230912+cu118
[pip3] torch-tb-profiler==0.4.1
[pip3] torchaudio==2.2.0.dev20230912+cu118
[pip3] torchvision==0.9.1
[pip3] triton==2.0.0
[conda] _pytorch_select 0.1 cpu_0 anaconda
[conda] blas 1.0 mkl anaconda
[conda] cudatoolkit 10.1.243 h6bb024c_0 anaconda
[conda] libmklml 2019.0.5 h06a4308_0 anaconda
[conda] mkl 2020.2 256 anaconda
[conda] numpy 1.21.6 py38h1d589f8_0 conda-forge
[conda] pytorch-transformers 1.0.0 pypi_0 pypi
[conda] pytorch-triton 2.1.0+6e4932cda8 pypi_0 pypi
[conda] torch 2.2.0.dev20230912+cu118 pypi_0 pypi
[conda] torch-tb-profiler 0.4.1 pypi_0 pypi
[conda] torchaudio 2.2.0.dev20230912+cu118 pypi_0 pypi
[conda] torchvision 0.9.1 py38_cu101 pytorch
[conda] triton 2.0.0 pypi_0 pypi
Information
- [ ] The official example scripts
- [X] My own modified scripts
🐛 Describe the bug
Using a custom dataset with a custom data loader (5000 samples). Using the following command for fine-tuning:
torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py
The process completes without any issues; however, I don't see the provided output/checkpoint folder being created. I've tried running from a different directory, creating the directory manually, and changing the format of the directory name.
Train config:
@dataclass
class train_config:
model_name: str="meta-llama/Llama-2-13b-hf"
enable_fsdp: bool=True
low_cpu_fsdp: bool=True
run_validation: bool=False
batch_size_training: int=12
gradient_accumulation_steps: int=8
num_epochs: int=1
num_workers_dataloader: int=4
lr: float=2e-4
weight_decay: float=0.0
gamma: float= 0.85
seed: int=42
use_fp16: bool=False
mixed_precision: bool=True
val_batch_size: int=1
dataset = "legal_dataset"
peft_method: str = "lora" # None , llama_adapter, prefix
use_peft: bool=True
output_dir: str = "./ft-output"
freeze_layers: bool = False
num_freeze_layers: int = 1
quantization: bool = False
one_gpu: bool = False
save_model: bool = True
dist_checkpoint_root_folder: str="model_checkpoints" # will be used if using FSDP
dist_checkpoint_folder: str="fine-tuned" # will be used if using FSDP
save_optimizer: bool=True # will be used if using FSDP
use_fast_kernels: bool = True
Error logs
Logs from terminal:
Training Epoch: 1/1, step 16/18 completed (loss: 0.20255453884601593): : 3it [12:14, 214.40s/it]
Training Epoch: 1/1, step 16/18 completed (loss: 0.24286626279354095): : 3it [12:15, 214.48s/it]
Training Epoch: 1/1, step 16/18 completed (loss: 0.2025240808725357): : 3it [12:15, 214.51s/it]
Training Epoch: 1/1, step 16/18 completed (loss: 0.22577618062496185): : 3it [12:14, 214.43s/it]
Training Epoch: 1/1, step 17/18 completed (loss: 0.2233143001794815): : 3it [12:15, 245.06s/it]
Training Epoch: 1/1, step 17/18 completed (loss: 0.20939956605434418): : 3it [12:14, 244.96s/it]
Training Epoch: 1/1, step 17/18 completed (loss: 0.20539191365242004): : 3it [12:15, 245.12s/it]
Training Epoch: 1/1, step 17/18 completed (loss: 0.2597067356109619): : 3it [12:14, 244.99s/it]
Max CUDA memory allocated was 45 GB
Max CUDA memory reserved was 54 GB
Peak active CUDA memory was 46 GB
Cuda Malloc retires : 0
CPU Total Peak Memory consumed during the train (max): 7 GB
Epoch 1: train_perplexity=1.2628, train_epoch_loss=0.2333, epoch time 736.3742194380029s
Key: avg_train_prep, Value: 1.262771725654602
Key: avg_train_loss, Value: 0.2333090603351593
Key: avg_epoch_time, Value: 736.3742194380029
Key: avg_checkpoint_time, Value: 0
Expected behavior
Output folder is created.
@Tejaswgupta thanks for flagging this. We need to revisit the saving logic. You set run_validation: bool=False in your config, which currently disables saving of the result. I'll try to create a PR asap. In the meantime, just enable run_validation and you should get the parameters saved. https://github.com/facebookresearch/llama-recipes/blob/c38bf5bdd370ceb93e71cfec1a07b0885a57e3ec/src/llama_recipes/utils/train_utils.py#L131
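For context, the gating around the linked line works roughly as sketched below (a simplified illustration with assumed names, not the exact train_utils.py code): evaluation and checkpoint saving both live inside the run_validation branch, and the save is further conditioned on the eval loss improving on the best loss seen so far, so disabling validation also skips saving.

# Minimal sketch (assumed helper, not the actual llama-recipes code) of how saving
# is gated on validation: no validation run means no eval loss, and no eval loss
# means the save branch is never reached.
def should_save_checkpoint(run_validation: bool, save_model: bool,
                           eval_epoch_loss: float, best_val_loss: float) -> bool:
    if not run_validation:
        return False  # checkpointing lives inside the validation branch
    return save_model and eval_epoch_loss < best_val_loss

print(should_save_checkpoint(False, True, 0.5, float("inf")))  # False: run_validation off
print(should_save_checkpoint(True, True, 0.5, float("inf")))   # True: loss beats the initial inf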
My run_validation parameter is True, but the model file is still not saved. My configuration is as follows:
@dataclass
class train_config:
model_name: str="PATH/to/LLAMA/7B"
enable_fsdp: bool=False
low_cpu_fsdp: bool=False
run_validation: bool=True
batch_size_training: int=4
gradient_accumulation_steps: int=1
num_epochs: int=1
num_workers_dataloader: int=1
lr: float=1e-4
weight_decay: float=0.0
gamma: float= 0.85
seed: int=42
use_fp16: bool=False
mixed_precision: bool=True
val_batch_size: int=1
dataset = "samsum_dataset"
peft_method: str = "lora" # None , llama_adapter, prefix
use_peft: bool=False
output_dir: str = "PATH/to/save/PEFT/model"
freeze_layers: bool = False
num_freeze_layers: int = 1
quantization: bool = False
one_gpu: bool = False
save_model: bool = True
dist_checkpoint_root_folder: str="PATH/to/save/FSDP/model" # will be used if using FSDP
dist_checkpoint_folder: str="fine-tuned" # will be used if using FSDP
save_optimizer: bool=False # will be used if using FSDP
    use_fast_kernels: bool = False # Enable using SDPA from PyTorch Accelerated Transformers, which makes use of Flash Attention and Xformers memory-efficient kernels
My command is as follows:
torchrun --nnodes 1 --nproc_per_node 2 /GPUFS/nsccgz_ywang_zfd/chenchong/llama-recipes-main/finetuning.py \
--enable_fsdp --use_peft --peft_method lora \
--model_name /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf \
--pure_bf16 \
--output_dir /GPUFS/nsccgz_ywang_zfd/chenchong/abtemp \
--dataset alpaca_dataset \
--data_path /GPUFS/nsccgz_ywang_zfd/chenchong/dataset.json \
--batch_size_training 256 \
--micro_batch_size 16 \
2>&1|tee /GPUFS/nsccgz_ywang_zfd/chenchong/ft.log
Everything was okay during the fine-tuning process, but no model file was produced. How should I handle this?
Hi @BugmakerCC can you check your eval loss and post the log of your training run? We've seen the eval loss turning to Inf, which prevents a checkpoint from being saved since we set the initial best_eval_loss to Inf.
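This is easy to see in isolation: best_eval_loss starts at infinity, and a NaN or Inf eval loss never compares as smaller, so the branch that writes the checkpoint never runs (a tiny illustration of the comparison, assuming that check is the gate):

# Why a NaN (or Inf) eval loss blocks saving when the initial best_eval_loss is +inf:
# the "<" comparison is never True for NaN, and inf < inf is also False.
best_eval_loss = float("inf")
print(float("nan") < best_eval_loss)  # False -> checkpoint branch skipped
print(float("inf") < best_eval_loss)  # False -> checkpoint branch skipped
print(0.9 < best_eval_loss)           # True  -> a finite loss would trigger a save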
Here is my log:
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[2023-09-18 08:00:02,178] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-18 08:00:02,256] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards: 33%|███▎ | 1/3 [00:05<00:10, 5.31s/it]
Loading checkpoint shards: 33%|███▎ | 1/3 [00:05<00:10, 5.38s/it]
Loading checkpoint shards: 67%|██████▋ | 2/3 [00:10<00:05, 5.19s/it]
Loading checkpoint shards: 67%|██████▋ | 2/3 [00:10<00:05, 5.22s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00, 4.16s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00, 4.46s/it]
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf and are newly initialized: ['model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.33.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.35.self_attn.rotary_emb.inv_freq', 'model.layers.36.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.32.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.37.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.34.self_attn.rotary_emb.inv_freq', 'model.layers.39.self_attn.rotary_emb.inv_freq', 'model.layers.38.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00, 4.22s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00, 4.50s/it]
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf and are newly initialized: ['model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.32.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.36.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.35.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.39.self_attn.rotary_emb.inv_freq', 'model.layers.33.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.34.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.37.self_attn.rotary_emb.inv_freq', 'model.layers.38.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
--> Model /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf
--> /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf has 13016.02816 Million params
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
The class this function is called from is 'LlamaTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
The class this function is called from is 'LlamaTokenizer'.
trainable params: 6,553,600 || all params: 13,022,581,760 || trainable%: 0.050324890415585306
bFloat16 enabled for mixed precision - using bfSixteen policy
trainable params: 6,553,600 || all params: 13,022,581,760 || trainable%: 0.050324890415585306
--> applying fsdp activation checkpointing...
--> Training Set Length = 6233
--> Validation Set Length = 200
--> applying fsdp activation checkpointing...
/GPUFS/nsccgz_ywang_zfd/anaconda3/lib/python3.8/site-packages/torch/cuda/memory.py:303: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1: 0%|[34m [0m| 0/12 [00:00<?, ?it/s]/GPUFS/nsccgz_ywang_zfd/anaconda3/lib/python3.8/site-packages/torch/cuda/memory.py:303: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1: 0%|[34m [0m| 0/12 [00:00<?, ?it/s]
Training Epoch: 1: 8%|[34mâ–Š [0m| 1/12 [00:36<06:40, 36.40s/it]
Training Epoch: 1/1, step 0/12 completed (loss: 1.1568443775177002): 8%|[34mâ–Š [0m| 1/12 [00:36<06:40, 36.40s/it]
Training Epoch: 1: 8%|[34mâ–Š [0m| 1/12 [00:36<06:45, 36.86s/it]
Training Epoch: 1/1, step 0/12 completed (loss: 1.3038674592971802): 8%|[34mâ–Š [0m| 1/12 [00:36<06:45, 36.86s/it]
Training Epoch: 1/1, step 0/12 completed (loss: 1.3038674592971802): 17%|[34m█▋ [0m| 2/12 [01:09<05:44, 34.42s/it]
Training Epoch: 1/1, step 1/12 completed (loss: 1.401218056678772): 17%|[34m█▋ [0m| 2/12 [01:09<05:44, 34.42s/it]
Training Epoch: 1/1, step 0/12 completed (loss: 1.1568443775177002): 17%|[34m█▋ [0m| 2/12 [01:09<05:42, 34.26s/it]
Training Epoch: 1/1, step 1/12 completed (loss: 1.153338074684143): 17%|[34m█▋ [0m| 2/12 [01:09<05:42, 34.26s/it]
Training Epoch: 1/1, step 1/12 completed (loss: 1.153338074684143): 25%|[34m██▌ [0m| 3/12 [01:44<05:14, 34.94s/it]
Training Epoch: 1/1, step 2/12 completed (loss: 1.1591726541519165): 25%|[34m██▌ [0m| 3/12 [01:44<05:14, 34.94s/it]
Training Epoch: 1/1, step 1/12 completed (loss: 1.401218056678772): 25%|[34m██▌ [0m| 3/12 [01:45<05:15, 35.05s/it]
Training Epoch: 1/1, step 2/12 completed (loss: 1.2161827087402344): 25%|[34m██▌ [0m| 3/12 [01:45<05:15, 35.05s/it]
Training Epoch: 1/1, step 2/12 completed (loss: 1.2161827087402344): 33%|[34m███▎ [0m| 4/12 [02:18<04:33, 34.20s/it]
Training Epoch: 1/1, step 3/12 completed (loss: 1.1507656574249268): 33%|[34m███▎ [0m| 4/12 [02:18<04:33, 34.20s/it]
Training Epoch: 1/1, step 2/12 completed (loss: 1.1591726541519165): 33%|[34m███▎ [0m| 4/12 [02:17<04:33, 34.14s/it]
Training Epoch: 1/1, step 3/12 completed (loss: 0.978593111038208): 33%|[34m███▎ [0m| 4/12 [02:17<04:33, 34.14s/it]
Training Epoch: 1/1, step 3/12 completed (loss: 0.978593111038208): 42%|[34m████� [0m| 5/12 [02:51<03:56, 33.82s/it]
Training Epoch: 1/1, step 4/12 completed (loss: 1.2940857410430908): 42%|[34m████� [0m| 5/12 [02:51<03:56, 33.82s/it]
Training Epoch: 1/1, step 3/12 completed (loss: 1.1507656574249268): 42%|[34m████� [0m| 5/12 [02:51<03:56, 33.85s/it]
Training Epoch: 1/1, step 4/12 completed (loss: 1.0917633771896362): 42%|[34m████� [0m| 5/12 [02:51<03:56, 33.85s/it]
Training Epoch: 1/1, step 4/12 completed (loss: 1.2940857410430908): 50%|[34m█████ [0m| 6/12 [03:24<03:21, 33.62s/it]
Training Epoch: 1/1, step 5/12 completed (loss: 1.1019880771636963): 50%|[34m█████ [0m| 6/12 [03:24<03:21, 33.62s/it]
Training Epoch: 1/1, step 4/12 completed (loss: 1.0917633771896362): 50%|[34m█████ [0m| 6/12 [03:24<03:21, 33.64s/it]
Training Epoch: 1/1, step 5/12 completed (loss: 1.0864710807800293): 50%|[34m█████ [0m| 6/12 [03:24<03:21, 33.64s/it]
Training Epoch: 1/1, step 5/12 completed (loss: 1.0864710807800293): 58%|[34m█████▊ [0m| 7/12 [03:57<02:47, 33.46s/it]
Training Epoch: 1/1, step 6/12 completed (loss: 1.1066715717315674): 58%|[34m█████▊ [0m| 7/12 [03:57<02:47, 33.46s/it]
Training Epoch: 1/1, step 5/12 completed (loss: 1.1019880771636963): 58%|[34m█████▊ [0m| 7/12 [03:57<02:47, 33.45s/it]
Training Epoch: 1/1, step 6/12 completed (loss: 0.8534350395202637): 58%|[34m█████▊ [0m| 7/12 [03:57<02:47, 33.45s/it]
Training Epoch: 1/1, step 6/12 completed (loss: 1.1066715717315674): 67%|[34m██████▋ [0m| 8/12 [04:31<02:13, 33.38s/it]
Training Epoch: 1/1, step 7/12 completed (loss: 1.2339160442352295): 67%|[34m██████▋ [0m| 8/12 [04:31<02:13, 33.38s/it]
Training Epoch: 1/1, step 6/12 completed (loss: 0.8534350395202637): 67%|[34m██████▋ [0m| 8/12 [04:30<02:13, 33.38s/it]
Training Epoch: 1/1, step 7/12 completed (loss: 0.809194028377533): 67%|[34m██████▋ [0m| 8/12 [04:30<02:13, 33.38s/it]
Training Epoch: 1/1, step 7/12 completed (loss: 1.2339160442352295): 75%|[34m███████▌ [0m| 9/12 [05:04<01:40, 33.39s/it]
Training Epoch: 1/1, step 8/12 completed (loss: 1.0909826755523682): 75%|[34m███████▌ [0m| 9/12 [05:04<01:40, 33.39s/it]
Training Epoch: 1/1, step 7/12 completed (loss: 0.809194028377533): 75%|[34m███████▌ [0m| 9/12 [05:04<01:40, 33.38s/it]
Training Epoch: 1/1, step 8/12 completed (loss: 0.8365236520767212): 75%|[34m███████▌ [0m| 9/12 [05:04<01:40, 33.38s/it]
Training Epoch: 1/1, step 8/12 completed (loss: 1.0909826755523682): 83%|[34m████████▎ [0m| 10/12 [05:37<01:06, 33.38s/it]
Training Epoch: 1/1, step 9/12 completed (loss: 0.8921104669570923): 83%|[34m████████▎ [0m| 10/12 [05:37<01:06, 33.38s/it]
Training Epoch: 1/1, step 8/12 completed (loss: 0.8365236520767212): 83%|[34m████████▎ [0m| 10/12 [05:37<01:06, 33.37s/it]
Training Epoch: 1/1, step 9/12 completed (loss: 0.9189796447753906): 83%|[34m████████▎ [0m| 10/12 [05:37<01:06, 33.37s/it]
Training Epoch: 1/1, step 9/12 completed (loss: 0.8921104669570923): 92%|[34m█████████�[0m| 11/12 [06:11<00:33, 33.40s/it]
Training Epoch: 1/1, step 10/12 completed (loss: 0.7444747686386108): 92%|[34m█████████�[0m| 11/12 [06:11<00:33, 33.40s/it]
Training Epoch: 1/1, step 9/12 completed (loss: 0.9189796447753906): 92%|[34m█████████�[0m| 11/12 [06:10<00:33, 33.41s/it]
Training Epoch: 1/1, step 10/12 completed (loss: 0.8288466334342957): 92%|[34m█████████�[0m| 11/12 [06:10<00:33, 33.41s/it]
Training Epoch: 1/1, step 10/12 completed (loss: 0.7444747686386108): 100%|[34m██████████[0m| 12/12 [06:44<00:00, 33.39s/it]
Training Epoch: 1/1, step 11/12 completed (loss: 0.8854449391365051): 100%|[34m██████████[0m| 12/12 [06:44<00:00, 33.39s/it]
Training Epoch: 1/1, step 10/12 completed (loss: 0.8288466334342957): 100%|[34m██████████[0m| 12/12 [06:44<00:00, 33.39s/it]
Training Epoch: 1/1, step 11/12 completed (loss: 0.6895765662193298): 100%|[34m██████████[0m| 12/12 [06:44<00:00, 33.39s/it]
Training Epoch: 1/1, step 11/12 completed (loss: 0.8854449391365051): 100%|[34m██████████[0m| 12/12 [06:44<00:00, 33.72s/it]
Training Epoch: 1/1, step 11/12 completed (loss: 0.6895765662193298): 100%|[34m██████████[0m| 12/12 [06:44<00:00, 33.69s/it]
Max CUDA memory allocated was 58 GB
Max CUDA memory reserved was 77 GB
Peak active CUDA memory was 58 GB
Cuda Malloc retires : 35
CPU Total Peak Memory consumed during the train (max): 3 GB
evaluating Epoch: 0%|[32m [0m| 0/100 [00:00<?, ?it/s]
evaluating Epoch: 0%|[32m [0m| 0/100 [00:00<?, ?it/s]
evaluating Epoch: 1%|[32m [0m| 1/100 [00:02<03:25, 2.08s/it]
evaluating Epoch: 1%|[32m [0m| 1/100 [00:01<01:39, 1.00s/it]
evaluating Epoch: 2%|[32mâ–� [0m| 2/100 [00:01<01:26, 1.13it/s]
evaluating Epoch: 2%|[32mâ–� [0m| 2/100 [00:02<02:10, 1.33s/it]
evaluating Epoch: 3%|[32mâ–Ž [0m| 3/100 [00:03<01:45, 1.09s/it]
evaluating Epoch: 3%|[32mâ–Ž [0m| 3/100 [00:02<01:22, 1.18it/s]
evaluating Epoch: 4%|[32mâ–� [0m| 4/100 [00:04<01:33, 1.02it/s]
evaluating Epoch: 4%|[32mâ–� [0m| 4/100 [00:03<01:19, 1.20it/s]
evaluating Epoch: 5%|[32m▌ [0m| 5/100 [00:04<01:18, 1.22it/s]
evaluating Epoch: 5%|[32m▌ [0m| 5/100 [00:05<01:27, 1.09it/s]
evaluating Epoch: 6%|[32m▌ [0m| 6/100 [00:05<01:16, 1.22it/s]
evaluating Epoch: 6%|[32m▌ [0m| 6/100 [00:06<01:22, 1.14it/s]
evaluating Epoch: 7%|[32mâ–‹ [0m| 7/100 [00:05<01:15, 1.23it/s]
evaluating Epoch: 7%|[32mâ–‹ [0m| 7/100 [00:06<01:19, 1.17it/s]
evaluating Epoch: 8%|[32mâ–Š [0m| 8/100 [00:06<01:14, 1.23it/s]
evaluating Epoch: 8%|[32mâ–Š [0m| 8/100 [00:07<01:17, 1.19it/s]
evaluating Epoch: 9%|[32mâ–‰ [0m| 9/100 [00:07<01:13, 1.24it/s]
evaluating Epoch: 9%|[32mâ–‰ [0m| 9/100 [00:08<01:15, 1.21it/s]
evaluating Epoch: 10%|[32mâ–ˆ [0m| 10/100 [00:09<01:13, 1.22it/s]
evaluating Epoch: 10%|[32mâ–ˆ [0m| 10/100 [00:08<01:12, 1.24it/s]
evaluating Epoch: 11%|[32mâ–ˆ [0m| 11/100 [00:10<01:12, 1.23it/s]
evaluating Epoch: 11%|[32mâ–ˆ [0m| 11/100 [00:09<01:11, 1.24it/s]
evaluating Epoch: 12%|[32m█� [0m| 12/100 [00:09<01:10, 1.24it/s]
evaluating Epoch: 12%|[32m█� [0m| 12/100 [00:10<01:11, 1.23it/s]
evaluating Epoch: 13%|[32m█▎ [0m| 13/100 [00:11<01:10, 1.24it/s]
evaluating Epoch: 13%|[32m█▎ [0m| 13/100 [00:10<01:09, 1.24it/s]
evaluating Epoch: 14%|[32m█� [0m| 14/100 [00:12<01:09, 1.24it/s]
evaluating Epoch: 14%|[32m█� [0m| 14/100 [00:11<01:09, 1.25it/s]
evaluating Epoch: 15%|[32m█▌ [0m| 15/100 [00:13<01:08, 1.24it/s]
evaluating Epoch: 15%|[32m█▌ [0m| 15/100 [00:12<01:08, 1.24it/s]
evaluating Epoch: 16%|[32m█▌ [0m| 16/100 [00:13<01:07, 1.25it/s]
evaluating Epoch: 16%|[32m█▌ [0m| 16/100 [00:14<01:07, 1.24it/s]
evaluating Epoch: 17%|[32m█▋ [0m| 17/100 [00:13<01:07, 1.24it/s]
evaluating Epoch: 17%|[32m█▋ [0m| 17/100 [00:14<01:07, 1.23it/s]
evaluating Epoch: 18%|[32m█▊ [0m| 18/100 [00:14<01:06, 1.24it/s]
evaluating Epoch: 18%|[32m█▊ [0m| 18/100 [00:15<01:06, 1.24it/s]
evaluating Epoch: 19%|[32m█▉ [0m| 19/100 [00:16<01:05, 1.24it/s]
evaluating Epoch: 19%|[32m█▉ [0m| 19/100 [00:15<01:05, 1.24it/s]
evaluating Epoch: 20%|[32m██ [0m| 20/100 [00:16<01:04, 1.24it/s]
evaluating Epoch: 20%|[32m██ [0m| 20/100 [00:17<01:04, 1.24it/s]
evaluating Epoch: 21%|[32m██ [0m| 21/100 [00:17<01:03, 1.24it/s]
evaluating Epoch: 21%|[32m██ [0m| 21/100 [00:18<01:03, 1.24it/s]
evaluating Epoch: 22%|[32m██� [0m| 22/100 [00:17<01:02, 1.24it/s]
evaluating Epoch: 22%|[32m██� [0m| 22/100 [00:18<01:02, 1.24it/s]
evaluating Epoch: 23%|[32m██▎ [0m| 23/100 [00:19<01:02, 1.24it/s]
evaluating Epoch: 23%|[32m██▎ [0m| 23/100 [00:18<01:02, 1.24it/s]
evaluating Epoch: 24%|[32m██� [0m| 24/100 [00:20<01:01, 1.24it/s]
evaluating Epoch: 24%|[32m██� [0m| 24/100 [00:19<01:01, 1.24it/s]
evaluating Epoch: 25%|[32m██▌ [0m| 25/100 [00:21<01:00, 1.24it/s]
evaluating Epoch: 25%|[32m██▌ [0m| 25/100 [00:20<01:00, 1.24it/s]
evaluating Epoch: 26%|[32m██▌ [0m| 26/100 [00:22<00:59, 1.24it/s]
evaluating Epoch: 26%|[32m██▌ [0m| 26/100 [00:21<00:59, 1.24it/s]
evaluating Epoch: 27%|[32m██▋ [0m| 27/100 [00:21<00:58, 1.24it/s]
evaluating Epoch: 27%|[32m██▋ [0m| 27/100 [00:23<00:58, 1.24it/s]
evaluating Epoch: 28%|[32m██▊ [0m| 28/100 [00:22<00:57, 1.24it/s]
evaluating Epoch: 28%|[32m██▊ [0m| 28/100 [00:23<00:57, 1.24it/s]
evaluating Epoch: 29%|[32m██▉ [0m| 29/100 [00:23<00:57, 1.24it/s]
evaluating Epoch: 29%|[32m██▉ [0m| 29/100 [00:24<00:57, 1.24it/s]
evaluating Epoch: 30%|[32m███ [0m| 30/100 [00:25<00:56, 1.24it/s]
evaluating Epoch: 30%|[32m███ [0m| 30/100 [00:24<00:56, 1.24it/s]
evaluating Epoch: 31%|[32m███ [0m| 31/100 [00:25<00:55, 1.24it/s]
evaluating Epoch: 31%|[32m███ [0m| 31/100 [00:26<00:55, 1.24it/s]
evaluating Epoch: 32%|[32m███� [0m| 32/100 [00:27<00:54, 1.25it/s]
evaluating Epoch: 32%|[32m███� [0m| 32/100 [00:25<00:54, 1.24it/s]
evaluating Epoch: 33%|[32m███▎ [0m| 33/100 [00:26<00:53, 1.24it/s]
evaluating Epoch: 33%|[32m███▎ [0m| 33/100 [00:27<00:53, 1.24it/s]
evaluating Epoch: 34%|[32m███� [0m| 34/100 [00:28<00:52, 1.25it/s]
evaluating Epoch: 34%|[32m███� [0m| 34/100 [00:27<00:52, 1.25it/s]
evaluating Epoch: 35%|[32m███▌ [0m| 35/100 [00:28<00:52, 1.24it/s]
evaluating Epoch: 35%|[32m███▌ [0m| 35/100 [00:29<00:52, 1.24it/s]
evaluating Epoch: 36%|[32m███▌ [0m| 36/100 [00:30<00:51, 1.25it/s]
evaluating Epoch: 36%|[32m███▌ [0m| 36/100 [00:29<00:51, 1.25it/s]
evaluating Epoch: 37%|[32m███▋ [0m| 37/100 [00:29<00:50, 1.25it/s]
evaluating Epoch: 37%|[32m███▋ [0m| 37/100 [00:31<00:50, 1.25it/s]
evaluating Epoch: 38%|[32m███▊ [0m| 38/100 [00:31<00:49, 1.26it/s]
evaluating Epoch: 38%|[32m███▊ [0m| 38/100 [00:30<00:49, 1.26it/s]
evaluating Epoch: 39%|[32m███▉ [0m| 39/100 [00:31<00:50, 1.21it/s]
evaluating Epoch: 39%|[32m███▉ [0m| 39/100 [00:32<00:50, 1.21it/s]
evaluating Epoch: 40%|[32m████ [0m| 40/100 [00:32<00:50, 1.18it/s]
evaluating Epoch: 40%|[32m████ [0m| 40/100 [00:33<00:50, 1.18it/s]
evaluating Epoch: 41%|[32m████ [0m| 41/100 [00:34<00:51, 1.15it/s]
evaluating Epoch: 41%|[32m████ [0m| 41/100 [00:33<00:51, 1.15it/s]
evaluating Epoch: 42%|[32m████� [0m| 42/100 [00:35<00:51, 1.12it/s]
evaluating Epoch: 42%|[32m████� [0m| 42/100 [00:34<00:51, 1.12it/s]
evaluating Epoch: 43%|[32m████▎ [0m| 43/100 [00:36<00:51, 1.12it/s]
evaluating Epoch: 43%|[32m████▎ [0m| 43/100 [00:35<00:51, 1.12it/s]
evaluating Epoch: 44%|[32m████� [0m| 44/100 [00:36<00:50, 1.12it/s]
evaluating Epoch: 44%|[32m████� [0m| 44/100 [00:37<00:50, 1.11it/s]
evaluating Epoch: 45%|[32m████▌ [0m| 45/100 [00:37<00:49, 1.11it/s]
evaluating Epoch: 45%|[32m████▌ [0m| 45/100 [00:38<00:49, 1.11it/s]
evaluating Epoch: 46%|[32m████▌ [0m| 46/100 [00:38<00:49, 1.10it/s]
evaluating Epoch: 46%|[32m████▌ [0m| 46/100 [00:39<00:49, 1.10it/s]
evaluating Epoch: 47%|[32m████▋ [0m| 47/100 [00:40<00:48, 1.09it/s]
evaluating Epoch: 47%|[32m████▋ [0m| 47/100 [00:38<00:48, 1.09it/s]
evaluating Epoch: 48%|[32m████▊ [0m| 48/100 [00:39<00:47, 1.10it/s]
evaluating Epoch: 48%|[32m████▊ [0m| 48/100 [00:40<00:47, 1.10it/s]
evaluating Epoch: 49%|[32m████▉ [0m| 49/100 [00:40<00:46, 1.10it/s]
evaluating Epoch: 49%|[32m████▉ [0m| 49/100 [00:41<00:46, 1.10it/s]
evaluating Epoch: 50%|[32m█████ [0m| 50/100 [00:41<00:45, 1.11it/s]
evaluating Epoch: 50%|[32m█████ [0m| 50/100 [00:42<00:45, 1.11it/s]
evaluating Epoch: 51%|[32m█████ [0m| 51/100 [00:43<00:44, 1.10it/s]
evaluating Epoch: 51%|[32m█████ [0m| 51/100 [00:42<00:44, 1.10it/s]
evaluating Epoch: 52%|[32m█████� [0m| 52/100 [00:43<00:43, 1.11it/s]
evaluating Epoch: 52%|[32m█████� [0m| 52/100 [00:44<00:43, 1.10it/s]
evaluating Epoch: 53%|[32m█████▎ [0m| 53/100 [00:44<00:42, 1.11it/s]
evaluating Epoch: 53%|[32m█████▎ [0m| 53/100 [00:45<00:42, 1.11it/s]
evaluating Epoch: 54%|[32m█████� [0m| 54/100 [00:45<00:41, 1.11it/s]
evaluating Epoch: 54%|[32m█████� [0m| 54/100 [00:46<00:41, 1.11it/s]
evaluating Epoch: 55%|[32m█████▌ [0m| 55/100 [00:46<00:40, 1.11it/s]
evaluating Epoch: 55%|[32m█████▌ [0m| 55/100 [00:47<00:40, 1.11it/s]
evaluating Epoch: 56%|[32m█████▌ [0m| 56/100 [00:47<00:39, 1.11it/s]
evaluating Epoch: 56%|[32m█████▌ [0m| 56/100 [00:48<00:39, 1.11it/s]
evaluating Epoch: 57%|[32m█████▋ [0m| 57/100 [00:49<00:38, 1.11it/s]
evaluating Epoch: 57%|[32m█████▋ [0m| 57/100 [00:47<00:38, 1.11it/s]
evaluating Epoch: 58%|[32m█████▊ [0m| 58/100 [00:48<00:38, 1.10it/s]
evaluating Epoch: 58%|[32m█████▊ [0m| 58/100 [00:49<00:38, 1.10it/s]
evaluating Epoch: 59%|[32m█████▉ [0m| 59/100 [00:50<00:37, 1.09it/s]
evaluating Epoch: 59%|[32m█████▉ [0m| 59/100 [00:49<00:37, 1.09it/s]
evaluating Epoch: 60%|[32m██████ [0m| 60/100 [00:50<00:36, 1.09it/s]
evaluating Epoch: 60%|[32m██████ [0m| 60/100 [00:51<00:36, 1.09it/s]
evaluating Epoch: 61%|[32m██████ [0m| 61/100 [00:51<00:35, 1.10it/s]
evaluating Epoch: 61%|[32m██████ [0m| 61/100 [00:52<00:35, 1.10it/s]
evaluating Epoch: 62%|[32m██████� [0m| 62/100 [00:53<00:34, 1.11it/s]
evaluating Epoch: 62%|[32m██████� [0m| 62/100 [00:52<00:34, 1.11it/s]
evaluating Epoch: 63%|[32m██████▎ [0m| 63/100 [00:53<00:33, 1.11it/s]
evaluating Epoch: 63%|[32m██████▎ [0m| 63/100 [00:54<00:33, 1.11it/s]
evaluating Epoch: 64%|[32m██████� [0m| 64/100 [00:54<00:32, 1.11it/s]
evaluating Epoch: 64%|[32m██████� [0m| 64/100 [00:55<00:32, 1.11it/s]
evaluating Epoch: 65%|[32m██████▌ [0m| 65/100 [00:55<00:31, 1.11it/s]
evaluating Epoch: 65%|[32m██████▌ [0m| 65/100 [00:56<00:31, 1.11it/s]
evaluating Epoch: 66%|[32m██████▌ [0m| 66/100 [00:56<00:30, 1.11it/s]
evaluating Epoch: 66%|[32m██████▌ [0m| 66/100 [00:57<00:30, 1.11it/s]
evaluating Epoch: 67%|[32m██████▋ [0m| 67/100 [00:58<00:29, 1.11it/s]
evaluating Epoch: 67%|[32m██████▋ [0m| 67/100 [00:57<00:29, 1.11it/s]
evaluating Epoch: 68%|[32m██████▊ [0m| 68/100 [00:57<00:28, 1.11it/s]
evaluating Epoch: 68%|[32m██████▊ [0m| 68/100 [00:59<00:28, 1.11it/s]
evaluating Epoch: 69%|[32m██████▉ [0m| 69/100 [00:58<00:28, 1.10it/s]
evaluating Epoch: 69%|[32m██████▉ [0m| 69/100 [00:59<00:28, 1.10it/s]
evaluating Epoch: 70%|[32m███████ [0m| 70/100 [01:00<00:27, 1.11it/s]
evaluating Epoch: 70%|[32m███████ [0m| 70/100 [00:59<00:27, 1.11it/s]
evaluating Epoch: 71%|[32m███████ [0m| 71/100 [01:00<00:26, 1.10it/s]
evaluating Epoch: 71%|[32m███████ [0m| 71/100 [01:01<00:26, 1.10it/s]
evaluating Epoch: 72%|[32m███████� [0m| 72/100 [01:01<00:25, 1.11it/s]
evaluating Epoch: 72%|[32m███████� [0m| 72/100 [01:02<00:25, 1.11it/s]
evaluating Epoch: 73%|[32m███████▎ [0m| 73/100 [01:03<00:24, 1.09it/s]
evaluating Epoch: 73%|[32m███████▎ [0m| 73/100 [01:02<00:24, 1.09it/s]
evaluating Epoch: 74%|[32m███████� [0m| 74/100 [01:04<00:23, 1.09it/s]
evaluating Epoch: 74%|[32m███████� [0m| 74/100 [01:03<00:23, 1.08it/s]
evaluating Epoch: 75%|[32m███████▌ [0m| 75/100 [01:05<00:23, 1.08it/s]
evaluating Epoch: 75%|[32m███████▌ [0m| 75/100 [01:04<00:23, 1.07it/s]
evaluating Epoch: 76%|[32m███████▌ [0m| 76/100 [01:05<00:22, 1.07it/s]
evaluating Epoch: 76%|[32m███████▌ [0m| 76/100 [01:06<00:22, 1.07it/s]
evaluating Epoch: 77%|[32m███████▋ [0m| 77/100 [01:07<00:21, 1.08it/s]
evaluating Epoch: 77%|[32m███████▋ [0m| 77/100 [01:06<00:21, 1.08it/s]
evaluating Epoch: 78%|[32m███████▊ [0m| 78/100 [01:08<00:20, 1.09it/s]
evaluating Epoch: 78%|[32m███████▊ [0m| 78/100 [01:07<00:20, 1.09it/s]
evaluating Epoch: 79%|[32m███████▉ [0m| 79/100 [01:09<00:19, 1.09it/s]
evaluating Epoch: 79%|[32m███████▉ [0m| 79/100 [01:08<00:19, 1.09it/s]
evaluating Epoch: 80%|[32m████████ [0m| 80/100 [01:08<00:18, 1.10it/s]
evaluating Epoch: 80%|[32m████████ [0m| 80/100 [01:10<00:18, 1.10it/s]
evaluating Epoch: 81%|[32m████████ [0m| 81/100 [01:09<00:17, 1.10it/s]
evaluating Epoch: 81%|[32m████████ [0m| 81/100 [01:10<00:17, 1.10it/s]
evaluating Epoch: 82%|[32m████████� [0m| 82/100 [01:10<00:16, 1.10it/s]
evaluating Epoch: 82%|[32m████████� [0m| 82/100 [01:11<00:16, 1.10it/s]
evaluating Epoch: 83%|[32m████████▎ [0m| 83/100 [01:12<00:15, 1.10it/s]
evaluating Epoch: 83%|[32m████████▎ [0m| 83/100 [01:11<00:15, 1.10it/s]
evaluating Epoch: 84%|[32m████████� [0m| 84/100 [01:12<00:14, 1.10it/s]
evaluating Epoch: 84%|[32m████████� [0m| 84/100 [01:13<00:14, 1.10it/s]
evaluating Epoch: 85%|[32m████████▌ [0m| 85/100 [01:14<00:13, 1.10it/s]
evaluating Epoch: 85%|[32m████████▌ [0m| 85/100 [01:13<00:13, 1.09it/s]
evaluating Epoch: 86%|[32m████████▌ [0m| 86/100 [01:14<00:12, 1.10it/s]
evaluating Epoch: 86%|[32m████████▌ [0m| 86/100 [01:15<00:12, 1.10it/s]
evaluating Epoch: 87%|[32m████████▋ [0m| 87/100 [01:16<00:11, 1.10it/s]
evaluating Epoch: 87%|[32m████████▋ [0m| 87/100 [01:15<00:11, 1.10it/s]
evaluating Epoch: 88%|[32m████████▊ [0m| 88/100 [01:16<00:10, 1.10it/s]
evaluating Epoch: 88%|[32m████████▊ [0m| 88/100 [01:17<00:10, 1.10it/s]
evaluating Epoch: 89%|[32m████████▉ [0m| 89/100 [01:17<00:09, 1.10it/s]
evaluating Epoch: 89%|[32m████████▉ [0m| 89/100 [01:18<00:10, 1.10it/s]
evaluating Epoch: 90%|[32m█████████ [0m| 90/100 [01:18<00:09, 1.10it/s]
evaluating Epoch: 90%|[32m█████████ [0m| 90/100 [01:19<00:09, 1.10it/s]
evaluating Epoch: 91%|[32m█████████ [0m| 91/100 [01:18<00:08, 1.10it/s]
evaluating Epoch: 91%|[32m█████████ [0m| 91/100 [01:20<00:08, 1.10it/s]
evaluating Epoch: 92%|[32m█████████�[0m| 92/100 [01:20<00:07, 1.10it/s]
evaluating Epoch: 92%|[32m█████████�[0m| 92/100 [01:19<00:07, 1.10it/s]
evaluating Epoch: 93%|[32m█████████▎[0m| 93/100 [01:21<00:06, 1.10it/s]
evaluating Epoch: 93%|[32m█████████▎[0m| 93/100 [01:20<00:06, 1.10it/s]
evaluating Epoch: 94%|[32m█████████�[0m| 94/100 [01:21<00:05, 1.11it/s]
evaluating Epoch: 94%|[32m█████████�[0m| 94/100 [01:22<00:05, 1.11it/s]
evaluating Epoch: 95%|[32m█████████▌[0m| 95/100 [01:22<00:04, 1.11it/s]
evaluating Epoch: 95%|[32m█████████▌[0m| 95/100 [01:23<00:04, 1.11it/s]
evaluating Epoch: 96%|[32m█████████▌[0m| 96/100 [01:23<00:03, 1.11it/s]
evaluating Epoch: 96%|[32m█████████▌[0m| 96/100 [01:24<00:03, 1.11it/s]
evaluating Epoch: 97%|[32m█████████▋[0m| 97/100 [01:24<00:02, 1.11it/s]
evaluating Epoch: 97%|[32m█████████▋[0m| 97/100 [01:25<00:02, 1.10it/s]
evaluating Epoch: 98%|[32m█████████▊[0m| 98/100 [01:26<00:01, 1.09it/s]
evaluating Epoch: 98%|[32m█████████▊[0m| 98/100 [01:25<00:01, 1.09it/s]
evaluating Epoch: 99%|[32m█████████▉[0m| 99/100 [01:26<00:00, 1.09it/s]
evaluating Epoch: 99%|[32m█████████▉[0m| 99/100 [01:27<00:00, 1.09it/s]
evaluating Epoch: 100%|[32m██████████[0m| 100/100 [01:27<00:00, 1.08it/s]
evaluating Epoch: 100%|[32m██████████[0m| 100/100 [01:28<00:00, 1.08it/s]
evaluating Epoch: 100%|[32m██████████[0m| 100/100 [01:27<00:00, 1.15it/s]
evaluating Epoch: 100%|[32m██████████[0m| 100/100 [01:28<00:00, 1.13it/s]
eval_ppl=tensor(nan, device='cuda:0') eval_epoch_loss=tensor(nan, device='cuda:0')
Epoch 1: train_perplexity=2.8321, train_epoch_loss=1.0410, epoch time 406.24218282848597s
Key: avg_train_prep, Value: 2.8321006298065186
Key: avg_train_loss, Value: 1.0410187244415283
Key: avg_eval_prep, Value: nan
Key: avg_eval_loss, Value: inf
Key: avg_epoch_time, Value: 406.24218282848597
Key: avg_checkpoint_time, Value: 7.697194814682007e-05
Yes, your eval loss is NaN, so no checkpoint gets saved:
evaluating Epoch: 100%|██████████| 100/100 [01:28<00:00, 1.13it/s]
eval_ppl=tensor(nan, device='cuda:0') eval_epoch_loss=tensor(nan, device='cuda:0')
Your checkpoint file also seems to be corrupted, as there are weights missing:
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf and are newly initialized: ['model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', ...'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
It also seems like you're using an older version of llama-recipes, as you're passing micro_batch_size, which was removed recently. Please update to the latest version to make sure you have all current fixes.
So why would my eval loss be NaN? Is there a problem with the dataset I used for training? Or a problem with my parameters?
That can have many reasons. Are you using the original alpaca JSON or a modification? Did you figure out why some weights are not initialized?
I am using the original alpaca JSON. The reason some weights were not initialized may be that I am using the CodeLlama model while this fine-tuning code targets Llama. I therefore switched to the Llama 2 model for fine-tuning, and the uninitialized-weight warnings disappeared, but the eval loss is still NaN.
There are some issues (#146) with the max_words setting in the alpaca dataset; we are looking into fixing it.
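For what it's worth, one way an overly aggressive truncation setting can surface as a NaN loss (a hypothetical illustration, not a confirmed diagnosis of #146): if truncation leaves every label token masked with -100, the mean cross-entropy becomes 0/0.

import torch
import torch.nn.functional as F

# Hypothetical illustration: if all targets end up at the ignore index (-100),
# the mean-reduced cross-entropy is 0/0, which PyTorch reports as NaN.
logits = torch.randn(4, 32000)                     # 4 tokens, Llama-sized vocab
labels = torch.full((4,), -100, dtype=torch.long)  # every label masked out
print(F.cross_entropy(logits, labels, ignore_index=-100))  # tensor(nan)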
75%|�[32m███████▌ �[0m| 75/100 [01:05<00:23, 1.08it/s] evaluating Epoch: 75%|�[32m███████▌ �[0m| 75/100 [01:04<00:23, 1.07it/s] evaluating Epoch: 76%|�[32m███████▌ �[0m| 76/100 [01:05<00:22, 1.07it/s] evaluating Epoch: 76%|�[32m███████▌ �[0m| 76/100 [01:06<00:22, 1.07it/s] evaluating Epoch: 77%|�[32m███████▋ �[0m| 77/100 [01:07<00:21, 1.08it/s] evaluating Epoch: 77%|�[32m███████▋ �[0m| 77/100 [01:06<00:21, 1.08it/s] evaluating Epoch: 78%|�[32m███████▊ �[0m| 78/100 [01:08<00:20, 1.09it/s] evaluating Epoch: 78%|�[32m███████▊ �[0m| 78/100 [01:07<00:20, 1.09it/s] evaluating Epoch: 79%|�[32m███████▉ �[0m| 79/100 [01:09<00:19, 1.09it/s] evaluating Epoch: 79%|�[32m███████▉ �[0m| 79/100 [01:08<00:19, 1.09it/s] evaluating Epoch: 80%|�[32m████████ �[0m| 80/100 [01:08<00:18, 1.10it/s] evaluating Epoch: 80%|�[32m████████ �[0m| 80/100 [01:10<00:18, 1.10it/s] evaluating Epoch: 81%|�[32m████████ �[0m| 81/100 [01:09<00:17, 1.10it/s] evaluating Epoch: 81%|�[32m████████ �[0m| 81/100 [01:10<00:17, 1.10it/s] evaluating Epoch: 82%|�[32m████████� �[0m| 82/100 [01:10<00:16, 1.10it/s] evaluating Epoch: 82%|�[32m████████� �[0m| 82/100 [01:11<00:16, 1.10it/s] evaluating Epoch: 83%|�[32m████████▎ �[0m| 83/100 [01:12<00:15, 1.10it/s] evaluating Epoch: 83%|�[32m████████▎ �[0m| 83/100 [01:11<00:15, 1.10it/s] evaluating Epoch: 84%|�[32m████████� �[0m| 84/100 [01:12<00:14, 1.10it/s] evaluating Epoch: 84%|�[32m████████� �[0m| 84/100 [01:13<00:14, 1.10it/s] evaluating Epoch: 85%|�[32m████████▌ �[0m| 85/100 [01:14<00:13, 1.10it/s] evaluating Epoch: 85%|�[32m████████▌ �[0m| 85/100 [01:13<00:13, 1.09it/s] evaluating Epoch: 86%|�[32m████████▌ �[0m| 86/100 [01:14<00:12, 1.10it/s] evaluating Epoch: 86%|�[32m████████▌ �[0m| 86/100 [01:15<00:12, 1.10it/s] evaluating Epoch: 87%|�[32m████████▋ �[0m| 87/100 [01:16<00:11, 1.10it/s] evaluating Epoch: 87%|�[32m████████▋ �[0m| 87/100 [01:15<00:11, 1.10it/s] evaluating Epoch: 88%|�[32m████████▊ �[0m| 88/100 [01:16<00:10, 1.10it/s] evaluating Epoch: 88%|�[32m████████▊ �[0m| 88/100 [01:17<00:10, 1.10it/s] evaluating Epoch: 89%|�[32m████████▉ �[0m| 89/100 [01:17<00:09, 1.10it/s] evaluating Epoch: 89%|�[32m████████▉ �[0m| 89/100 [01:18<00:10, 1.10it/s] evaluating Epoch: 90%|�[32m█████████ �[0m| 90/100 [01:18<00:09, 1.10it/s] evaluating Epoch: 90%|�[32m█████████ �[0m| 90/100 [01:19<00:09, 1.10it/s] evaluating Epoch: 91%|�[32m█████████ �[0m| 91/100 [01:18<00:08, 1.10it/s] evaluating Epoch: 91%|�[32m█████████ �[0m| 91/100 [01:20<00:08, 1.10it/s] evaluating Epoch: 92%|�[32m█████████��[0m| 92/100 [01:20<00:07, 1.10it/s] evaluating Epoch: 92%|�[32m█████████��[0m| 92/100 [01:19<00:07, 1.10it/s] evaluating Epoch: 93%|�[32m█████████▎�[0m| 93/100 [01:21<00:06, 1.10it/s] evaluating Epoch: 93%|�[32m█████████▎�[0m| 93/100 [01:20<00:06, 1.10it/s] evaluating Epoch: 94%|�[32m█████████��[0m| 94/100 [01:21<00:05, 1.11it/s] evaluating Epoch: 94%|�[32m█████████��[0m| 94/100 [01:22<00:05, 1.11it/s] evaluating Epoch: 95%|�[32m█████████▌�[0m| 95/100 [01:22<00:04, 1.11it/s] evaluating Epoch: 95%|�[32m█████████▌�[0m| 95/100 [01:23<00:04, 1.11it/s] evaluating Epoch: 96%|�[32m█████████▌�[0m| 96/100 [01:23<00:03, 1.11it/s] evaluating Epoch: 96%|�[32m█████████▌�[0m| 96/100 [01:24<00:03, 1.11it/s] evaluating Epoch: 97%|�[32m█████████▋�[0m| 97/100 [01:24<00:02, 1.11it/s] evaluating Epoch: 97%|�[32m█████████▋�[0m| 97/100 [01:25<00:02, 1.10it/s] evaluating Epoch: 98%|�[32m█████████▊�[0m| 98/100 [01:26<00:01, 1.09it/s] evaluating Epoch: 98%|�[32m█████████▊�[0m| 98/100 [01:25<00:01, 1.09it/s] evaluating Epoch: 
99%|�[32m█████████▉�[0m| 99/100 [01:26<00:00, 1.09it/s] evaluating Epoch: 99%|�[32m█████████▉�[0m| 99/100 [01:27<00:00, 1.09it/s] evaluating Epoch: 100%|�[32m██████████�[0m| 100/100 [01:27<00:00, 1.08it/s] evaluating Epoch: 100%|�[32m██████████�[0m| 100/100 [01:28<00:00, 1.08it/s] evaluating Epoch: 100%|�[32m██████████�[0m| 100/100 [01:27<00:00, 1.15it/s] evaluating Epoch: 100%|�[32m██████████�[0m| 100/100 [01:28<00:00, 1.13it/s] eval_ppl=tensor(nan, device='cuda:0') eval_epoch_loss=tensor(nan, device='cuda:0') Epoch 1: train_perplexity=2.8321, train_epoch_loss=1.0410, epoch time 406.24218282848597s Key: avg_train_prep, Value: 2.8321006298065186 Key: avg_train_loss, Value: 1.0410187244415283 Key: avg_eval_prep, Value: nan Key: avg_eval_loss, Value: inf Key: avg_epoch_time, Value: 406.24218282848597 Key: avg_checkpoint_time, Value: 7.697194814682007e-05
Hi, I have encountered the same issue. Did you manage to solve it?
I have the same problem. The eval loss is NaN. No output folder.
eval_ppl=tensor(nan, device='cuda:0') eval_epoch_loss=tensor(nan, device='cuda:0')
Epoch 1: train_perplexity=2.9854, train_epoch_loss=1.0937, epoch time 3146.0126402731985s
Key: avg_train_prep, Value: 2.985416889190674
Key: avg_train_loss, Value: 1.09373939037323
Key: avg_eval_prep, Value: nan
Key: avg_eval_loss, Value: inf
Key: avg_epoch_time, Value: 3146.0126402731985
Key: avg_checkpoint_time, Value: 6.0286372900009155e-05
Hi @BugmakerCC can you check your eval loss and post the log of your training run? We've seen the eval loss turning to Inf, which prevents a checkpoint from being saved because we set the initial best_eval_loss to Inf.
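For context, the save condition described above behaves like the minimal sketch below (illustrative names only, not the actual llama-recipes source): once the eval loss comes back as NaN or Inf, it can never compare as smaller than a best_eval_loss initialized to infinity, so the save branch is never taken and no output folder is written.

```python
# Minimal sketch of the checkpoint-gating logic described in the comment above.
# Names are illustrative; this is not the llama-recipes implementation.
def maybe_save_checkpoint(eval_loss: float, best_eval_loss: float) -> float:
    if eval_loss < best_eval_loss:  # NaN < inf and inf < inf are both False
        print(f"saving checkpoint (eval_loss={eval_loss})")
        return eval_loss
    print(f"not saving (eval_loss={eval_loss}, best_eval_loss={best_eval_loss})")
    return best_eval_loss

best = float("inf")
best = maybe_save_checkpoint(float("nan"), best)  # skipped -> matches the NaN eval loss in the logs
best = maybe_save_checkpoint(1.04, best)          # a finite eval loss would be saved
```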
Here is my log:
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[2023-09-18 08:00:02,178] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-18 08:00:02,256] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
Loading checkpoint shards: 100%| 3/3 [00:13<00:00, ~4.2-4.5s/it] (both ranks)
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf and are newly initialized: ['model.layers.{0-39}.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
--> Model /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf
--> /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf has 13016.02816 Million params
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'. The class this function is called from is 'LlamaTokenizer'.
trainable params: 6,553,600 || all params: 13,022,581,760 || trainable%: 0.050324890415585306
bFloat16 enabled for mixed precision - using bfSixteen policy
--> applying fsdp activation checkpointing...
--> Training Set Length = 6233
--> Validation Set Length = 200
/GPUFS/nsccgz_ywang_zfd/anaconda3/lib/python3.8/site-packages/torch/cuda/memory.py:303: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
Training Epoch 1/1: 12/12 steps completed on both ranks [06:44<00:00, ~33.7s/it]; per-rank step losses fell from ~1.16/1.30 at step 0 to ~0.69/0.89 at step 11
Max CUDA memory allocated was 58 GB
Max CUDA memory reserved was 77 GB
Peak active CUDA memory was 58 GB
Cuda Malloc retires : 35
CPU Total Peak Memory consumed during the train (max): 3 GB
evaluating Epoch: 100%| 100/100 [01:28<00:00, ~1.1it/s]
eval_ppl=tensor(nan, device='cuda:0') eval_epoch_loss=tensor(nan, device='cuda:0')
Epoch 1: train_perplexity=2.8321, train_epoch_loss=1.0410, epoch time 406.24218282848597s
Key: avg_train_prep, Value: 2.8321006298065186
Key: avg_train_loss, Value: 1.0410187244415283
Key: avg_eval_prep, Value: nan
Key: avg_eval_loss, Value: inf
Key: avg_epoch_time, Value: 406.24218282848597
Key: avg_checkpoint_time, Value: 7.697194814682007e-05
I haven't solved this problem yet, but I suspect it's a dataset issue: some malformed samples may be driving the loss to NaN. I haven't managed to fix it, but I think it should be possible to guard against this by adding a few checks to the training source code.
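A minimal sketch of the kind of guard being suggested here (hypothetical helper, not code from llama-recipes): skip evaluation batches whose loss comes back non-finite so a few bad samples cannot turn the whole eval loss into NaN.

```python
import torch

def safe_eval_loss(model, eval_dataloader, device="cuda"):
    """Average eval loss while skipping batches with a NaN/Inf loss.

    Hypothetical helper for illustration only; real code would also log which
    batches were skipped so the offending samples can be inspected.
    """
    model.eval()
    total_loss, counted = 0.0, 0
    with torch.no_grad():
        for step, batch in enumerate(eval_dataloader):
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss
            if not torch.isfinite(loss):
                print(f"skipping eval batch {step}: non-finite loss {loss.item()}")
                continue
            total_loss += loss.item()
            counted += 1
    if counted == 0:
        return float("inf")  # nothing usable; keeps the best_eval_loss comparison well-defined
    return total_loss / counted
```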
75%|�[32m███████▌ �[0m| 75/100 [01:05<00:23, 1.08it/s] evaluating Epoch: 75%|�[32m███████▌ �[0m| 75/100 [01:04<00:23, 1.07it/s] evaluating Epoch: 76%|�[32m███████▌ �[0m| 76/100 [01:05<00:22, 1.07it/s] evaluating Epoch: 76%|�[32m███████▌ �[0m| 76/100 [01:06<00:22, 1.07it/s] evaluating Epoch: 77%|�[32m███████▋ �[0m| 77/100 [01:07<00:21, 1.08it/s] evaluating Epoch: 77%|�[32m███████▋ �[0m| 77/100 [01:06<00:21, 1.08it/s] evaluating Epoch: 78%|�[32m███████▊ �[0m| 78/100 [01:08<00:20, 1.09it/s] evaluating Epoch: 78%|�[32m███████▊ �[0m| 78/100 [01:07<00:20, 1.09it/s] evaluating Epoch: 79%|�[32m███████▉ �[0m| 79/100 [01:09<00:19, 1.09it/s] evaluating Epoch: 79%|�[32m███████▉ �[0m| 79/100 [01:08<00:19, 1.09it/s] evaluating Epoch: 80%|�[32m████████ �[0m| 80/100 [01:08<00:18, 1.10it/s] evaluating Epoch: 80%|�[32m████████ �[0m| 80/100 [01:10<00:18, 1.10it/s] evaluating Epoch: 81%|�[32m████████ �[0m| 81/100 [01:09<00:17, 1.10it/s] evaluating Epoch: 81%|�[32m████████ �[0m| 81/100 [01:10<00:17, 1.10it/s] evaluating Epoch: 82%|�[32m████████� �[0m| 82/100 [01:10<00:16, 1.10it/s] evaluating Epoch: 82%|�[32m████████� �[0m| 82/100 [01:11<00:16, 1.10it/s] evaluating Epoch: 83%|�[32m████████▎ �[0m| 83/100 [01:12<00:15, 1.10it/s] evaluating Epoch: 83%|�[32m████████▎ �[0m| 83/100 [01:11<00:15, 1.10it/s] evaluating Epoch: 84%|�[32m████████� �[0m| 84/100 [01:12<00:14, 1.10it/s] evaluating Epoch: 84%|�[32m████████� �[0m| 84/100 [01:13<00:14, 1.10it/s] evaluating Epoch: 85%|�[32m████████▌ �[0m| 85/100 [01:14<00:13, 1.10it/s] evaluating Epoch: 85%|�[32m████████▌ �[0m| 85/100 [01:13<00:13, 1.09it/s] evaluating Epoch: 86%|�[32m████████▌ �[0m| 86/100 [01:14<00:12, 1.10it/s] evaluating Epoch: 86%|�[32m████████▌ �[0m| 86/100 [01:15<00:12, 1.10it/s] evaluating Epoch: 87%|�[32m████████▋ �[0m| 87/100 [01:16<00:11, 1.10it/s] evaluating Epoch: 87%|�[32m████████▋ �[0m| 87/100 [01:15<00:11, 1.10it/s] evaluating Epoch: 88%|�[32m████████▊ �[0m| 88/100 [01:16<00:10, 1.10it/s] evaluating Epoch: 88%|�[32m████████▊ �[0m| 88/100 [01:17<00:10, 1.10it/s] evaluating Epoch: 89%|�[32m████████▉ �[0m| 89/100 [01:17<00:09, 1.10it/s] evaluating Epoch: 89%|�[32m████████▉ �[0m| 89/100 [01:18<00:10, 1.10it/s] evaluating Epoch: 90%|�[32m█████████ �[0m| 90/100 [01:18<00:09, 1.10it/s] evaluating Epoch: 90%|�[32m█████████ �[0m| 90/100 [01:19<00:09, 1.10it/s] evaluating Epoch: 91%|�[32m█████████ �[0m| 91/100 [01:18<00:08, 1.10it/s] evaluating Epoch: 91%|�[32m█████████ �[0m| 91/100 [01:20<00:08, 1.10it/s] evaluating Epoch: 92%|�[32m█████████��[0m| 92/100 [01:20<00:07, 1.10it/s] evaluating Epoch: 92%|�[32m█████████��[0m| 92/100 [01:19<00:07, 1.10it/s] evaluating Epoch: 93%|�[32m█████████▎�[0m| 93/100 [01:21<00:06, 1.10it/s] evaluating Epoch: 93%|�[32m█████████▎�[0m| 93/100 [01:20<00:06, 1.10it/s] evaluating Epoch: 94%|�[32m█████████��[0m| 94/100 [01:21<00:05, 1.11it/s] evaluating Epoch: 94%|�[32m█████████��[0m| 94/100 [01:22<00:05, 1.11it/s] evaluating Epoch: 95%|�[32m█████████▌�[0m| 95/100 [01:22<00:04, 1.11it/s] evaluating Epoch: 95%|�[32m█████████▌�[0m| 95/100 [01:23<00:04, 1.11it/s] evaluating Epoch: 96%|�[32m█████████▌�[0m| 96/100 [01:23<00:03, 1.11it/s] evaluating Epoch: 96%|�[32m█████████▌�[0m| 96/100 [01:24<00:03, 1.11it/s] evaluating Epoch: 97%|�[32m█████████▋�[0m| 97/100 [01:24<00:02, 1.11it/s] evaluating Epoch: 97%|�[32m█████████▋�[0m| 97/100 [01:25<00:02, 1.10it/s] evaluating Epoch: 98%|�[32m█████████▊�[0m| 98/100 [01:26<00:01, 1.09it/s] evaluating Epoch: 98%|�[32m█████████▊�[0m| 98/100 [01:25<00:01, 1.09it/s] evaluating Epoch: 
99%|�[32m█████████▉�[0m| 99/100 [01:26<00:00, 1.09it/s] evaluating Epoch: 99%|�[32m█████████▉�[0m| 99/100 [01:27<00:00, 1.09it/s] evaluating Epoch: 100%|�[32m██████████�[0m| 100/100 [01:27<00:00, 1.08it/s] evaluating Epoch: 100%|�[32m██████████�[0m| 100/100 [01:28<00:00, 1.08it/s] evaluating Epoch: 100%|�[32m██████████�[0m| 100/100 [01:27<00:00, 1.15it/s] evaluating Epoch: 100%|�[32m██████████�[0m| 100/100 [01:28<00:00, 1.13it/s] eval_ppl=tensor(nan, device='cuda:0') eval_epoch_loss=tensor(nan, device='cuda:0') Epoch 1: train_perplexity=2.8321, train_epoch_loss=1.0410, epoch time 406.24218282848597s Key: avg_train_prep, Value: 2.8321006298065186 Key: avg_train_loss, Value: 1.0410187244415283 Key: avg_eval_prep, Value: nan Key: avg_eval_loss, Value: inf Key: avg_epoch_time, Value: 406.24218282848597 Key: avg_checkpoint_time, Value: 7.697194814682007e-05
Hi, I have encountered the same issue. Did you manage to solve it?
I haven't solved this problem yet, but I suspect it's an issue with the dataset: some malformed samples may be causing the loss to become NaN. I think it should be possible to guard against this by adding a few checks to the source code.
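A minimal sketch of the kind of guard mentioned above (hypothetical, not part of llama-recipes): skip any optimizer step whose loss is non-finite, so a single bad batch doesn't poison the epoch loss and perplexity.

```python
import torch

def train_step(model, batch, optimizer):
    # Hypothetical guard: assumes a HuggingFace-style model that returns
    # an object with a .loss attribute. Skip the update when the loss is
    # NaN/Inf so one bad batch does not corrupt the epoch metrics.
    outputs = model(**batch)
    loss = outputs.loss
    if not torch.isfinite(loss):
        print(f"Skipping batch with non-finite loss: {loss.item()}")
        optimizer.zero_grad(set_to_none=True)
        return None
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()
```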
Yes, you are right. I downloaded the dataset with the command wget -P src/llama_recipes/datasets https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json, and training then completed successfully.
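If the root cause really is bad records in a locally prepared file, a quick sanity check of an Alpaca-style JSON file can help before training. This is an illustrative snippet only; it assumes the standard Alpaca schema (instruction/input/output) and the download path used above.

```python
import json

# Hypothetical check for an Alpaca-style dataset: flag records with missing
# or empty instruction/output fields, a common source of degenerate samples.
with open("src/llama_recipes/datasets/alpaca_data.json") as f:
    data = json.load(f)

bad = [
    i for i, rec in enumerate(data)
    if not isinstance(rec.get("instruction"), str) or not rec["instruction"].strip()
    or not isinstance(rec.get("output"), str) or not rec["output"].strip()
]
print(f"{len(data)} records, {len(bad)} suspicious; first few indices: {bad[:10]}")
```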
same
Hi! It seems that a solution has been provided to the issue and there has not been a follow-up conversation for a long time. I will close this issue for now and feel free to reopen it if you have any questions!