
No output folder

Tejaswgupta opened this issue 1 year ago · 14 comments

System Info

Collecting environment information...
PyTorch version: 2.2.0.dev20230912+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.27.2
Libc version: glibc-2.31

Python version: 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-1040-azure-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.3.109
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100 80GB PCIe
GPU 1: NVIDIA A100 80GB PCIe
GPU 2: NVIDIA A100 80GB PCIe
GPU 3: NVIDIA A100 80GB PCIe

Nvidia driver version: 470.182.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7V13 64-Core Processor
Stepping: 1
CPU MHz: 2445.438
BogoMIPS: 4890.87
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 3 MiB
L1i cache: 3 MiB
L2 cache: 48 MiB
L3 cache: 384 MiB
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm

Versions of relevant libraries:
[pip3] flake8==6.0.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.21.6
[pip3] pytorch-transformers==1.0.0
[pip3] pytorch-triton==2.1.0+6e4932cda8
[pip3] torch==2.2.0.dev20230912+cu118
[pip3] torch-tb-profiler==0.4.1
[pip3] torchaudio==2.2.0.dev20230912+cu118
[pip3] torchvision==0.9.1
[pip3] triton==2.0.0
[conda] _pytorch_select 0.1 cpu_0 anaconda
[conda] blas 1.0 mkl anaconda
[conda] cudatoolkit 10.1.243 h6bb024c_0 anaconda
[conda] libmklml 2019.0.5 h06a4308_0 anaconda
[conda] mkl 2020.2 256 anaconda
[conda] numpy 1.21.6 py38h1d589f8_0 conda-forge
[conda] pytorch-transformers 1.0.0 pypi_0 pypi
[conda] pytorch-triton 2.1.0+6e4932cda8 pypi_0 pypi
[conda] torch 2.2.0.dev20230912+cu118 pypi_0 pypi
[conda] torch-tb-profiler 0.4.1 pypi_0 pypi
[conda] torchaudio 2.2.0.dev20230912+cu118 pypi_0 pypi
[conda] torchvision 0.9.1 py38_cu101 pytorch
[conda] triton 2.0.0 pypi_0 pypi

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

🐛 Describe the bug

I'm using a custom dataset with a custom data loader (5,000 samples) and the following command for fine-tuning:

torchrun --nnodes 1 --nproc_per_node 4  examples/finetuning.py 

The process completes without any issues; however, I don't see the provided output/checkpoint folder being created. I've tried running from a different directory, creating the directory manually, and changing the format of the directory name.

Train config:

@dataclass
class train_config:
    model_name: str="meta-llama/Llama-2-13b-hf"
    enable_fsdp: bool=True
    low_cpu_fsdp: bool=True
    run_validation: bool=False
    batch_size_training: int=12
    gradient_accumulation_steps: int=8
    num_epochs: int=1
    num_workers_dataloader: int=4
    lr: float=2e-4
    weight_decay: float=0.0
    gamma: float= 0.85
    seed: int=42
    use_fp16: bool=False
    mixed_precision: bool=True
    val_batch_size: int=1
    dataset = "legal_dataset"
    peft_method: str = "lora" # None , llama_adapter, prefix
    use_peft: bool=True
    output_dir: str = "./ft-output"
    freeze_layers: bool = False
    num_freeze_layers: int = 1
    quantization: bool = False
    one_gpu: bool = False
    save_model: bool = True
    dist_checkpoint_root_folder: str="model_checkpoints" # will be used if using FSDP
    dist_checkpoint_folder: str="fine-tuned" # will be used if using FSDP
    save_optimizer: bool=True # will be used if using FSDP
    use_fast_kernels: bool = True

Error logs

Logs from terminal:

Training Epoch: 1/1, step 16/18 completed (loss: 0.20255453884601593): : 3it [12:14, 214.40s/it]
Training Epoch: 1/1, step 16/18 completed (loss: 0.24286626279354095): : 3it [12:15, 214.48s/it]
Training Epoch: 1/1, step 16/18 completed (loss: 0.2025240808725357): : 3it [12:15, 214.51s/it]
Training Epoch: 1/1, step 16/18 completed (loss: 0.22577618062496185): : 3it [12:14, 214.43s/it]
Training Epoch: 1/1, step 17/18 completed (loss: 0.2233143001794815): : 3it [12:15, 245.06s/it]
Training Epoch: 1/1, step 17/18 completed (loss: 0.20939956605434418): : 3it [12:14, 244.96s/it]
Training Epoch: 1/1, step 17/18 completed (loss: 0.20539191365242004): : 3it [12:15, 245.12s/it]
Training Epoch: 1/1, step 17/18 completed (loss: 0.2597067356109619): : 3it [12:14, 244.99s/it]
Max CUDA memory allocated was 45 GB
Max CUDA memory reserved was 54 GB
Peak active CUDA memory was 46 GB
Cuda Malloc retires : 0
CPU Total Peak Memory consumed during the train (max): 7 GB
Epoch 1: train_perplexity=1.2628, train_epoch_loss=0.2333, epoch time 736.3742194380029s
Key: avg_train_prep, Value: 1.262771725654602
Key: avg_train_loss, Value: 0.2333090603351593
Key: avg_epoch_time, Value: 736.3742194380029
Key: avg_checkpoint_time, Value: 0

Expected behavior

Output folder is created.

Tejaswgupta avatar Sep 13 '23 16:09 Tejaswgupta

@Tejaswgupta thanks for flagging this. We need to revisit the saving logic. You selected run_validation: bool=False in your config, which effectively disables saving of the result. I'll try to create a PR ASAP. In the meantime, just enable run_validation and you should get the parameters saved. https://github.com/facebookresearch/llama-recipes/blob/c38bf5bdd370ceb93e71cfec1a07b0885a57e3ec/src/llama_recipes/utils/train_utils.py#L131
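
In code terms, that guard looks roughly like the sketch below. This is a paraphrase of the linked logic, not the verbatim train_utils.py source; the helper name maybe_save_checkpoint and its argument list are invented for illustration.

import math

def maybe_save_checkpoint(run_validation: bool, save_model: bool,
                          eval_epoch_loss: float, best_val_loss: float) -> float:
    # A checkpoint is written only when validation ran AND the eval loss
    # improved on best_val_loss, which starts out at +inf.
    if run_validation and save_model and eval_epoch_loss < best_val_loss:
        print("saving checkpoint ...")  # the real code calls the PEFT/FSDP savers here
        return eval_epoch_loss          # new best loss
    return best_val_loss

best = math.inf
best = maybe_save_checkpoint(False, True, 0.23, best)          # run_validation=False: never saves
best = maybe_save_checkpoint(True, True, float("nan"), best)   # NaN loss: nan < inf is False, never saves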

mreso avatar Sep 14 '23 13:09 mreso

My run_validation parameter is True, but the model file is still not written out. My configuration is as follows:

@dataclass
class train_config:
    model_name: str="PATH/to/LLAMA/7B"
    enable_fsdp: bool=False
    low_cpu_fsdp: bool=False
    run_validation: bool=True
    batch_size_training: int=4
    gradient_accumulation_steps: int=1
    num_epochs: int=1
    num_workers_dataloader: int=1
    lr: float=1e-4
    weight_decay: float=0.0
    gamma: float= 0.85
    seed: int=42
    use_fp16: bool=False
    mixed_precision: bool=True
    val_batch_size: int=1
    dataset = "samsum_dataset"
    peft_method: str = "lora" # None , llama_adapter, prefix
    use_peft: bool=False
    output_dir: str = "PATH/to/save/PEFT/model"
    freeze_layers: bool = False
    num_freeze_layers: int = 1
    quantization: bool = False
    one_gpu: bool = False
    save_model: bool = True
    dist_checkpoint_root_folder: str="PATH/to/save/FSDP/model" # will be used if using FSDP
    dist_checkpoint_folder: str="fine-tuned" # will be used if using FSDP
    save_optimizer: bool=False # will be used if using FSDP
    use_fast_kernels: bool = False # Enable using SDPA from PyTorch Accelerated Transformers, which makes use of Flash Attention and xFormers memory-efficient kernels

My command is as follows:

torchrun --nnodes 1 --nproc_per_node 2 /GPUFS/nsccgz_ywang_zfd/chenchong/llama-recipes-main/finetuning.py \
--enable_fsdp --use_peft --peft_method lora \
--model_name /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf \
--pure_bf16 \
--output_dir /GPUFS/nsccgz_ywang_zfd/chenchong/abtemp \
--dataset alpaca_dataset \
--data_path /GPUFS/nsccgz_ywang_zfd/chenchong/dataset.json \
--batch_size_training 256 \
--micro_batch_size 16 \
2>&1|tee /GPUFS/nsccgz_ywang_zfd/chenchong/ft.log

Everything was okay during the fine-tuning process, but no model file is written. How should I handle this?

BugmakerCC avatar Sep 18 '23 07:09 BugmakerCC

Hi @BugmakerCC, can you check your eval loss and post the log of your training run? We've seen the eval loss turn to Inf, which prevents a checkpoint from being saved, as we set the initial best_eval_loss to Inf.

mreso avatar Sep 18 '23 08:09 mreso

Hi @BugmakerCC, can you check your eval loss and post the log of your training run? We've seen the eval loss turn to Inf, which prevents a checkpoint from being saved, as we set the initial best_eval_loss to Inf.

Here is my log:

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[2023-09-18 08:00:02,178] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-18 08:00:02,256] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards:  33%|███▎      | 1/3 [00:05<00:10,  5.31s/it]
Loading checkpoint shards:  33%|███▎      | 1/3 [00:05<00:10,  5.38s/it]
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:10<00:05,  5.19s/it]
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:10<00:05,  5.22s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00,  4.16s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00,  4.46s/it]
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf and are newly initialized: ['model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.33.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.35.self_attn.rotary_emb.inv_freq', 'model.layers.36.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.32.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.37.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.34.self_attn.rotary_emb.inv_freq', 'model.layers.39.self_attn.rotary_emb.inv_freq', 'model.layers.38.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00,  4.22s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00,  4.50s/it]
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf and are newly initialized: ['model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.32.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.36.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.35.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.39.self_attn.rotary_emb.inv_freq', 'model.layers.33.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.34.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.37.self_attn.rotary_emb.inv_freq', 'model.layers.38.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
--> Model /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf

--> /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf has 13016.02816 Million params

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'. 
The class this function is called from is 'LlamaTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'. 
The class this function is called from is 'LlamaTokenizer'.
trainable params: 6,553,600 || all params: 13,022,581,760 || trainable%: 0.050324890415585306
bFloat16 enabled for mixed precision - using bfSixteen policy
trainable params: 6,553,600 || all params: 13,022,581,760 || trainable%: 0.050324890415585306
--> applying fsdp activation checkpointing...
--> Training Set Length = 6233
--> Validation Set Length = 200
--> applying fsdp activation checkpointing...
/GPUFS/nsccgz_ywang_zfd/anaconda3/lib/python3.8/site-packages/torch/cuda/memory.py:303: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(

Training Epoch: 1:   0%|          | 0/12 [00:00<?, ?it/s]
/GPUFS/nsccgz_ywang_zfd/anaconda3/lib/python3.8/site-packages/torch/cuda/memory.py:303: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(

Training Epoch: 1:   0%|          | 0/12 [00:00<?, ?it/s]
Training Epoch: 1/1, step 0/12 completed (loss: 1.1568443775177002):   8%|▊         | 1/12 [00:36<06:40, 36.40s/it]
Training Epoch: 1/1, step 0/12 completed (loss: 1.3038674592971802):   8%|▊         | 1/12 [00:36<06:45, 36.86s/it]
Training Epoch: 1/1, step 1/12 completed (loss: 1.401218056678772):  17%|█▋        | 2/12 [01:09<05:44, 34.42s/it]
Training Epoch: 1/1, step 1/12 completed (loss: 1.153338074684143):  17%|█▋        | 2/12 [01:09<05:42, 34.26s/it]
Training Epoch: 1/1, step 2/12 completed (loss: 1.1591726541519165):  25%|██▌       | 3/12 [01:44<05:14, 34.94s/it]
Training Epoch: 1/1, step 2/12 completed (loss: 1.2161827087402344):  25%|██▌       | 3/12 [01:45<05:15, 35.05s/it]
Training Epoch: 1/1, step 3/12 completed (loss: 1.1507656574249268):  33%|███▎      | 4/12 [02:18<04:33, 34.20s/it]
Training Epoch: 1/1, step 3/12 completed (loss: 0.978593111038208):  33%|███▎      | 4/12 [02:17<04:33, 34.14s/it]
Training Epoch: 1/1, step 4/12 completed (loss: 1.2940857410430908):  42%|████▏     | 5/12 [02:51<03:56, 33.82s/it]
Training Epoch: 1/1, step 4/12 completed (loss: 1.0917633771896362):  42%|████▏     | 5/12 [02:51<03:56, 33.85s/it]
Training Epoch: 1/1, step 5/12 completed (loss: 1.1019880771636963):  50%|█████     | 6/12 [03:24<03:21, 33.62s/it]
Training Epoch: 1/1, step 5/12 completed (loss: 1.0864710807800293):  50%|█████     | 6/12 [03:24<03:21, 33.64s/it]
Training Epoch: 1/1, step 6/12 completed (loss: 1.1066715717315674):  58%|█████▊    | 7/12 [03:57<02:47, 33.46s/it]
Training Epoch: 1/1, step 6/12 completed (loss: 0.8534350395202637):  58%|█████▊    | 7/12 [03:57<02:47, 33.45s/it]
Training Epoch: 1/1, step 7/12 completed (loss: 1.2339160442352295):  67%|██████▋   | 8/12 [04:31<02:13, 33.38s/it]
Training Epoch: 1/1, step 7/12 completed (loss: 0.809194028377533):  67%|██████▋   | 8/12 [04:30<02:13, 33.38s/it]
Training Epoch: 1/1, step 8/12 completed (loss: 1.0909826755523682):  75%|███████▌  | 9/12 [05:04<01:40, 33.39s/it]
Training Epoch: 1/1, step 8/12 completed (loss: 0.8365236520767212):  75%|███████▌  | 9/12 [05:04<01:40, 33.38s/it]
Training Epoch: 1/1, step 9/12 completed (loss: 0.8921104669570923):  83%|████████▎ | 10/12 [05:37<01:06, 33.38s/it]
Training Epoch: 1/1, step 9/12 completed (loss: 0.9189796447753906):  83%|████████▎ | 10/12 [05:37<01:06, 33.37s/it]
Training Epoch: 1/1, step 10/12 completed (loss: 0.7444747686386108):  92%|█████████▏| 11/12 [06:11<00:33, 33.40s/it]
Training Epoch: 1/1, step 10/12 completed (loss: 0.8288466334342957):  92%|█████████▏| 11/12 [06:10<00:33, 33.41s/it]
Training Epoch: 1/1, step 11/12 completed (loss: 0.8854449391365051): 100%|██████████| 12/12 [06:44<00:00, 33.72s/it]
Training Epoch: 1/1, step 11/12 completed (loss: 0.6895765662193298): 100%|██████████| 12/12 [06:44<00:00, 33.69s/it]
Max CUDA memory allocated was 58 GB
Max CUDA memory reserved was 77 GB
Peak active CUDA memory was 58 GB
Cuda Malloc retires : 35
CPU Total Peak Memory consumed during the train (max): 3 GB

evaluating Epoch:   0%|          | 0/100 [00:00<?, ?it/s]
evaluating Epoch:   0%|          | 0/100 [00:00<?, ?it/s]
...
evaluating Epoch: 100%|██████████| 100/100 [01:27<00:00,  1.15it/s]
evaluating Epoch: 100%|██████████| 100/100 [01:28<00:00,  1.13it/s]
 eval_ppl=tensor(nan, device='cuda:0') eval_epoch_loss=tensor(nan, device='cuda:0')
Epoch 1: train_perplexity=2.8321, train_epoch_loss=1.0410, epoch time 406.24218282848597s
Key: avg_train_prep, Value: 2.8321006298065186
Key: avg_train_loss, Value: 1.0410187244415283
Key: avg_eval_prep, Value: nan
Key: avg_eval_loss, Value: inf
Key: avg_epoch_time, Value: 406.24218282848597
Key: avg_checkpoint_time, Value: 7.697194814682007e-05

BugmakerCC avatar Sep 18 '23 08:09 BugmakerCC

Yes, your eval loss is NaN, so no checkpoint gets saved:

evaluating Epoch: 100%|██████████| 100/100 [01:28<00:00,  1.13it/s]
 eval_ppl=tensor(nan, device='cuda:0') eval_epoch_loss=tensor(nan, device='cuda:0')
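
The mechanics behind this: every ordered comparison against NaN evaluates to False, so the eval_epoch_loss < best_val_loss check can never succeed. A quick check in plain Python:

import math

print(float("nan") < math.inf)  # False: NaN is unordered against everything
print(math.inf < math.inf)      # False: an Inf eval loss never counts as an improvement either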

Your checkpoint file also seems to be corrupted, as there are weights missing:

Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf and are newly initialized: ['model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', ...'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Seems like you're using an older version of llama-recipes, as you're passing micro_batch_size, which was removed recently. Please update to the latest version to make sure you have all current fixes in.

mreso avatar Sep 18 '23 08:09 mreso

Yes, your eval loss is NaN, so no checkpoint gets saved:

evaluating Epoch: 100%|██████████| 100/100 [01:28<00:00,  1.13it/s]
 eval_ppl=tensor(nan, device='cuda:0') eval_epoch_loss=tensor(nan, device='cuda:0')

Your checkpoint file also seems to be corrupted, as there are weights missing:

Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf and are newly initialized: ['model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', ...'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Seems like you're using an older version of llama-recipes, as you're passing micro_batch_size, which was removed recently. Please update to the latest version to make sure you have all current fixes in.

So why would my eval loss be NaN? Is there a problem with the dataset I used for training? Or a problem with my parameters?

BugmakerCC avatar Sep 18 '23 14:09 BugmakerCC

Can have many reasons. Are you using the original alpaca json or a modification? Did you figure out why some weights are not initialized?

mreso avatar Sep 18 '23 14:09 mreso

Can have many reasons. Are you using the original alpaca json or a modification? Did you figure out why some weights are not initialized?

I am using the original alpaca JSON. The reason some weights were not initialized may be that I am using the CodeLlama model while this fine-tuning code targets Llama. I therefore switched to the Llama 2 model for fine-tuning, and the warnings about uninitialized weights disappeared, but that still does not solve the problem of the eval loss being NaN.

BugmakerCC avatar Sep 18 '23 14:09 BugmakerCC

There are some issues (#146) with the max_words setting in the alpaca dataset; we are looking into fixing it.
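
One way an over-aggressive max_words truncation can yield a NaN loss (an assumption about the mechanism, not a confirmed root cause of #146): if a sample's prompt alone exceeds max_words, every label in that sample ends up masked with -100, and cross-entropy averaged over zero unmasked tokens is NaN. A self-contained reproduction of that arithmetic (batch shape and vocab size are arbitrary):

import torch
import torch.nn.functional as F

logits = torch.randn(1, 8, 32000)   # (batch, seq_len, vocab), random values
labels = torch.full((1, 8), -100)   # every position masked, as after over-truncation
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                       ignore_index=-100)
print(loss)                         # tensor(nan): the mean is taken over zero tokens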

HamidShojanazeri avatar Sep 18 '23 14:09 HamidShojanazeri

Hi @BugmakerCC can you check your eval loss and post the log of your training run? We've seen the eval loss turning to Inf which prevents a checkpoint from being saved as we set the inital best_eval_loss to Inf.

Here is my log:

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[2023-09-18 08:00:02,178] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-18 08:00:02,256] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards:  33%|███▎      | 1/3 [00:05<00:10,  5.31s/it]
Loading checkpoint shards:  33%|███▎      | 1/3 [00:05<00:10,  5.38s/it]
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:10<00:05,  5.19s/it]
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:10<00:05,  5.22s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00,  4.16s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00,  4.46s/it]
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf and are newly initialized: ['model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.33.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.35.self_attn.rotary_emb.inv_freq', 'model.layers.36.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.32.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.37.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.34.self_attn.rotary_emb.inv_freq', 'model.layers.39.self_attn.rotary_emb.inv_freq', 'model.layers.38.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00,  4.22s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00,  4.50s/it]
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf and are newly initialized: ['model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.32.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.36.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.35.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.39.self_attn.rotary_emb.inv_freq', 'model.layers.33.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.34.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.37.self_attn.rotary_emb.inv_freq', 'model.layers.38.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
--> Model /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf

--> /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf has 13016.02816 Million params

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'. 
The class this function is called from is 'LlamaTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'. 
The class this function is called from is 'LlamaTokenizer'.
trainable params: 6,553,600 || all params: 13,022,581,760 || trainable%: 0.050324890415585306
bFloat16 enabled for mixed precision - using bfSixteen policy
trainable params: 6,553,600 || all params: 13,022,581,760 || trainable%: 0.050324890415585306
--> applying fsdp activation checkpointing...
--> Training Set Length = 6233
--> Validation Set Length = 200
--> applying fsdp activation checkpointing...
/GPUFS/nsccgz_ywang_zfd/anaconda3/lib/python3.8/site-packages/torch/cuda/memory.py:303: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(

Training Epoch: 1:   0%|�[34m          �[0m| 0/12 [00:00<?, ?it/s]/GPUFS/nsccgz_ywang_zfd/anaconda3/lib/python3.8/site-packages/torch/cuda/memory.py:303: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(

Training Epoch: 1:   0%|�[34m          �[0m| 0/12 [00:00<?, ?it/s]
Training Epoch: 1:   8%|�[34mâ–Š         �[0m| 1/12 [00:36<06:40, 36.40s/it]
Training Epoch: 1/1, step 0/12 completed (loss: 1.1568443775177002):   8%|�[34mâ–Š         �[0m| 1/12 [00:36<06:40, 36.40s/it]
Training Epoch: 1:   8%|�[34mâ–Š         �[0m| 1/12 [00:36<06:45, 36.86s/it]
Training Epoch: 1/1, step 0/12 completed (loss: 1.3038674592971802):   8%|�[34mâ–Š         �[0m| 1/12 [00:36<06:45, 36.86s/it]
Training Epoch: 1/1, step 0/12 completed (loss: 1.3038674592971802):  17%|�[34m█▋        �[0m| 2/12 [01:09<05:44, 34.42s/it]
Training Epoch: 1/1, step 1/12 completed (loss: 1.401218056678772):  17%|�[34m█▋        �[0m| 2/12 [01:09<05:44, 34.42s/it] 
Training Epoch: 1/1, step 0/12 completed (loss: 1.1568443775177002):  17%|�[34m█▋        �[0m| 2/12 [01:09<05:42, 34.26s/it]
Training Epoch: 1/1, step 1/12 completed (loss: 1.153338074684143):  17%|�[34m█▋        �[0m| 2/12 [01:09<05:42, 34.26s/it] 
Training Epoch: 1/1, step 1/12 completed (loss: 1.153338074684143):  25%|�[34m██▌       �[0m| 3/12 [01:44<05:14, 34.94s/it]
Training Epoch: 1/1, step 2/12 completed (loss: 1.1591726541519165):  25%|�[34m██▌       �[0m| 3/12 [01:44<05:14, 34.94s/it]
Training Epoch: 1/1, step 1/12 completed (loss: 1.401218056678772):  25%|�[34m██▌       �[0m| 3/12 [01:45<05:15, 35.05s/it]
Training Epoch: 1/1, step 2/12 completed (loss: 1.2161827087402344)
Training Epoch: 1/1, step 2/12 completed (loss: 1.1591726541519165)
Training Epoch: 1/1, step 3/12 completed (loss: 1.1507656574249268)
Training Epoch: 1/1, step 3/12 completed (loss: 0.978593111038208)
Training Epoch: 1/1, step 4/12 completed (loss: 1.2940857410430908)
Training Epoch: 1/1, step 4/12 completed (loss: 1.0917633771896362)
Training Epoch: 1/1, step 5/12 completed (loss: 1.1019880771636963)
Training Epoch: 1/1, step 5/12 completed (loss: 1.0864710807800293)
Training Epoch: 1/1, step 6/12 completed (loss: 1.1066715717315674)
Training Epoch: 1/1, step 6/12 completed (loss: 0.8534350395202637)
Training Epoch: 1/1, step 7/12 completed (loss: 1.2339160442352295)
Training Epoch: 1/1, step 7/12 completed (loss: 0.809194028377533)
Training Epoch: 1/1, step 8/12 completed (loss: 1.0909826755523682)
Training Epoch: 1/1, step 8/12 completed (loss: 0.8365236520767212)
Training Epoch: 1/1, step 9/12 completed (loss: 0.8921104669570923)
Training Epoch: 1/1, step 9/12 completed (loss: 0.9189796447753906)
Training Epoch: 1/1, step 10/12 completed (loss: 0.7444747686386108)
Training Epoch: 1/1, step 10/12 completed (loss: 0.8288466334342957)
Training Epoch: 1/1, step 11/12 completed (loss: 0.8854449391365051): 100%|██████████| 12/12 [06:44<00:00, 33.72s/it]
Training Epoch: 1/1, step 11/12 completed (loss: 0.6895765662193298): 100%|██████████| 12/12 [06:44<00:00, 33.69s/it]
Max CUDA memory allocated was 58 GB
Max CUDA memory reserved was 77 GB
Peak active CUDA memory was 58 GB
Cuda Malloc retires : 35
CPU Total Peak Memory consumed during the train (max): 3 GB

evaluating Epoch:   0%|          | 0/100 [00:00<?, ?it/s]
evaluating Epoch: 100%|██████████| 100/100 [01:27<00:00,  1.15it/s]
evaluating Epoch: 100%|██████████| 100/100 [01:28<00:00,  1.13it/s]
 eval_ppl=tensor(nan, device='cuda:0') eval_epoch_loss=tensor(nan, device='cuda:0')
Epoch 1: train_perplexity=2.8321, train_epoch_loss=1.0410, epoch time 406.24218282848597s
Key: avg_train_prep, Value: 2.8321006298065186
Key: avg_train_loss, Value: 1.0410187244415283
Key: avg_eval_prep, Value: nan
Key: avg_eval_loss, Value: inf
Key: avg_epoch_time, Value: 406.24218282848597
Key: avg_checkpoint_time, Value: 7.697194814682007e-05

Hi, I have encountered the same issue. Did you manage to solve it?

July-1024 avatar Sep 19 '23 07:09 July-1024

I have the same problem. The eval loss is NaN. No output folder.

eval_ppl=tensor(nan, device='cuda:0') eval_epoch_loss=tensor(nan, device='cuda:0')
Epoch 1: train_perplexity=2.9854, train_epoch_loss=1.0937, epoch time 3146.0126402731985s
Key: avg_train_prep, Value: 2.985416889190674
Key: avg_train_loss, Value: 1.09373939037323
Key: avg_eval_prep, Value: nan
Key: avg_eval_loss, Value: inf
Key: avg_epoch_time, Value: 3146.0126402731985
Key: avg_checkpoint_time, Value: 6.0286372900009155e-05

ACBBZ avatar Sep 19 '23 07:09 ACBBZ
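
(Side note on why a single bad batch is enough to produce the numbers above: the eval loop averages per-batch losses, and NaN propagates through the sum, the mean, and the exp() used for perplexity. A minimal PyTorch illustration — not code from llama-recipes itself:)

```python
import torch

# Hypothetical per-batch eval losses; one batch yielded NaN.
batch_losses = torch.tensor([0.91, 1.04, float("nan"), 0.87])

eval_epoch_loss = batch_losses.mean()  # tensor(nan) -- NaN poisons the mean
eval_ppl = torch.exp(eval_epoch_loss)  # tensor(nan) -- perplexity inherits it

print(f"eval_ppl={eval_ppl} eval_epoch_loss={eval_epoch_loss}")
```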

Hi @BugmakerCC can you check your eval loss and post the log of your training run? We've seen the eval loss turn to Inf, which prevents a checkpoint from being saved because we set the initial best_eval_loss to Inf.
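
(For context, the save decision follows the usual best-loss pattern, sketched roughly below — a paraphrase under that assumption, not the exact llama-recipes source. Since both `nan < inf` and `inf < inf` evaluate to False, a non-finite eval loss never wins the comparison, so the save branch never runs and no output folder is created:)

```python
import math

best_eval_loss = float("inf")  # initial best, as described above

def save_checkpoint() -> None:
    print("checkpoint saved")  # stand-in for the real FSDP/PEFT save path

def maybe_save_checkpoint(eval_epoch_loss: float) -> None:
    global best_eval_loss
    # float("nan") < best_eval_loss is False, and inf < inf is False,
    # so a NaN/Inf eval loss never triggers a save.
    if eval_epoch_loss < best_eval_loss:
        best_eval_loss = eval_epoch_loss
        save_checkpoint()

# A defensive variant would reject non-finite losses explicitly:
#   if math.isfinite(eval_epoch_loss) and eval_epoch_loss < best_eval_loss: ...
```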

Here is my log:

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[2023-09-18 08:00:02,178] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-18 08:00:02,256] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards:  33%|███▎      | 1/3 [00:05<00:10,  5.31s/it]
Loading checkpoint shards:  33%|███▎      | 1/3 [00:05<00:10,  5.38s/it]
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:10<00:05,  5.19s/it]
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:10<00:05,  5.22s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00,  4.16s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00,  4.46s/it]
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf and are newly initialized: ['model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.33.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.35.self_attn.rotary_emb.inv_freq', 'model.layers.36.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.32.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.37.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.34.self_attn.rotary_emb.inv_freq', 'model.layers.39.self_attn.rotary_emb.inv_freq', 'model.layers.38.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00,  4.22s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00,  4.50s/it]
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf and are newly initialized: ['model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.32.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.36.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.35.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.39.self_attn.rotary_emb.inv_freq', 'model.layers.33.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.34.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.37.self_attn.rotary_emb.inv_freq', 'model.layers.38.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
--> Model /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf

--> /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf has 13016.02816 Million params

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'. 
The class this function is called from is 'LlamaTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'. 
The class this function is called from is 'LlamaTokenizer'.
trainable params: 6,553,600 || all params: 13,022,581,760 || trainable%: 0.050324890415585306
bFloat16 enabled for mixed precision - using bfSixteen policy
trainable params: 6,553,600 || all params: 13,022,581,760 || trainable%: 0.050324890415585306
--> applying fsdp activation checkpointing...
--> Training Set Length = 6233
--> Validation Set Length = 200
--> applying fsdp activation checkpointing...
/GPUFS/nsccgz_ywang_zfd/anaconda3/lib/python3.8/site-packages/torch/cuda/memory.py:303: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(

/GPUFS/nsccgz_ywang_zfd/anaconda3/lib/python3.8/site-packages/torch/cuda/memory.py:303: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(

Training Epoch: 1:   0%|          | 0/12 [00:00<?, ?it/s]
Training Epoch: 1/1, step 0/12 completed (loss: 1.1568443775177002)
Training Epoch: 1/1, step 0/12 completed (loss: 1.3038674592971802)
Training Epoch: 1/1, step 1/12 completed (loss: 1.401218056678772)
Training Epoch: 1/1, step 1/12 completed (loss: 1.153338074684143)
Training Epoch: 1/1, step 2/12 completed (loss: 1.1591726541519165)
Training Epoch: 1/1, step 2/12 completed (loss: 1.2161827087402344)
Training Epoch: 1/1, step 3/12 completed (loss: 1.1507656574249268)
Training Epoch: 1/1, step 3/12 completed (loss: 0.978593111038208)
Training Epoch: 1/1, step 4/12 completed (loss: 1.2940857410430908)
Training Epoch: 1/1, step 4/12 completed (loss: 1.0917633771896362)
Training Epoch: 1/1, step 5/12 completed (loss: 1.1019880771636963)
Training Epoch: 1/1, step 5/12 completed (loss: 1.0864710807800293)
Training Epoch: 1/1, step 6/12 completed (loss: 1.1066715717315674)
Training Epoch: 1/1, step 6/12 completed (loss: 0.8534350395202637)
Training Epoch: 1/1, step 7/12 completed (loss: 1.2339160442352295)
Training Epoch: 1/1, step 7/12 completed (loss: 0.809194028377533)
Training Epoch: 1/1, step 8/12 completed (loss: 1.0909826755523682)
Training Epoch: 1/1, step 8/12 completed (loss: 0.8365236520767212)
Training Epoch: 1/1, step 9/12 completed (loss: 0.8921104669570923)
Training Epoch: 1/1, step 9/12 completed (loss: 0.9189796447753906)
Training Epoch: 1/1, step 10/12 completed (loss: 0.7444747686386108)
Training Epoch: 1/1, step 10/12 completed (loss: 0.8288466334342957)
Training Epoch: 1/1, step 11/12 completed (loss: 0.8854449391365051): 100%|██████████| 12/12 [06:44<00:00, 33.72s/it]
Training Epoch: 1/1, step 11/12 completed (loss: 0.6895765662193298): 100%|██████████| 12/12 [06:44<00:00, 33.69s/it]
Max CUDA memory allocated was 58 GB
Max CUDA memory reserved was 77 GB
Peak active CUDA memory was 58 GB
Cuda Malloc retires : 35
CPU Total Peak Memory consumed during the train (max): 3 GB

evaluating Epoch:   0%|          | 0/100 [00:00<?, ?it/s]
evaluating Epoch: 100%|██████████| 100/100 [01:27<00:00,  1.15it/s]
evaluating Epoch: 100%|██████████| 100/100 [01:28<00:00,  1.13it/s]
 eval_ppl=tensor(nan, device='cuda:0') eval_epoch_loss=tensor(nan, device='cuda:0')
Epoch 1: train_perplexity=2.8321, train_epoch_loss=1.0410, epoch time 406.24218282848597s
Key: avg_train_prep, Value: 2.8321006298065186
Key: avg_train_loss, Value: 1.0410187244415283
Key: avg_eval_prep, Value: nan
Key: avg_eval_loss, Value: inf
Key: avg_epoch_time, Value: 406.24218282848597
Key: avg_checkpoint_time, Value: 7.697194814682007e-05

Hi, I have encountered the same issue. Did you manage to solve it?

I haven't solved this problem yet, but my guess is that it's a dataset problem: some malformed samples may be driving the loss to NaN. I haven't confirmed this, but adding some guard statements to the training code (see the sketch below) should at least keep a bad batch from poisoning the whole run.
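
(One possible shape for such a guard — untested, and all names below are placeholders rather than llama-recipes internals — is to skip any batch whose loss comes back non-finite before it can contaminate the gradients or the epoch average:)

```python
import torch

def training_step(model, batch, optimizer):
    loss = model(**batch).loss
    if not torch.isfinite(loss):
        # Drop this batch entirely: no backward pass, no gradient update,
        # and it is excluded from the running epoch-loss average.
        optimizer.zero_grad(set_to_none=True)
        return None
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()
```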

BugmakerCC avatar Sep 19 '23 10:09 BugmakerCC

Hi @BugmakerCC can you check your eval loss and post the log of your training run? We've seen the eval loss turning to Inf which prevents a checkpoint from being saved as we set the inital best_eval_loss to Inf.

Here is my log:

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[2023-09-18 08:00:02,178] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-18 08:00:02,256] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards:  33%|███▎      | 1/3 [00:05<00:10,  5.31s/it]
Loading checkpoint shards:  33%|███▎      | 1/3 [00:05<00:10,  5.38s/it]
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:10<00:05,  5.19s/it]
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:10<00:05,  5.22s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00,  4.16s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00,  4.46s/it]
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf and are newly initialized: ['model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.33.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.35.self_attn.rotary_emb.inv_freq', 'model.layers.36.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.32.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.37.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.34.self_attn.rotary_emb.inv_freq', 'model.layers.39.self_attn.rotary_emb.inv_freq', 'model.layers.38.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00,  4.22s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00,  4.50s/it]
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf and are newly initialized: ['model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.32.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.36.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.35.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.39.self_attn.rotary_emb.inv_freq', 'model.layers.33.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.34.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.37.self_attn.rotary_emb.inv_freq', 'model.layers.38.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
--> Model /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf

--> /GPUFS/nsccgz_ywang_zfd/chenchong/CodeLlama-13b-Instruct-hf has 13016.02816 Million params

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'. 
The class this function is called from is 'LlamaTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'. 
The class this function is called from is 'LlamaTokenizer'.
trainable params: 6,553,600 || all params: 13,022,581,760 || trainable%: 0.050324890415585306
bFloat16 enabled for mixed precision - using bfSixteen policy
trainable params: 6,553,600 || all params: 13,022,581,760 || trainable%: 0.050324890415585306
--> applying fsdp activation checkpointing...
--> Training Set Length = 6233
--> Validation Set Length = 200
/GPUFS/nsccgz_ywang_zfd/anaconda3/lib/python3.8/site-packages/torch/cuda/memory.py:303: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(

Training Epoch: 1/1: 12/12 steps [06:44<00:00, ~33.5s/it] (two ranks; per-step losses, one value per rank):
  step 0/12 completed (loss: 1.1568 / 1.3039)
  step 1/12 completed (loss: 1.4012 / 1.1533)
  step 2/12 completed (loss: 1.1592 / 1.2162)
  step 3/12 completed (loss: 1.1508 / 0.9786)
  step 4/12 completed (loss: 1.2941 / 1.0918)
  step 5/12 completed (loss: 1.1020 / 1.0865)
  step 6/12 completed (loss: 1.1067 / 0.8534)
  step 7/12 completed (loss: 1.2339 / 0.8092)
  step 8/12 completed (loss: 1.0910 / 0.8365)
  step 9/12 completed (loss: 0.8921 / 0.9190)
  step 10/12 completed (loss: 0.7445 / 0.8288)
  step 11/12 completed (loss: 0.8854 / 0.6896)
Max CUDA memory allocated was 58 GB
Max CUDA memory reserved was 77 GB
Peak active CUDA memory was 58 GB
CUDA malloc retries: 35
CPU Total Peak Memory consumed during the train (max): 3 GB

evaluating Epoch: 100%|██████████| 100/100 [01:28<00:00, 1.13it/s] (two ranks, steady ~1.1 it/s throughout)
 eval_ppl=tensor(nan, device='cuda:0') eval_epoch_loss=tensor(nan, device='cuda:0')
Epoch 1: train_perplexity=2.8321, train_epoch_loss=1.0410, epoch time 406.24218282848597s
Key: avg_train_prep, Value: 2.8321006298065186
Key: avg_train_loss, Value: 1.0410187244415283
Key: avg_eval_prep, Value: nan
Key: avg_eval_loss, Value: inf
Key: avg_epoch_time, Value: 406.24218282848597
Key: avg_checkpoint_time, Value: 7.697194814682007e-05

Hi, I have encountered the same issue. Did you manage to solve it?

I haven't solved this problem yet, but I suspect it's a problem with the dataset: some malformed samples may be driving the loss to NaN (note that perplexity is just exp(loss), which is why eval_ppl comes out NaN as well). Although I haven't confirmed this, I think such cases could be prevented by adding a few guard statements to the source code, as sketched below.
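
For example, a guard of this kind could skip any batch that produces a non-finite loss. This is only a minimal sketch over a plain PyTorch loop with a Hugging Face-style model; the function and argument names are made up for illustration and are not the actual llama-recipes internals:

```python
import torch


def train_step_with_nan_guard(model, batch, optimizer):
    """Run one training step, skipping the update when the loss is not finite."""
    outputs = model(**batch)  # HF-style models return an object with .loss
    loss = outputs.loss

    # If a bad sample drove the loss to NaN/inf, drop this batch entirely
    # instead of letting it poison the optimizer state and running averages.
    if not torch.isfinite(loss):
        optimizer.zero_grad(set_to_none=True)
        print(f"non-finite loss ({loss.item()}) -- skipping batch")
        return None

    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()
```

The same `torch.isfinite` check could be applied when accumulating the evaluation loss, so a single bad validation sample cannot turn the whole eval_epoch_loss into NaN.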

Yes, you are right. I tried the dataset they provided, downloaded with the command `wget -P src/llama_recipes/datasets https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json`, and the run was successful.
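
Since the provided Alpaca file works and a custom file fails, a quick sanity pass over the JSON before training can surface the suspect samples. A rough sketch, assuming the alpaca-style instruction/input/output schema (`my_data.json` is a placeholder for your custom file):

```python
import json

# Placeholder path -- point this at the custom dataset being debugged.
with open("my_data.json", encoding="utf-8") as f:
    data = json.load(f)

bad = []
for i, sample in enumerate(data):
    # Flag entries that are not dicts, are missing keys, or have an empty
    # output, since empty/malformed targets are a common source of NaN losses.
    if not isinstance(sample, dict) or not sample.get("instruction") or not sample.get("output"):
        bad.append(i)

print(f"{len(data)} samples, {len(bad)} suspicious; first few indices: {bad[:20]}")
```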

July-1024 avatar Sep 20 '23 01:09 July-1024

same

lqqyyy avatar Jan 10 '24 17:01 lqqyyy

Hi! It seems that a solution has been provided for this issue and there has been no follow-up conversation for a long time. I will close this issue for now; feel free to reopen it if you have any questions!

wukaixingxp avatar May 31 '24 17:05 wukaixingxp