Jiaxing Qi （齐家兴） comments

Results 6 comments of



Methods to solve or mitigate the AMP training (inference) failing problem

Here is the training loss of the **baseline**, i.e. AMP training from scratch:

![image](https://user-images.githubusercontent.com/20978999/147623254-58c0861a-9e40-4a4a-a7c7-997f1e62a148.png)

Here is the training loss of **solution 1**:

![image](https://user-images.githubusercontent.com/20978999/147622890-76020875-06c1-44e7-bfdb-b9740ce663a4.png)

Here is the training loss of **solution 2**...

No disk space left while loading llama2-70B for SFT

This happens because the `/tmp` folder inside your container does not have enough space. NeMo untars the `.nemo` file into that folder, and for a 70B model it needs...
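A quick way to confirm this diagnosis is to check how much space the temporary directory has before loading the model. The following is only a minimal sketch using the Python standard library: the helper names `free_bytes` and `has_room` are made up for illustration, and `required_bytes` is a value you have to supply yourself, since it depends on the size of your unpacked checkpoint.

```python
import shutil
import tempfile
from typing import Optional


def free_bytes(directory: str) -> int:
    """Return the number of bytes available to the current user in `directory`."""
    return shutil.disk_usage(directory).free


def has_room(required_bytes: int, directory: Optional[str] = None) -> bool:
    """Check whether `directory` can hold `required_bytes`.

    If no directory is given, Python's default temp dir is used
    (usually /tmp, unless an environment variable such as TMPDIR overrides it).
    """
    target = directory or tempfile.gettempdir()
    return free_bytes(target) >= required_bytes


if __name__ == "__main__":
    tmp = tempfile.gettempdir()
    print(f"{tmp}: {free_bytes(tmp) / 1e9:.1f} GB free")
```

If the reported free space is smaller than the unpacked checkpoint, mounting a larger volume at `/tmp` when starting the container is one option. Whether NeMo's extraction step follows Python's default temp directory is an assumption here, so verify it against the NeMo version you use.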

No disk space left while loading llama2-70B for SFT

By default, there is no `/workspace/result` folder inside the NeMo container. Can you try passing an existing directory to `exp_manager.explicit_log_dir`?
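One way to make sure the directory exists is to create it before launching the job. This is a minimal sketch: the helper `log_dir_override` is hypothetical, and the `key=value` string it returns assumes the training script accepts Hydra-style command-line overrides.

```python
import os


def log_dir_override(path: str) -> str:
    """Create `path` if it is missing and return a command-line override for it."""
    # exist_ok=True makes this a no-op when the directory is already there.
    os.makedirs(path, exist_ok=True)
    return f"exp_manager.explicit_log_dir={path}"


if __name__ == "__main__":
    # /workspace/result is the path from the issue; any writable path works.
    print(log_dir_override("/workspace/result"))
```

The returned string can then be appended to the training command's argument list.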

No disk space left while loading llama2-70B for SFT

> Also how can I adapt Tiny Shakespeare dataset?

SFT normally requires data to be in style of . But the dataset you mentioned is not this type. Maybe you...

Convert Llama-7b to nemo: checkpoint ckpts are not saved as a model_weights.ckpt file

The converter `convert_llama_hf_to_nemo.py` should produce a `.nemo` file, not a directory. Can you try using the NeMo Docker image? https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo

Can not train llama-7b due to OOM on 40GA100

Add `WANDB_MODE=disabled` in front of the `torchrun` command.
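If the job is launched from a Python wrapper rather than directly from the shell, the same environment variable can be set on the child process. This is a sketch only: `run_without_wandb` is a hypothetical helper, and the command in the demo is a stand-in for your actual `torchrun` invocation.

```python
import os
import subprocess
import sys


def run_without_wandb(cmd, **kwargs):
    """Run `cmd` with WANDB_MODE=disabled added to a copy of the current environment."""
    env = dict(os.environ, WANDB_MODE="disabled")
    return subprocess.run(cmd, env=env, check=True, **kwargs)


if __name__ == "__main__":
    # Stand-in command that just echoes the variable; replace it with
    # something like ["torchrun", ...] followed by your own arguments.
    run_without_wandb(
        [sys.executable, "-c", "import os; print(os.environ['WANDB_MODE'])"]
    )
```

Copying `os.environ` instead of mutating it keeps the setting scoped to the launched process.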