Sebastian Raschka
@Abecid What error are you getting with bfloat16? I think it's only supported on Ampere and newer GPUs, but it appears that it now also works on older T4s and on CPU...
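For reference, a quick way to check is something like this (a minimal sketch, assuming a reasonably recent PyTorch version):

```python
import torch

# Is a CUDA device visible at all?
print(torch.cuda.is_available())

# Does the GPU support bfloat16? (Ampere and newer return True;
# some older cards may as well, depending on the PyTorch/CUDA version.)
if torch.cuda.is_available():
    print(torch.cuda.is_bf16_supported())

# bfloat16 tensors on the CPU also work in recent PyTorch versions
x = torch.ones(2, 2, dtype=torch.bfloat16)
print(x.dtype)
```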
While this repository is only focused on the first Llama model to keep the code as simple and readable as possible, we have the [LitGPT repository](https://github.com/Lightning-AI/litgpt) (which is an extension...
I just saw your comment also in https://github.com/Lightning-AI/litgpt/issues/1333. Let's continue the discussion there.
I am not entirely sure, but https://github.com/Lightning-AI/lit-llama/blob/main/scripts/convert_checkpoint.py might be doing that
Yes, full finetuning is supported via the [finetune/full.py](https://github.com/Lightning-AI/lit-gpt/blob/main/finetune/full.py) script, given a Llama 2 model provided via the `--checkpoint_dir` argument in Lit-GPT. You can also use a custom dataset, given that you prepare...
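For example, an invocation could look roughly like this (the checkpoint path is just a hypothetical placeholder; `--checkpoint_dir` is the argument mentioned above):

```bash
# Placeholder checkpoint directory; point this at your converted Llama 2 weights
python finetune/full.py --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf
```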
In general, if you start a new Python session, does

```python
import torch
print(torch.cuda.is_available())
```

show `True`?
I don't have a good explanation, but maybe you accidentally set `devices = 1` here?
There might be a SLURM (not Lit-LLaMA-specific) problem with requesting the GPUs. You could add the following PyTorch code at the top to see if the machine indeed has multiple GPUs...
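A minimal sketch of such a check:

```python
import torch

# How many GPUs does PyTorch actually see on the allocated node?
print(torch.cuda.is_available())   # True if at least one GPU is usable
print(torch.cuda.device_count())   # should match the number of GPUs you requested
```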
Hm, I definitely remember training it ... could you try the following and see if it works?

```python
micro_batch_size = 2
```

or

```python
micro_batch_size = 1
```
It may or may not be related, but are you using `--precision 16-true`? I noticed that it results in NaNs when training some models. If your GPU supports...