Carlos Mocholí
> Thanks, that seems to work. Should I make a merge request on the documentation to clarify that?

If you think that would be helpful for others, then go for...
I'm replacing DeepSpeed with FSDP in #118. Feel free to try it out and see if it helps before the PR is merged.
There are two ways to do this: either write the opposite operations of https://github.com/Lightning-AI/lit-gpt/blob/main/scripts/convert_hf_checkpoint.py#L19-L169 for each of the HuggingFace classes, or create an HF Transformers model version of `lit_gpt.model`. The former...
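For the first option, the core idea is to invert the weight-name mapping and rename the keys in the saved state dict. A minimal sketch with hypothetical key names standing in for the real per-architecture mapping (which also has to undo any fused/split weights):

```python
import torch

# Hypothetical 1:1 renames; the real mapping in convert_hf_checkpoint.py is
# per-architecture and also splits/merges fused weights, which this skips.
hf_to_lit = {
    "transformer.word_embeddings.weight": "transformer.wte.weight",
    "transformer.ln_f.weight": "transformer.ln_f.weight",
    "lm_head.weight": "lm_head.weight",
}
lit_to_hf = {v: k for k, v in hf_to_lit.items()}

lit_state_dict = torch.load("lit_model.pth", map_location="cpu")
hf_state_dict = {lit_to_hf.get(k, k): v for k, v in lit_state_dict.items()}
torch.save(hf_state_dict, "pytorch_model.bin")
```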
Try passing `--precision bf16-mixed` or `--precision 16-mixed`. I just switched the default in #175.
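If it helps, here is roughly what those strings mean when setting up Fabric (a sketch, not the exact code in the script):

```python
from lightning.fabric import Fabric

# "bf16-mixed" keeps the weights in fp32 and runs selected ops in bfloat16;
# "16-mixed" does the same with float16 plus gradient scaling.
# "16-true"/"bf16-true" would instead cast the weights themselves to 16 bits.
fabric = Fabric(devices=1, precision="bf16-mixed")
fabric.launch()
```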
Oh yes, you're right. We multiply this number by the world size, so we don't want the number of cores: https://github.com/Lightning-AI/lit-gpt/blob/main/lit_parrot/speed_monitor.py#L223
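To illustrate the arithmetic (hypothetical numbers; the names below are not the real ones in `speed_monitor.py`):

```python
# The lookup should hold the peak FLOPS of one whole device (GPU/TPU chip),
# because the monitor already multiplies that value by the world size.
peak_flops_per_device = 312e12   # e.g. A100 bf16 peak, per GPU, not per core
world_size = 8                   # number of devices in the run
achieved_flops_per_sec = 1.0e15  # hypothetical measured training throughput

available_flops = peak_flops_per_device * world_size
mfu = achieved_flops_per_sec / available_flops  # model FLOPS utilization
print(f"MFU: {mfu:.1%}")
```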
Did you try reducing your `micro_batch_size`? We have a guide for OOMs in https://github.com/Lightning-AI/lit-gpt/blob/main/howto/oom.md. Running `adapter.py` with current main, falcon-7b, precision=16-true, and micro_batch_size=1 should use 22.69 GB of max allocated memory.
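The usual pattern is to lower `micro_batch_size` and let gradient accumulation keep the effective batch size unchanged. A sketch of how those knobs relate (variable names approximate; see the hyperparameters at the top of the finetuning scripts):

```python
# Hyperparameters near the top of the script (names approximate).
batch_size = 64        # effective batch size per optimizer step
micro_batch_size = 1   # lower this first when you hit CUDA OOM

# More accumulation steps trade speed for memory; the math stays the same.
gradient_accumulation_iters = batch_size // micro_batch_size
assert batch_size % micro_batch_size == 0
```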
Just the 7B model (no training, etc.) requires 29 GB with mixed precision and 14.5 GB with true half precision. See the math in https://github.com/Lightning-AI/lit-gpt/issues/159#issuecomment-1599820686
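Roughly, those numbers are just the parameter count times the bytes per parameter (a back-of-envelope sketch; the exact count depends on the checkpoint):

```python
n_params = 7.2e9  # approximate parameter count for a 7B model

bytes_mixed = n_params * 4  # mixed precision keeps the weights in fp32 (4 bytes each)
bytes_half = n_params * 2   # true half precision stores them in 16 bits (2 bytes each)

print(f"mixed precision: {bytes_mixed / 1e9:.1f} GB")  # ~28.8 GB
print(f"true half:       {bytes_half / 1e9:.1f} GB")   # ~14.4 GB
```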
NaNs are likely to occur with 16-true precision: https://github.com/Lightning-AI/lit-gpt/issues/291#issuecomment-1645396074
The most recent updates removed the use of this config file. Did you pull `main`?
Did you pull the latest changes? What script did you run, what arguments did you pass? Did you make any changes to the script?