Carlos Mocholí
Can you share the complete error stacktrace? Are you using `torch==2.0`?
Can you share the output of `pip list | grep torch` and `python -c 'import torch; print(torch.__version__)'`? You might have a non-release version that doesn't include that file. Reinstalling torch by...
Another option would be a conversion to HF format (already requested in https://github.com/Lightning-AI/lit-llama/issues/150) since the `ggml` conversion supports it already: https://github.com/ggerganov/llama.cpp/blob/ac7876ac20124a15a44fd6317721ff1aa2538806/convert.py#L594
The format is defined by the nn.Module definition. Since we provide our own implementation, the keys are different.
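As a minimal illustration (these class and attribute names are made up, not the actual Lit-LLaMA or Hugging Face definitions), the `state_dict` keys follow directly from the attribute names in the `nn.Module` definition, so two implementations of the same architecture produce checkpoints with different keys:

```python
import torch.nn as nn

# hypothetical "HF-style" block: its weight key becomes "layers.0.self_attn.weight"
class HFStyleBlock(nn.Module):
    def __init__(self, n_embd: int = 8):
        super().__init__()
        self.self_attn = nn.Linear(n_embd, 3 * n_embd, bias=False)

class HFStyleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([HFStyleBlock()])

# hypothetical "Lit-style" block holding the same weight: key becomes "transformer.h.0.attn.weight"
class LitStyleBlock(nn.Module):
    def __init__(self, n_embd: int = 8):
        super().__init__()
        self.attn = nn.Linear(n_embd, 3 * n_embd, bias=False)

class LitStyleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.transformer = nn.ModuleDict({"h": nn.ModuleList([LitStyleBlock()])})

print(list(HFStyleModel().state_dict()))   # ['layers.0.self_attn.weight']
print(list(LitStyleModel().state_dict()))  # ['transformer.h.0.attn.weight']
```

That mismatch is why a conversion script is needed to map one set of keys onto the other.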
This has been fixed in lit-gpt: https://github.com/Lightning-AI/lit-gpt
I implemented one in https://github.com/Lightning-AI/lit-stablelm/blob/main/chat.py. It could be copied over to this repository.
@timothylimyl Lit-Parrot supports this via FSDP, added in https://github.com/Lightning-AI/lit-parrot/commit/248d691f06d68c7e92d3230260eda0055f7dc163. Support for this could be easily ported to Lit-LLaMA.
Yes, but it would be better if you or somebody else from the community worked on the port. The sharding is configured via the `auto_wrap_policy` function used in the commit...
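For reference, a minimal sketch of what configuring that policy could look like with Fabric, loosely following the Lit-Parrot commit (the exact arguments, the precision setting, and the `Block` import are assumptions here, not copied from that commit):

```python
from functools import partial

from lightning import Fabric
from lightning.fabric.strategies import FSDPStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

from lit_llama.model import Block  # the transformer block class to shard (assumed import path)

# shard the model at the granularity of each transformer Block
auto_wrap_policy = partial(transformer_auto_wrap_policy, transformer_layer_cls={Block})
strategy = FSDPStrategy(auto_wrap_policy=auto_wrap_policy)

fabric = Fabric(devices=4, precision="bf16-mixed", strategy=strategy)
fabric.launch()
# model = fabric.setup_module(model)  # parameters get sharded across the 4 devices
```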
You can call `reset_cache` after generation. Lit-GPT does it: https://github.com/Lightning-AI/lit-gpt/blob/main/generate/base.py#L180
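In context it looks roughly like this (the `generate` call is a placeholder for the repository's generation function):

```python
# run generation with the KV cache enabled ...
output = generate(model, encoded_prompt, max_returned_tokens=100)

# ... then clear the cache so the next call (e.g. with a different sequence length)
# does not reuse stale buffers
model.reset_cache()
```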
You can read about the KV cache here: https://kipp.ly/transformer-inference-arithmetic/ It depends on the sequence length, so if it changes it needs to be reset. When you do inference with a...
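To make the dependence on the sequence length concrete, here is a toy sketch of a KV cache (illustrative only, not the Lit-GPT implementation): the key/value buffers are preallocated for a fixed maximum sequence length, so changing that length means the buffers must be recreated, i.e. the cache must be reset.

```python
import torch

n_head, head_size, max_seq_length = 8, 64, 256

# buffers sized for a fixed maximum sequence length
k_cache = torch.zeros(1, n_head, max_seq_length, head_size)
v_cache = torch.zeros(1, n_head, max_seq_length, head_size)

def update_cache(k_new: torch.Tensor, v_new: torch.Tensor, input_pos: torch.Tensor):
    """Write the keys/values for the newly processed positions into the cache."""
    k_cache.index_copy_(2, input_pos, k_new)
    v_cache.index_copy_(2, input_pos, v_new)
    return k_cache, v_cache

# generating with a maximum sequence length longer than 256 would overflow these
# buffers, so they have to be re-created (reset) whenever that length changes
```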