Carlos Mocholí


Can you share the complete error stacktrace? Are you using `torch==2.0`?

Can you share the output of `pip list | grep torch` and `python -c 'import torch; print(torch.__version__)'`? You might have a non-release version that doesn't include that file. Reinstalling torch by...

Another option would be a conversion to HF format (already requested in https://github.com/Lightning-AI/lit-llama/issues/150) since the `ggml` conversion supports it already: https://github.com/ggerganov/llama.cpp/blob/ac7876ac20124a15a44fd6317721ff1aa2538806/convert.py#L594

The format is defined by the nn.Module definition. Since we provide our own implementation, the keys are different.
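To make this concrete, here is a minimal toy sketch (the module and attribute names are illustrative, not the actual Lit-LLaMA or HF classes) showing that the checkpoint keys come straight from the attribute names in the `nn.Module` definition, so two implementations of the same architecture produce different keys:

```python
import torch.nn as nn

# Hypothetical minimal modules: state_dict keys mirror the attribute names
# used in the nn.Module definition, not any external checkpoint format.
class TheirAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(8, 8)  # attribute named "q_proj" in one implementation

class OurAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.Linear(8, 8)  # a differently named attribute

print(list(TheirAttention().state_dict()))  # ['q_proj.weight', 'q_proj.bias']
print(list(OurAttention().state_dict()))    # ['attn.weight', 'attn.bias']
```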

This has been fixed in lit-gpt: https://github.com/Lightning-AI/lit-gpt

I implemented one in https://github.com/Lightning-AI/lit-stablelm/blob/main/chat.py. It could be copied over to this repository.

@timothylimyl Lit-Parrot supports this via FSDP, added in https://github.com/Lightning-AI/lit-parrot/commit/248d691f06d68c7e92d3230260eda0055f7dc163. Support for this could be easily ported to Lit-LLaMA.

Yes, but it would be better if you or somebody else from the community worked on the port. The sharding is configured via the `auto_wrap_policy` function used in the commit...
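A rough sketch of what the port could look like, assuming the transformer block class is `lit_llama.model.Block` and using Lightning Fabric's `FSDPStrategy` (this is an illustration, not the code from the linked commit):

```python
from functools import partial

import lightning as L
from lightning.fabric.strategies import FSDPStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

from lit_llama.model import Block  # assumed import path for the transformer block

# Wrap each transformer Block in its own FSDP unit so its parameters are sharded.
auto_wrap_policy = partial(transformer_auto_wrap_policy, transformer_layer_cls={Block})
strategy = FSDPStrategy(auto_wrap_policy=auto_wrap_policy)

fabric = L.Fabric(devices=4, strategy=strategy, precision="bf16-mixed")
fabric.launch()
```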

You can call `reset_cache` after generation. Lit-GPT does it: https://github.com/Lightning-AI/lit-gpt/blob/main/generate/base.py#L180
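Roughly, the pattern looks like this (the `generate()` helper and its arguments are illustrative; see the linked file for the reference usage):

```python
# Generate with the KV cache enabled, then clear it so the next prompt
# can run with a different sequence length.
output = generate(model, encoded_prompt, max_new_tokens=50)  # hypothetical helper
model.reset_cache()  # clears the cached key/value buffers
```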

You can read about the KV cache here: https://kipp.ly/transformer-inference-arithmetic/. Its size depends on the sequence length, so if that changes, the cache needs to be reset. When you do inference with a...
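To make the dependence on sequence length concrete, here is a toy sketch (not the Lit-GPT implementation) of a cache buffer whose shape is tied to the maximum sequence length, which is why running with a different length means allocating a fresh cache:

```python
import torch

batch, n_heads, head_dim = 1, 8, 64

def build_kv_cache(max_seq_length: int):
    # One key buffer and one value buffer per attention layer; their third
    # dimension is the maximum sequence length, so a new length needs a new cache.
    shape = (batch, n_heads, max_seq_length, head_dim)
    return torch.zeros(shape), torch.zeros(shape)

k_cache, v_cache = build_kv_cache(max_seq_length=128)
print(k_cache.shape)  # torch.Size([1, 8, 128, 64])
```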