Carlos Mocholí

Results: 427 comments by Carlos Mocholí

@AngainorDev Did you update the token count to account for the fact that it's batched? This line currently computes the length of a single sequence (T): https://github.com/Lightning-AI/lit-llama/blob/main/generate.py#LL150. With batched generation...
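
As an illustration of the adjustment (a sketch, not the repo's actual code), where `y`, `prompt_length`, and the timer are hypothetical stand-ins for the variables in `generate.py`:

```python
import time

import torch

# Hypothetical stand-ins: `y` is the batched output of generation with shape
# (batch_size, total_length); `prompt_length` is the shared prompt length.
t0 = time.perf_counter()
y = torch.randint(0, 50_000, (4, 128))
prompt_length = 32

batch_size, total_length = y.shape
# With batching, every sequence in the batch produced new tokens, so the
# count is (T - T0) * B instead of just (T - T0) for a single sequence.
tokens_generated = (total_length - prompt_length) * batch_size
elapsed = time.perf_counter() - t0
print(f"{tokens_generated / elapsed:.2f} tokens/sec")
```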

We also don't have batched inference implemented for LLaMA. If this were a problem with Falcon's specific architecture, you could still check using StableLM or Pythia weights. One...

For reference, GitHub does not offer MPS runners yet (tracked in https://github.com/github/roadmap/issues/528). We would need to self-host one.

This is common for instruction/chat tuned models. For instance, `StableLM` uses [`<|SYSTEM|>`, `<|USER|>`, `<|ASSISTANT|>`](https://github.com/Lightning-AI/lit-parrot/blob/main/chat.py#L173-L179); RedPajama-INCITE uses [`<human>:`, `<bot>:`](https://github.com/Lightning-AI/lit-parrot/blob/513c3939f6236277a76090428e13fe623fabd075/chat.py#L188) or [`Q:`, `A:`](https://github.com/Lightning-AI/lit-parrot/blob/513c3939f6236277a76090428e13fe623fabd075/chat.py#L198)

> The examples above from StableLM are in-vocab sequences and not separate special tokens, correct?

They are separate special tokens; you can check by Ctrl+F'ing them in the pre-trained tokenizer...
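
One way to do that check programmatically (a sketch, assuming the Hugging Face `transformers` tokenizer for `stabilityai/stablelm-tuned-alpha-3b` rather than the repo's own `Tokenizer` wrapper):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-tuned-alpha-3b")

for token in ("<|SYSTEM|>", "<|USER|>", "<|ASSISTANT|>"):
    ids = tokenizer.encode(token, add_special_tokens=False)
    # A special token maps to a single id and appears in the vocabulary,
    # whereas an in-vocab phrase would be split into several sub-word ids.
    print(token, ids, token in tokenizer.get_vocab())
```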

It might be caused by https://github.com/TimDettmers/bitsandbytes/issues/544: `pip install scipy` should fix it in that case

To run larger models split across devices, the generation script needs to add support for a technique like FSDP. We'll be implementing this soon. In the meantime, you can use quantization to...
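
For context, this is roughly what sharding a model across devices with a technique like FSDP looks like through Lightning Fabric. It is only a sketch with a stand-in `torch.nn.Transformer` and arbitrary sizes, not the implementation planned for the repo, and it assumes at least 2 GPUs:

```python
import torch
from lightning.fabric import Fabric
from lightning.fabric.strategies import FSDPStrategy

# Shard the (stand-in) model's parameters across 2 devices so that no single
# device needs to hold the full set of weights.
fabric = Fabric(devices=2, strategy=FSDPStrategy())
fabric.launch()

model = torch.nn.Transformer(d_model=256, nhead=4)  # stand-in for the real LLM
model = fabric.setup(model)

with torch.no_grad():
    src = torch.randn(16, 2, 256, device=fabric.device)
    tgt = torch.randn(16, 2, 256, device=fabric.device)
    out = model(src, tgt)
print(out.shape)
```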

Yes, those are ways to reduce the memory requirement. I will also land a fix soon that re-enables flash attention: https://github.com/Lightning-AI/lit-parrot/pull/171
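
For reference, the memory saving comes from PyTorch 2.0's fused attention kernels. A minimal standalone illustration with arbitrary shapes (not the repo's model code):

```python
import torch
import torch.nn.functional as F

# Arbitrary shapes: (batch, heads, sequence_length, head_dim)
q = k = v = torch.randn(1, 8, 128, 64)
# On CUDA with fp16/bf16 inputs this call can dispatch to the FlashAttention
# kernel, which avoids materializing the full (T, T) attention matrix.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)
```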

Yes. Would you like to port the changes from that PR here? Otherwise, I can do it.