Carlos Mocholí

Results: 427 comments by Carlos Mocholí

@AngainorDev Did you update the token count to account for the fact that it's batched? This line currently computes the length of a single sequence (T): https://github.com/Lightning-AI/lit-llama/blob/main/generate.py#LL150. With batched generation...
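
As an illustration of the adjustment (a sketch, not the repo's actual code), where `y`, `prompt_length`, and the timer are hypothetical stand-ins for the variables in `generate.py`:

```python
import time

import torch

# Hypothetical stand-ins: `y` is the batched output of generation with shape
# (batch_size, total_length); `prompt_length` is the shared prompt length.
t0 = time.perf_counter()
y = torch.randint(0, 50_000, (4, 128))
prompt_length = 32

batch_size, total_length = y.shape
# With batching, every sequence in the batch produced new tokens, so the
# count is (T - T0) * B instead of just (T - T0) for a single sequence.
tokens_generated = (total_length - prompt_length) * batch_size
elapsed = time.perf_counter() - t0
print(f"{tokens_generated / elapsed:.2f} tokens/sec")
```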

We also don't have batched inference implemented for LLaMA. If this were a problem with Falcon's specific architecture, you could still check using StableLM or Pythia weights. One...

For reference, GitHub does not offer MPS runners yet (tracked in https://github.com/github/roadmap/issues/528). We would need to self-host one.

This is common for instruction/chat tuned models. For instance, `StableLM` uses [`<|SYSTEM|>`, `<|USER|>`, `<|ASSISTANT|>`](https://github.com/Lightning-AI/lit-parrot/blob/main/chat.py#L173-L179); RedPajama-INCITE uses [`<human>:`, `<bot>:`](https://github.com/Lightning-AI/lit-parrot/blob/513c3939f6236277a76090428e13fe623fabd075/chat.py#L188) or [`Q:`, `A:`](https://github.com/Lightning-AI/lit-parrot/blob/513c3939f6236277a76090428e13fe623fabd075/chat.py#L198)

> The examples above from StableLM are in-vocab sequences and not separate special tokens, correct?

They are separate special tokens; you can check by Ctrl+F'ing them in the pre-trained tokenizer...
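
One way to do that check programmatically (a sketch, assuming the Hugging Face `transformers` tokenizer for `stabilityai/stablelm-tuned-alpha-3b` rather than the repo's own `Tokenizer` wrapper):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-tuned-alpha-3b")

for token in ("<|SYSTEM|>", "<|USER|>", "<|ASSISTANT|>"):
    ids = tokenizer.encode(token, add_special_tokens=False)
    # A special token maps to a single id and appears in the vocabulary,
    # whereas an in-vocab phrase would be split into several sub-word ids.
    print(token, ids, token in tokenizer.get_vocab())
```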

It might be caused by https://github.com/TimDettmers/bitsandbytes/issues/544: `pip install scipy` should fix it in that case

To run larger models split across devices, the generation script needs to add support for a technique like FSDP. We'll be implementing this soon. In the meantime, you can use quantization to...
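
For context, this is roughly what sharding a model across devices with a technique like FSDP looks like through Lightning Fabric. It is only a sketch with a stand-in `torch.nn.Transformer` and arbitrary sizes, not the implementation planned for the repo, and it assumes at least 2 GPUs:

```python
import torch
from lightning.fabric import Fabric
from lightning.fabric.strategies import FSDPStrategy

# Shard the (stand-in) model's parameters across 2 devices so that no single
# device needs to hold the full set of weights.
fabric = Fabric(devices=2, strategy=FSDPStrategy())
fabric.launch()

model = torch.nn.Transformer(d_model=256, nhead=4)  # stand-in for the real LLM
model = fabric.setup(model)

with torch.no_grad():
    src = torch.randn(16, 2, 256, device=fabric.device)
    tgt = torch.randn(16, 2, 256, device=fabric.device)
    out = model(src, tgt)
print(out.shape)
```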

Yes, those are ways to reduce the memory requirement. I will also land a fix soon that re-enables flash attention: https://github.com/Lightning-AI/lit-parrot/pull/171
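
For reference, the memory saving comes from PyTorch 2.0's fused attention kernels. A minimal standalone illustration with arbitrary shapes (not the repo's model code):

```python
import torch
import torch.nn.functional as F

# Arbitrary shapes: (batch, heads, sequence_length, head_dim)
q = k = v = torch.randn(1, 8, 128, 64)
# On CUDA with fp16/bf16 inputs this call can dispatch to the FlashAttention
# kernel, which avoids materializing the full (T, T) attention matrix.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)
```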

Yes. Would you like to port the changes from that PR here? Otherwise, I can do it.