Carlos Mocholí


It's a well-known trick for training with variable-length sequences. It can sometimes impact the loss because it can break the i.i.d. property of ML training, depending on your data....
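A minimal sketch of one such trick, assuming the comment refers to length-bucketed batching (all names here are hypothetical, not from lit-llama): sorting sequences by length before batching reduces padding, but it also makes each batch length-correlated, which is how the i.i.d. assumption can be weakened when length correlates with content.

```python
def bucket_by_length(sequences, batch_size):
    """Group sequences into batches of similar length to minimize padding."""
    ordered = sorted(sequences, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def padding_tokens(batches, pad=0):
    """Count pad tokens needed when each batch is padded to its longest item."""
    total = 0
    for batch in batches:
        longest = max(len(seq) for seq in batch)
        total += sum(longest - len(seq) for seq in batch)
    return total

# Toy corpus: six sequences of lengths 3, 9, 4, 10, 2, 8.
seqs = [[1] * n for n in (3, 9, 4, 10, 2, 8)]
naive = [seqs[i:i + 2] for i in range(0, len(seqs), 2)]
print(padding_tokens(naive))                      # arrival-order batches: 18 pad tokens
print(padding_tokens(bucket_by_length(seqs, 2)))  # length-sorted batches: 6 pad tokens
```

The padding saving is the upside; the downside is that batches are no longer random draws from the dataset.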

Linked issue request for fine-tuning: https://github.com/Lightning-AI/lit-llama/issues/180

It's complaining about a missing comma in the JSON file you are loading. Where did you get this file from? Have you tried downloading it again?

Hi @mzchtx. What changes are you proposing precisely? `k, v` should already be sliced to the length of `input_pos` with https://github.com/Lightning-AI/lit-llama/blob/main/lit_llama/model.py#L217-L218
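To illustrate the point, here is a minimal sketch of incremental kv-cache decoding (hypothetical names, plain Python lists standing in for tensors; this is not the lit-llama code at the linked lines): the cache is preallocated up to a maximum sequence length, the new key/value is written at `input_pos`, and attention only ever reads the filled prefix, so `k, v` are already limited to the positions seen so far.

```python
MAX_SEQ_LEN = 8

def make_cache():
    # One slot per position; None marks an unfilled slot.
    return {"k": [None] * MAX_SEQ_LEN, "v": [None] * MAX_SEQ_LEN}

def update_cache(cache, input_pos, k_new, v_new):
    """Write the new key/value at input_pos and return the valid prefix."""
    cache["k"][input_pos] = k_new
    cache["v"][input_pos] = v_new
    # Attention should only see positions [0, input_pos]; slicing here means
    # downstream code never attends over unfilled cache slots.
    return cache["k"][: input_pos + 1], cache["v"][: input_pos + 1]

cache = make_cache()
for pos in range(3):
    k, v = update_cache(cache, pos, f"k{pos}", f"v{pos}")
print(k)  # ['k0', 'k1', 'k2']
```

Under this assumption, an extra slice elsewhere would be redundant, which is why the question is what the proposed change adds.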

@mzchtx Would you like to open a PR with your suggested changes?

@mzchtx The code would need to be indented under the `if` as before, since this is only relevant for the kv-cache case. Leaving #382 aside, I believe the code should...

From playing with this, the generated outputs are not the same, meaning that this is not numerically equivalent. However, it's hard to tell if they are worse or just different....

I stumbled upon this issue: https://github.com/pytorch/pytorch/issues/103082, it might explain the numerical difference.

@gkroiz Could this change be detrimental to XLA's performance?

@mzchtx Did you measure the performance difference? Would you like to open a PR with your suggestion?