Carlos Mocholí

428 comments of Carlos Mocholí

Bumping this! I see there are Python 3.9 Docker images published at https://gcr.io/tpu-pytorch/xla, so having wheels for them would be a nice next step

We can choose the precision based on whether deepspeed is used. I guess @awaelchli manually changed the precision value when trying out deepspeed in https://github.com/Lightning-AI/lit-llama/pull/128 (where this code originally comes...
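For illustration, a minimal sketch of that selection; the `use_deepspeed` flag and the exact precision strings here are assumptions, not the repo's actual API:

```python
def choose_precision(use_deepspeed: bool) -> str:
    # Hypothetical mapping: DeepSpeed typically runs with 16-bit mixed
    # precision, while the plain path can keep true bf16 weights.
    return "16-mixed" if use_deepspeed else "bf16-true"
```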

Instead of an `--interactive` flag, it would be better to add a `chat_adapter.py` script that supports it and streams the output. Since this adds quite a bit of logic, it's...
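A rough sketch of the loop such a script could run; `generate_stream`, `model`, and `tokenizer` are hypothetical stand-ins for the real lit-llama pieces, not its actual interface:

```python
from typing import Iterator

def generate_stream(model, prompt_ids) -> Iterator[int]:
    """Hypothetical stand-in for a streaming `generate()`: yields one new
    token id at a time instead of returning the full sequence at the end."""
    ...

def chat(model, tokenizer) -> None:
    # Keep reading prompts until the user quits, printing tokens as they arrive.
    while True:
        prompt = input(">> ")
        if prompt in ("quit", "exit"):
            break
        for token_id in generate_stream(model, tokenizer.encode(prompt)):
            print(tokenizer.decode(token_id), end="", flush=True)
        print()
```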

Adding

```python
roped = (x * cos) + (rotated * sin)
return roped.type_as(x)
```

fixes the error above, but the test still fails. Generation looks fine though:

```
pytest tests/test_model.py::test_model_bfloat16...
```
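For context, a small runnable illustration of why the `.type_as(x)` cast matters when the RoPE `cos`/`sin` buffers are float32 while `x` is bfloat16 (the shapes here are arbitrary):

```python
import torch

x = torch.randn(2, 4, dtype=torch.bfloat16)
cos = torch.randn(2, 4)  # float32, as a RoPE cache buffer might be

# Mixed bfloat16/float32 arithmetic promotes the result to float32...
print((x * cos).dtype)             # torch.float32
# ...so casting back keeps the activations in bfloat16.
print((x * cos).type_as(x).dtype)  # torch.bfloat16
```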

cc @t-vi, maybe you can catch this bug easily as you wrote this test originally

Can you share exactly what script you ran?

I don't think so. But you might need to tweak hyperparameters. This is the dark art of machine learning :wink:

Do you still see this behaviour, and if so, can you share exactly the code you ran and the arguments passed?

This is because LLaMA fine-tuning is hardcoded to a `max_seq_length` of `256`: https://github.com/Lightning-AI/lit-llama/blob/main/scripts/prepare_alpaca.py#L26, https://github.com/Lightning-AI/lit-llama/blob/main/finetune/adapter.py#L52. This repository, by contrast, is configured to use the longest sequence length in Alpaca: `1037`. If you override...
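As a hedged illustration of where a value like `1037` can come from, assuming each prepared sample exposes its tokenized `input_ids` (the helper name is made up):

```python
def longest_seq_length(samples: list) -> int:
    # Scan the prepared dataset for the longest tokenized example;
    # for Alpaca prepared as in this repo, that comes out to 1037.
    return max(len(sample["input_ids"]) for sample in samples)
```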

That is a good observation. I agree that we should remove the `max_seq_length` value passed to the model forward and separate it from the max_length used to split the dataset....
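A hypothetical sketch of that separation, with made-up names: the model's forward is bounded by its `block_size`, while the dataset split uses an independent cap:

```python
from dataclasses import dataclass

@dataclass
class FinetuneConfig:
    block_size: int = 2048       # hard limit the model's forward supports
    data_max_length: int = 1037  # independent cap used only to split/pad data

    def effective_data_length(self) -> int:
        # The data cap should never exceed what the model can consume.
        return min(self.data_max_length, self.block_size)
```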