Carlos Mocholí

Adrian suggests doing this together with a Studio that includes the pretokenized data.

Which tokenizer config from Hugging Face are you trying to load?

When you finetune, you load existing Hugging Face Hub weights and tokenizer. LitGPT then copies the tokenizer into your finetuned output directory so that it can be loaded in subsequent steps. Did...
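
For illustration, a minimal sketch of what that copy step amounts to (the `copy_tokenizer` helper and the exact file names are assumptions, not LitGPT's actual implementation):

```python
import shutil
from pathlib import Path


def copy_tokenizer(checkpoint_dir: Path, out_dir: Path) -> None:
    """Copy tokenizer files from the base checkpoint into the finetuned output
    directory so that later steps (generate, evaluate, convert) can load them."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for name in ("tokenizer.json", "tokenizer.model", "tokenizer_config.json"):
        src = checkpoint_dir / name
        if src.is_file():
            shutil.copy(src, out_dir / name)
```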

Just saw your last message. It looks like it's being treated as an HF tokenizer instead of a SentencePiece tokenizer, so this line must be resolving to `False`: https://github.com/Lightning-AI/litgpt/blob/main/litgpt/tokenizer.py#L21
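
Roughly, that backend selection boils down to a file check along these lines (a simplified sketch, not the exact code at the link above):

```python
from pathlib import Path


def detect_tokenizer_backend(checkpoint_dir: Path) -> str:
    """Pick the tokenizer backend based on which files the checkpoint ships."""
    if (checkpoint_dir / "tokenizer.model").is_file():
        return "sentencepiece"  # SentencePiece model file present
    if (checkpoint_dir / "tokenizer.json").is_file():
        return "huggingface"    # fall back to the HF `tokenizers` JSON file
    raise NotImplementedError(f"No supported tokenizer file found in {checkpoint_dir}")
```

If the SentencePiece branch is being skipped, the checkpoint directory most likely doesn't contain a `tokenizer.model` file.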

Which `--checkpoint_dir` did you use with LoRA? I can try to follow the same steps you did to see if I end up with the same error.

Sure. It will need to be supported in Lightning first, though. @awaelchli already had a look, but there are some technical limitations to overcome.

> I'm using just one gpu, so I'm initializing the fabric object with

In this case, `empty_init=False` is used: https://github.com/Lightning-AI/litgpt/blob/main/litgpt/finetune/lora.py#L170, so initialization should be happening normally.
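
As a point of reference, a minimal sketch of how `empty_init` is passed to Fabric's `init_module` in a single-GPU setup (the model here is just a placeholder, not the LitGPT LoRA model):

```python
import torch
from lightning.fabric import Fabric

fabric = Fabric(devices=1)  # single GPU, as in the report above
fabric.launch()

# With a single process, empty_init=False materializes the weights and
# initializes them normally instead of creating them on the meta device.
with fabric.init_module(empty_init=False):
    model = torch.nn.Linear(4096, 4096)  # placeholder for the actual model
```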

After a brief skim of the HF implementation, I don't see any blockers to supporting it. Contributions are welcome!

cc @awaelchli, if you'd like to answer

> if a document, article, instruction/output pair exceeds the max sequence length, how is it treated?

Depends on the data preparation, but our scripts trim it (see the sketch after this comment): https://github.com/Lightning-AI/lit-gpt/blob/0791c52a944f022a5cee91ed1e47288830efb72c/scripts/prepare_alpaca.py#L116-L117

> What about...
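
Regarding the trimming mentioned above, a rough sketch of what it amounts to (the `tokenize_and_trim` helper and its signature are assumptions for illustration, not the linked script verbatim):

```python
def tokenize_and_trim(tokenizer, text: str, max_seq_length: int):
    """Tokenize a sample and drop any tokens beyond the maximum sequence length."""
    ids = tokenizer.encode(text)   # token ids for the full prompt + response
    return ids[:max_seq_length]    # overly long samples are simply truncated
```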