Sebastian Raschka
Thanks! In this case, I'd say it's a feature, not a bug!
Hi there, unfortunately there is no one-size-fits-all solution. Often, the biggest improvement comes from improving the dataset (collecting more samples, cleaning the data, etc.). Then, algorithm selection and...
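To make the data-centric point a bit more concrete, here is a minimal sketch of the kind of checks I'd start with before touching the algorithm (the file and column names below are placeholders, not from any specific project):

```python
import pandas as pd

# Hypothetical dataset; "train.csv" and "label" are placeholder names.
df = pd.read_csv("train.csv")

# 1) Remove exact duplicates, which can inflate validation scores.
df = df.drop_duplicates()

# 2) Drop rows with missing labels (features can be imputed separately).
df = df.dropna(subset=["label"])

# 3) Check class balance before worrying about model choice.
print(df["label"].value_counts(normalize=True))
```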
Could you provide concrete code snippets and file paths (and studio names) to illustrate this, @sanyalsunny111? That would give @tchaton a concrete example to follow.
Unfortunately, I currently don't have the capacity to work on this, but if someone wants to work on it, PRs are very welcome!
Thanks for the note, and good point; I didn't know about this. One challenge I see with configuring it in the config file is that it's used at model-creation time...
Upon reading a bit more, this would only be required for training (due to the optimizer choice). I added it in #1770.
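To illustrate the general pattern (with hypothetical names; this is not litgpt's actual config schema or the setting from #1770), a training-only value like this can live next to the optimizer setup instead of in the model-creation config:

```python
from dataclasses import dataclass

import torch

# Hypothetical split between model and training settings, for
# illustration only.

@dataclass
class ModelConfig:
    n_layer: int = 12
    n_embd: int = 768          # needed when the model is created

@dataclass
class TrainConfig:
    learning_rate: float = 3e-4
    weight_decay: float = 0.1  # only the optimizer ever reads this

def configure_optimizer(model: torch.nn.Module, cfg: TrainConfig):
    # The training-only setting never touches model creation, so it
    # belongs in the training section of the config.
    return torch.optim.AdamW(
        model.parameters(), lr=cfg.learning_rate, weight_decay=cfg.weight_decay
    )
```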
Nice summary. I think this touches all the main points. The others (knowledge distillation for the small models, tied embeddings) would not affect the architecture; it's more of a pretraining...
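For illustration, here is a minimal sketch of what tied embeddings look like in code (a hypothetical module, not any specific model's implementation): the output head simply reuses the input embedding matrix, so the layer structure itself is unchanged.

```python
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int = 32000, n_embd: int = 768):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        # Weight tying: the output projection shares the input
        # embedding matrix, so no new parameters are introduced.
        self.lm_head.weight = self.tok_emb.weight
```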
> @Andrei-Aksionov [Sliding window attention (an ugly one, but hey, it works)](https://github.com/Lightning-AI/litgpt/pull/1545/commits/889049df4885cfdfd892ea8f54fa22d3456e5a44)

Cool! We can also add that to the existing Mistral/Mixtral models then 😊
> I believe only Mistral v0.1 supported sliding window attention, all the subsequent models by Mistral.ai don't use it.

I think you are right.

> But after this PR is...
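For context, here is a minimal sketch of the masking idea behind sliding window attention (my own illustration, not the implementation from that PR): each query attends only to itself and the most recent `window - 1` keys instead of the full causal prefix.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: True where query position i may attend to key j."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions (column)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions (row)
    # Causal (j <= i) and within the local window (j > i - window).
    return (j <= i) & (j > i - window)

# Each token sees at most `window` positions, ending at itself.
print(sliding_window_causal_mask(6, 3).int())
```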
> One more thing. Due to time constraints, I didn't test Gemma v2 27b version. Tests are running fine, but it would be nice to check the generated output. @rasbt...