LaaZa
Also, I would like to add that the checkbox is too much, I think. I would hazard a guess that there are far fewer people "inconvenienced" by having to press...
> Can't we have both. So a check box if you want the model to load without intervention and a load button if you disable the check box. If ooba...
@oobabooga I guess that works. I would prefer it to be off by default, though, and I mean without changing the settings; I think the manual loading is more beneficial...
I happened to test this when I made a small sharded model to test the loading. Seemed to work fine, but I didn't do any comprehensive testing.
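For reference, a small sharded checkpoint for this kind of test can be made with the `max_shard_size` option in transformers (just a sketch; the tiny model used here is only an example):

```python
from transformers import AutoModelForCausalLM

# Any tiny checkpoint works; this one is just a convenient example.
model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")

# Pick a max_shard_size smaller than the checkpoint so several
# shard files (plus an index) actually get written.
model.save_pretrained("tiny-sharded-test", max_shard_size="200KB")

# Reloading should transparently stitch the shards back together.
reloaded = AutoModelForCausalLM.from_pretrained("tiny-sharded-test")
```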
Llama 3 is a new model, so maybe fused attention does not work with it. It can be enabled because the model identifies itself as `llama`.
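By "identifies itself as `llama`" I just mean the `model_type` field in the model's `config.json`; roughly this kind of check (a sketch, not the actual loader code, and the path is a placeholder):

```python
from transformers import AutoConfig

# Placeholder path; any Llama 3 GPTQ checkpoint would do.
config = AutoConfig.from_pretrained("path/to/llama-3-gptq-model")

# Llama 3 reports the same architecture family as Llama 1/2, so any
# "is this llama?" gate also matches it and fused attention can be
# toggled on even if it was never tested against Llama 3.
if config.model_type == "llama":
    print("fused attention would be offered for this model")
```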
The PR does not fix that issue; someone just mentions it there. I don't know if it is supposed to be fixed elsewhere. Only the 70B used GQA, so it did not...
I'm trying to see if I can get it updated and working.
Okay, so the situation is the following: fused attention does not seem to work at all due to transformers changes, especially the cache. It might work without the cache, but...
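If anyone wants to verify the "without cache" part, generation can be run with the KV cache disabled; a rough sketch (the model path is a placeholder, and this is only useful for debugging because it is very slow):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/quantized-llama"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Hello", return_tensors="pt")

# use_cache=False skips the KV cache entirely, which takes the
# transformers cache changes out of the equation at the cost of
# recomputing attention over the whole sequence every step.
output = model.generate(**inputs, max_new_tokens=32, use_cache=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```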
It's not that it would be worse, but it's a lot more hassle for no real benefit. Also, all of the kernels are focused on 4-bit. You can just load a...
There is no specific feature to edit the UI apart from the different modes (`chat`, `notebook` and `default`). You have to edit the UI code manually in the Python files,...
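The UI is plain Gradio defined in those Python sources, so "editing the UI" means changing the files directly; a minimal, generic sketch of what that looks like (not the webui's actual layout code):

```python
import gradio as gr

# Components and their wiring are declared in code; there is no
# separate UI editor or layout file.
with gr.Blocks() as demo:
    prompt = gr.Textbox(label="Prompt")
    output = gr.Textbox(label="Output")
    send = gr.Button("Generate")

    # Behaviour is added by connecting events to Python functions.
    send.click(lambda text: text.upper(), inputs=prompt, outputs=output)

demo.launch()
```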