Dillon Roach
I'm going to take some time and give this and #322 a whirl - just wanted to put my name in the hat so I don't duplicate effort from somebody...
@marcelotrevisani - for the case where the TOML version pins don't match the previously built dicts, what's your preferred way to resolve the conflict? Take the previous/new info as truth and overwrite? Overwrite with...
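To make the question concrete, here is one possible resolution strategy sketched as a hypothetical helper (the function name and dict shapes are assumptions for illustration, not grayskull's actual code): merge the two pin sources and let a flag decide which side wins on conflict.

```python
def merge_pins(previous: dict, toml: dict, prefer: str = "toml") -> dict:
    """Merge dependency pins from two sources, resolving conflicts by preference.

    `previous` holds pins from previously built dicts; `toml` holds pins
    parsed from pyproject.toml. When both pin the same package differently,
    `prefer` picks the winner.
    """
    merged = dict(previous)
    for pkg, pin in toml.items():
        if pkg in merged and merged[pkg] != pin and prefer == "previous":
            continue  # keep the previously built pin
        merged[pkg] = pin
    return merged
```

For example, `merge_pins({"numpy": ">=1.20"}, {"numpy": ">=1.24"})` would return the TOML pin by default, while `prefer="previous"` would keep the old one.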
For what it's worth, https://github.com/qwopqwop200/GPTQ-for-LLaMa and https://github.com/PanQiWei/AutoGPTQ seem to be the most common mentions from folks posting quantized models on huggingface lately - the latter more just for general use....
At the same time, I'd be happy to get this added to conda-forge so it's available there. One thing that could help for both - if you could tag a...
@kanttouchthis you're asking the TTS to do a lot of extra work it doesn't need to do on every call by going through tts.tts_to_file(). Here's a short-hand reference implementation of...
@kanttouchthis yep, most of the big speed difference is from deepspeed; the other, smaller chunk is likely from recomputing the latents and embeddings on each 'clone',...
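The caching idea behind that second speed-up can be sketched in isolation. This is a stand-in, not the actual Coqui TTS API: `speaker_conditioning` plays the role of the expensive latent/embedding computation, and `lru_cache` ensures repeated "clone" requests against the same reference wav skip the recompute entirely.

```python
import functools
import hashlib

@functools.lru_cache(maxsize=8)
def speaker_conditioning(wav_path: str) -> tuple:
    # Expensive step: in a real XTTS-style pipeline this is where the
    # reference audio would be loaded and the conditioning latents plus
    # speaker embedding computed. Stand-in: a deterministic pseudo-embedding
    # derived from the path, so the example is runnable anywhere.
    digest = hashlib.sha256(wav_path.encode()).digest()
    return tuple(b / 255 for b in digest[:8])

def synthesize(text: str, wav_path: str) -> str:
    # The conditioning is cached after the first call for a given reference
    # wav; only the per-utterance synthesis runs on later calls.
    embedding = speaker_conditioning(wav_path)
    return f"audio({text!r}, cond={embedding[0]:.3f})"
```

After two calls with the same reference wav, `speaker_conditioning.cache_info()` shows one miss and one hit, i.e. the expensive step ran once.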
I'll just keep my comments at the higher LLM-interaction level, as I have less opinion about how ragna should do it specifically, but with that said:
- flexibility should be...
WIP: https://github.com/Quansight/ragna/pull/432 Currently suggests adding chat.generate(), which calls Assistant.generate(), as an equivalent to answer() that returns no Message and does no sources/logging. This way an Assistant.answer() might call a preprocess routine,...
Comments make sense to me; Re streaming: I'm not against the idea, but what's the use-case for streaming specifically on the generate() part of the API? Most use-cases will be...
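For reference, a streaming generate() could be layered so the non-streaming call is just the joined stream. Again a hypothetical sketch, not ragna's API; a real assistant would yield chunks from the model as they arrive:

```python
from typing import Iterator

def generate_stream(prompt: str) -> Iterator[str]:
    # Hypothetical streaming variant: yield completion chunks one at a
    # time instead of waiting for the full text. Stand-in body splits a
    # fixed echo string into tokens.
    for token in f"echo: {prompt}".split(" "):
        yield token

def generate(prompt: str) -> str:
    # The non-streaming call consumes the same stream and joins it, so
    # both entry points share one code path.
    return " ".join(generate_stream(prompt))
```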