Bendr Radrigues
The above was produced with this commit: https://github.com/barsuna/bloomz.cpp/commit/2d0e478c653d078554af0188c90c7081ff0b3059
@Dilip-17 there was the same question on another issue; I added some pointers there: https://github.com/assafelovic/gpt-researcher/issues/520. The challenge is mostly not how to run it, but having the GPU memory necessary to run...
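For a rough sense of the memory requirement mentioned here, a back-of-the-envelope estimate is parameter count times bytes per parameter, plus some overhead for KV cache and activations. A minimal sketch (the 20% overhead factor is an assumption, not a measured number):

```python
def estimate_vram_gb(params_billions: float, bits_per_param: int, overhead: float = 0.2) -> float:
    """Rough VRAM estimate: weights plus a fudge factor for KV cache/activations."""
    weight_gb = params_billions * 1e9 * bits_per_param / 8 / 1e9
    return weight_gb * (1 + overhead)

# llama3-70b at fp16 vs. 4-bit quantization
print(f"70B fp16 : ~{estimate_vram_gb(70, 16):.0f} GB")  # ~168 GB
print(f"70B 4-bit: ~{estimate_vram_gb(70, 4):.0f} GB")   # ~42 GB
```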
To its credit, llama3 worked pretty much out of the box with gpt-researcher (the only tweak needed was the prompt change above). It seems it is possible to stretch the context...
Thanks @ElishaKay, indeed the timeout happens during busy periods on the server side (generally during subtopic generation for me). The computer itself is nowhere near overloaded (CPU/mem/IO-wise) -...
Idk, there is still a risk of timeout (though a lesser one, perhaps?). I wonder if there are ways to control the length of the timeout on the gpt-researcher side....
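One way to experiment with this, assuming the LLM calls go through the standard openai Python client (which does accept a `timeout` argument), is to raise the client-side timeout when pointing at a slow local server. A sketch, not gpt-researcher's actual configuration:

```python
from openai import OpenAI

# Point the client at a local OpenAI-compatible server and allow long
# generations; 600s is an arbitrary choice, tune to taste.
client = OpenAI(
    base_url="http://localhost:1234/v1",  # lm-studio's default port
    api_key="not-needed",                 # local servers usually ignore the key
    timeout=600.0,                        # seconds
)

resp = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```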
There was a post here: https://github.com/assafelovic/gpt-researcher/issues/395 - use lm-studio for llama3; for embeddings, install ollama with some small model (lm-studio had embeddings too, but a different API format). I'm using...
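For reference, the embeddings half of that split setup looks roughly like this: requests go to ollama's native embeddings endpoint rather than an OpenAI-style one. The model name below is just an example; any small embedding model pulled into ollama works:

```python
import requests

# ollama's native embeddings API (default port 11434)
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "hello world"},
)
embedding = resp.json()["embedding"]
print(len(embedding))  # dimensionality depends on the model
```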
(I'm not in any way positioned to implement the quantization support, but wanted to share some notes with those planning to work on it.) Background: I thought the tinygrad example already...
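For anyone picking this up, the core of weight-only quantization is small. A minimal symmetric int8 sketch in numpy (per-tensor scaling for brevity; real implementations usually scale per row or per group):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~= q * scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())  # small reconstruction error
```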
Thank you @AlexCheema! Ack on 1. I realize now these are static numbers. If these are determined dynamically, it seems sensible to also establish bus bandwidth and GPU memory bandwidth -...
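A crude way to establish such numbers dynamically is to time a large copy. A sketch for host memory bandwidth (a real probe would also time host-to-device transfers and device-side copies):

```python
import time
import numpy as np

def host_mem_bandwidth_gbps(size_mb: int = 512, iters: int = 5) -> float:
    """Estimate host memory bandwidth by timing large array copies."""
    buf = np.frombuffer(np.random.bytes(size_mb * 2**20), dtype=np.uint8)
    best = float("inf")
    for _ in range(iters):
        t0 = time.perf_counter()
        _ = buf.copy()  # reads src and writes dst
        best = min(best, time.perf_counter() - t0)
    # a copy touches ~2x the buffer size (read + write)
    return 2 * size_mb / 1024 / best

print(f"~{host_mem_bandwidth_gbps():.1f} GB/s")
```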
Thank you @AlexCheema! On 3 - this approach seems limited to 2 processes; we still need something different for when there are >2 instances. I tried to put each...
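On the multi-instance point, one pattern that generalizes past 2 processes is launching one child process per GPU with CUDA_VISIBLE_DEVICES pinned per child, so each instance sees exactly one device. A sketch (worker.py is hypothetical, standing in for whatever each instance runs):

```python
import os
import subprocess

NUM_GPUS = 4  # assumption: one instance per local GPU

procs = []
for gpu in range(NUM_GPUS):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu)  # each child sees only this GPU
    procs.append(subprocess.Popen(["python", "worker.py"], env=env))

for p in procs:
    p.wait()
```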
Hey @assafelovic, it is hardly rational, I know - it is an 'operating in a broken world' kind of thing. I found some odd issues where APIs wouldn't handle correctly...