phalexo
ollama now provides an OpenAI-compatible API, so you don't need to use litellm any longer. That said, using local models comes with a different problem, i.e. they don't produce...
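A minimal sketch of what that looks like, using the standard openai Python client pointed at a local ollama server (the base URL and dummy API key follow ollama's OpenAI-compatible endpoint convention; the model name "notus" is just an example):

```python
# Sketch: the openai client talking to a local ollama server via its
# OpenAI-compatible endpoint. "notus" is an example model name.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="notus",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```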
If you wanted to do the work, you could probably set up a process pool, which would take work off its own queue, and then manage multiple ollama(s) running against...
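A rough sketch of that idea, assuming you have already started several ollama servers on different ports (the ports, model name, and prompts here are placeholders):

```python
# Rough sketch only: a pool of worker processes, each pulling prompts off a
# shared queue and sending them to its "own" ollama instance.
import multiprocessing as mp
import requests

PORTS = [11434, 11435, 11436]  # one ollama instance per port (assumption)

def worker(port, jobs, results):
    while True:
        prompt = jobs.get()
        if prompt is None:          # sentinel: no more work
            break
        r = requests.post(
            f"http://localhost:{port}/api/generate",
            json={"model": "notus", "prompt": prompt, "stream": False},
        )
        results.put(r.json().get("response", ""))

if __name__ == "__main__":
    jobs, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(p, jobs, results)) for p in PORTS]
    for p in procs:
        p.start()
    prompts = ["prompt one", "prompt two", "prompt three"]
    for prompt in prompts:
        jobs.put(prompt)
    for _ in procs:                 # one sentinel per worker
        jobs.put(None)
    for _ in prompts:
        print(results.get())
    for p in procs:
        p.join()
```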
> I have the same problem; fine-tuning one step takes me about one hour (8 A800 80GB GPUs). I think the problem is that 'accelerate', although it distributes weights to different...
p.s. It is NOT actually implementing anything. It is simply dishing out useless advice to do it yourself.
> I tried to run the model with a CPU-only Python driver file but unfortunately it always failed after a few attempts. And here is my adapted file: > >...
You can create a callback and clear the cache every now and then, and maybe do gc.collect(). To improve performance, the allocator "refuses" to let cached memory go, i.e. an OOM....
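A sketch of that callback, assuming a Hugging Face transformers Trainer; the every-200-steps interval is an arbitrary choice, tune it for your workload:

```python
# Sketch: periodically drop Python garbage and release cached CUDA blocks
# during training, via a transformers TrainerCallback.
import gc
import torch
from transformers import TrainerCallback

class ClearCacheCallback(TrainerCallback):
    def __init__(self, every_n_steps=200):
        self.every_n_steps = every_n_steps

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % self.every_n_steps == 0:
            gc.collect()                 # drop unreachable Python objects
            torch.cuda.empty_cache()     # return cached CUDA memory to the driver
        return control

# trainer = Trainer(..., callbacks=[ClearCacheCallback(every_n_steps=200)])
```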
This is what I have in my notus_modelfile:

FROM /opt/data/data/TheBloke/notus-7B-v1-GGUF/notus-7b-v1.Q6_K.gguf
PARAMETER temperature 1
PARAMETER stop

Then you run `ollama create notus -f notus_modelfile`, and then `ollama run notus` or litellm...
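If you go the litellm route instead of the CLI, a minimal sketch (the api_base assumes the default local ollama port, and the model name matches the `ollama create` above):

```python
# Sketch: calling the newly created "notus" model through litellm's
# ollama provider rather than "ollama run".
from litellm import completion

response = completion(
    model="ollama/notus",
    api_base="http://localhost:11434",
    messages=[{"role": "user", "content": "Hello, notus!"}],
)
print(response.choices[0].message.content)
```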
Yes, it does exactly this. The only conjecture I have is that it is overrunning its rather small context length of 8K. I have seen this many times using it with gpt-pilot; it...
> Hey guys, it's happening when you hit the context size (which is set to 2048). You can increase the context as a workaround w/ `/set parameter num_ctx 8192`...
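For non-interactive use, the same context-size override can, as far as I know, be passed per request through ollama's REST API options; a small sketch (model name and prompt are placeholders):

```python
# Sketch: raising num_ctx per request via ollama's /api/generate options,
# instead of the interactive /set command.
import requests

r = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "notus",
        "prompt": "Summarize this long document ...",
        "stream": False,
        "options": {"num_ctx": 8192},   # context window in tokens
    },
)
print(r.json()["response"])
```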
Likely a bug that was introduced in a later version. Try version 0.1.11.