Oleg Klimov
First token, 551-token prompt:

* 1172ms on M1
* 25404ms on Xeon 5315Y

I'd say that's the main problem for adoption of this. A 551-token prompt isn't even that big, normally...
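To put those numbers in perspective, a quick back-of-the-envelope calculation using only the figures quoted above shows the per-token prompt-processing gap:

```python
# Back-of-the-envelope math from the numbers quoted above.
prompt_tokens = 551
m1_ms = 1172      # time to first token on M1
xeon_ms = 25404   # time to first token on Xeon 5315Y

print(f"M1:   {m1_ms / prompt_tokens:.2f} ms per prompt token")    # ~2.13 ms
print(f"Xeon: {xeon_ms / prompt_tokens:.2f} ms per prompt token")  # ~46.11 ms
print(f"Ratio: {xeon_ms / m1_ms:.1f}x slower on the Xeon")         # ~21.7x
```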
I tried Starcoder 1b, converted by TabbyML: https://huggingface.co/TabbyML/StarCoder-1B/tree/main/ggml

```
"-m", "starcoder-1b-q8_0.gguf",
 897.71 ms /  557 tokens (    1.61 ms per token,   620.47 tokens per second)
1334.68 ms /   49 runs...
```
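For anyone who wants to reproduce a timing like this without the CLI, here is a minimal sketch using the llama-cpp-python bindings (an assumption on my part; the numbers above come from the llama.cpp binary itself, and the prompt string here is made up):

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

# Load the same q8_0 GGUF file referenced above.
llm = Llama(model_path="starcoder-1b-q8_0.gguf")

prompt = "def fibonacci(n):"  # hypothetical prompt, just for timing
start = time.time()
out = llm(prompt, max_tokens=49)  # 49 generated tokens, as in the log above
print(f"total: {(time.time() - start) * 1000:.0f} ms")
print(out["choices"][0]["text"])
```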
OK it works nicely! So all the credit goes to @ds5t5, right?
@teleprint-me oh I see you've converted the 1.6b model in several quantizations, thank you for that! (I thought your tests were for llama, the name is confusing)
Makes sense!
Interesting!

> This approach is better for summarizing and identifying context

Are you saying Atom might help to fill the model's context, to help it come up with a better...
It's kind of expected: unless we want to hack into the model download process or parse its text output, we don't have a way to forward the progress into the GUI...
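For the record, the "parse text output" option could look something like this rough sketch, with a made-up downloader command and progress format (none of this is actual refact code):

```python
import re
import subprocess

# Hypothetical progress line format, e.g. "progress: 42%".
# The real tool and its output would differ; this only sketches the approach.
PROGRESS_RE = re.compile(r"(\d{1,3})%")

proc = subprocess.Popen(
    ["model-downloader", "--model", "some-model"],  # made-up command
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
)
for line in proc.stdout:
    match = PROGRESS_RE.search(line)
    if match:
        percent = int(match.group(1))
        # Forward to the GUI here (websocket, callback, etc.).
        print(f"download progress: {percent}%")
```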
We have this PR: https://github.com/smallcloudai/refact/pull/252 which we'll test somehow (we don't have any AMD GPUs), or at least we'll set up an auto docker build and someone will test it :D
Maybe when rebooting the computer, @psyrtsov wants self-hosting to auto-start again?
We have sharding, so this should be solved! (not yet in docker as of today)
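For context, one common way to shard a model across several GPUs is HF transformers' `device_map="auto"` (backed by accelerate). This is an assumption about the setup, not necessarily how refact implements its sharding:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "auto" lets accelerate split the weights across all visible GPUs.
# Model name is illustrative; substitute whatever the server actually loads.
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoderbase-1b",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoderbase-1b")
```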