
Run on PC

littlecat-dev opened this issue 1 year ago • 22 comments

Maybe a stupid question, but how much RAM and VRAM, and what processor, do you need to run this? :D

littlecat-dev commented Mar 17, 2024

300B parameters, so I am not hopeful. I have 64 GB of RAM and doubt I could run this even with 16 GB of my VRAM on top, even quantized to like 1 bit lmao. I would like the older Grok-0 as well, to at least have something to play with.

nonetrix commented Mar 17, 2024

~630 GB of VRAM at FP16, maybe 700. It's a crapshoot whether it'll run on 8 H100s, and I don't think you can run it on CPU until it gets GGUF'd.

alice-comfy commented Mar 17, 2024
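For rough sizing, weight memory is just parameter count times bits per weight. A minimal Python sketch (variable names are mine; this ignores KV cache, activations, and runtime overhead, which is roughly where the extra headroom in the estimate above goes):

PARAMS = 314e9  # Grok-1 parameter count
for name, bits in (("FP16", 16), ("INT8", 8), ("4-bit", 4), ("BitNet b1.58", 1.58)):
    print(f"{name}: {PARAMS * bits / 8 / 2**30:,.0f} GiB")
# FP16: 585 GiB, INT8: 292 GiB, 4-bit: 146 GiB, BitNet b1.58: 58 GiB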

I doubt xAI will do it, but when the BitNet code comes out, a 200B version with BitNet would be nice, maybe even 120B. I think I could run at least one of those, since I have already loaded 120B models on this system quantized to hell and back.

nonetrix commented Mar 17, 2024

Would quantizing to .gguf and using a terabyte of RAM help? 🙃

NeuroDonu commented Mar 17, 2024

We would need to wait for GGUF support to be added and merged. Once that is done, those with 256 GB of RAM might have a chance, MAYBE 128 GB, but I am doubtful. That is just my guess, though, based on my experience with really bad 120B models made by merging two Llama 2 models with layer stacking. The good news is that since the model is so big, quality should hold up pretty well under quantization.

nonetrix commented Mar 17, 2024

If TheBloke is still doing model quantization, you could ask him. I'll eventually try to do this myself, but I'm not sure it will work out well.

NeuroDonu commented Mar 17, 2024

GGUF support needs to be added first; without it, trying is a waste of time, unless you feel like writing some C to make it work, which by all means do if you can, this definitely isn't meant to discourage that. llama.cpp doesn't recognize the architecture, so it has no idea what to do with the model. I just don't want you to waste your time.

nonetrix commented Mar 17, 2024

@nonetrix https://github.com/ggerganov/llama.cpp/issues/6120

fakerybakery commented Mar 17, 2024

Also see #21, maybe we can get the older 33B model at least. Edit: nope lol

nonetrix commented Mar 17, 2024

It's 314B parameters in int8, so you would need 314 GB of memory just to load the weights, plus some more for things like the KV cache.

stduhpf commented Mar 18, 2024
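The KV-cache part can be estimated from the configuration published in the xai-org/grok-1 README (64 layers, 8 key/value heads, head size 128, 8192-token context). A back-of-the-envelope Python sketch, assuming an fp16 cache:

layers, kv_heads, head_dim, ctx = 64, 8, 128, 8192
per_token = 2 * layers * kv_heads * head_dim * 2  # K and V, 2 bytes each at fp16
print(per_token / 1024, "KiB per token")                # 256.0 KiB
print(per_token * ctx / 2**30, "GiB at full context")   # 2.0 GiB

So the cache is tiny next to the 314 GB of weights.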

I have a PC with 256 GB of RAM, and I'm waiting for GGUF.

rankaiyx commented Mar 18, 2024

It’s time to start selecting and purchasing new large memory devices. :D

soulteary commented Mar 18, 2024

My motherboard can only support 64 GB and I've already maxed that out. I might be able to run up to 128 GB out of spec, but that's probably still not enough; the chipset and CPU support it, Gigabyte just says the board doesn't. I would have to get a Threadripper workstation build just for 0.5 tokens a second.

nonetrix commented Mar 18, 2024

I hope we will get exact answers here: https://github.com/xai-org/grok-1/issues/62

konard commented Mar 18, 2024

I have a PC with 16 GB of RAM, and I'm waiting for GGUF.

dockercore commented Mar 18, 2024

Hey, if you want a small taste, there is now a smaller model fine-tuned on this model. It has the same personality as Grok, but it's not as smart, of course :3

https://huggingface.co/HuggingFaceH4/mistral-7b-grok

nonetrix commented Mar 20, 2024

Wow, I will try it, thanks!

littlecat-dev commented Mar 20, 2024

The only problem is there's a bug in the dataset, so it thinks everything is illegal. Also, this model is a base model, not instruct-tuned.

fakerybakery commented Mar 20, 2024
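For anyone who wants to try it anyway, here is a minimal sketch using Hugging Face transformers, treating it as a plain base model per the note above (the prompt and generation settings are arbitrary; a 7B model needs roughly 15 GB of memory at fp16):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceH4/mistral-7b-grok"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tok("I believe the meaning of life is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=100)
print(tok.decode(out[0], skip_special_tokens=True))

Note that device_map="auto" requires the accelerate package.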

GGUF has arrived! Measured directly, the IQ3_XS quantization requires 124 GB of memory, which means a machine with 128 GB of RAM can work!

rankaiyx commented Mar 25, 2024
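That 124 GB figure is consistent with simple params-times-bits-per-weight arithmetic; a quick Python check against the llama.cpp load log below:

params = 316.49e9  # "model params" from the load log
bpw = 3.28         # effective bits per weight of this IQ3_XS file
print(params * bpw / 8 / 2**30, "GiB")  # ~120.9 GiB, vs. 120.73 GiB reported

with the remaining few GB going to KV cache and runtime buffers.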

https://huggingface.co/Arki05/Grok-1-GGUF

$ ./main -m ../gguf/grok-1/grok-1-IQ3_XS-split-00001-of-00009.gguf -s 12346 -n 100 -t 32 -p "I believe the meaning of life is"

llm_load_print_meta: model type = 314B
llm_load_print_meta: model ftype = IQ3_XS - 3.3 bpw
llm_load_print_meta: model params = 316.49 B
llm_load_print_meta: model size = 120.73 GiB (3.28 BPW)
llm_load_print_meta: general.name = Grok
llm_load_print_meta: BOS token = 1 '[BOS]'
llm_load_print_meta: EOS token = 2 '[EOS]'
llm_load_print_meta: UNK token = 0 '[PAD]'
llm_load_print_meta: PAD token = 0 '[PAD]'
llm_load_print_meta: LF token = 79 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.81 MiB
llm_load_tensors: CPU buffer size = 16716.66 MiB
llm_load_tensors: CPU buffer size = 14592.75 MiB
llm_load_tensors: CPU buffer size = 14484.75 MiB
llm_load_tensors: CPU buffer size = 14901.35 MiB
llm_load_tensors: CPU buffer size = 14714.18 MiB
llm_load_tensors: CPU buffer size = 14493.75 MiB
llm_load_tensors: CPU buffer size = 14484.75 MiB
llm_load_tensors: CPU buffer size = 15250.88 MiB
llm_load_tensors: CPU buffer size = 3990.96 MiB

I believe the meaning of life is to be the best you can be and to make a positive difference in the world.

This is the story of how I discovered my life’s purpose and how I was able to make a positive difference to people’s lives.

I was born in 1959, and I have always been a very curious child. I was always interested in the world around me, and I wanted to know how things worked.

My parents encouraged my curiosity, and they bought me a lot

llama_print_timings: load time = 75099.36 ms
llama_print_timings: sample time = 12.02 ms / 100 runs ( 0.12 ms per token, 8318.08 tokens per second)
llama_print_timings: prompt eval time = 5213.81 ms / 7 tokens ( 744.83 ms per token, 1.34 tokens per second)
llama_print_timings: eval time = 108333.24 ms / 99 runs ( 1094.28 ms per token, 0.91 tokens per second)
llama_print_timings: total time = 113705.85 ms / 106 tokens
Log end
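Reading the timing lines: generation speed is eval runs divided by eval wall time.

eval_ms, runs = 108333.24, 99
print(runs / (eval_ms / 1000), "tokens per second")  # ~0.91

That lands right around the 0.5-1 token/s CPU guesses earlier in the thread.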

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 128792 MB
node 0 free: 128333 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 129015 MB
node 1 free: 4537 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

rankaiyx commented Mar 25, 2024
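One caveat from the numactl output above: the model landed almost entirely on node 1 (only ~4.5 GB free there, versus ~125 GB free on node 0), so half the cores pay the cross-node distance of 21 for every weight read. Depending on the llama.cpp build, interleaving the allocation across both nodes may help; the flags below are a suggestion to benchmark, not a verified speedup:

$ numactl --interleave=all ./main -m grok-1-IQ3_XS-split-00001-of-00009.gguf -t 32 -p "I believe the meaning of life is"

llama.cpp also has its own --numa option worth trying.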

My 256 GB of memory (8× 32 GB DDR3-1866 sticks) comes from decommissioned server parts. In total, they cost me only 640 RMB (about $88).

rankaiyx commented Mar 25, 2024

Grok's output seems to have something strange mixed in.

The department store entrusts the handling company to transport 1000 glass vases, and the freight for each glass vase is 1.50 yuan. If one is broken, this one will not only not pay the freight, but the handling company will also pay 9.50 yuan. The department store finally paid 1456 yuan. How many vases were broken during the handling?

The student answered the question: The department store entrusted the handling company to transport 1000 glass vases, and the freight for each glass vase was 1.50 yuan. If one was broken, this one would not only not pay the freight, but the handling company would also pay 9.50 yuan. The department store finally paid 1456 yuan, that is, 1456 yuan - 1000 yuan * 1.50 yuan = 456 yuan, which is equal to 9.50 yuan * n. It can be seen that n = 48, that is, the number of broken glass is 48, and the number of intact glass is 1000-48=952.

Mao Zedong's 100th Birthday
A big star in the sky, shining the whole universe! A great man of a generation, leading China to the light!

How to find the sum of the first n terms of a geometric sequence How to find the sum of the first n terms of a geometric sequence What is the sum of the first 100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

llama_print_timings: load time = 1635.08 ms
llama_print_timings: sample time = 79.24 ms / 500 runs ( 0.16 ms per token, 6309.79 tokens per second)
llama_print_timings: prompt eval time = 45058.73 ms / 83 tokens ( 542.88 ms per token, 1.84 tokens per second)
llama_print_timings: eval time = 453951.70 ms / 499 runs ( 909.72 ms per token, 1.10 tokens per second)
llama_print_timings: total time = 499935.16 ms / 582 tokens

rankaiyx commented Mar 25, 2024