
Hardware requirements

Open Konard opened this issue 5 months ago • 18 comments

What are the minimum and recommended hardware requirements for running the model and for training it?

  1. How much GPU Memory (VRAM) is required?
  2. How much RAM is required?
  3. What GPUs are recommended?
  4. What CPUs are recommended?
  5. Can it be run on a single machine, or is a cluster required?

At https://huggingface.co/xai-org/grok-1 it is written:

Due to the large size of the model (314B parameters), a multi-GPU machine is required to test the model with the example code.

What does "multi-GPU machine" mean exactly?

Also, it looks like the model weights themselves are about 296.38 GB, so more than 300 GB of storage is required. Does it have to be an SSD, or will an HDD be enough? And does that mean a minimum of 300 GB of VRAM is also required?

And the README.md in this repository says: https://github.com/xai-org/grok-1/blob/e50578b5f50e4c10c6e7cff31af1ef2bedb3beb8/README.md?plain=1#L17

What does "machine with enough GPU memory" mean exactly?

Please specify the answer in the README.md on both GitHub and Hugging Face; it will save people a lot of time. Users need this information to decide whether it is feasible to run the model with the resources available to them.

It would also be useful to keep track of tested hardware, so users know in advance whether their hardware can run the model without problems.

Update 2024-03-19: it looks like we have confirmation that 8 GPUs are required.

Konard avatar Mar 18 '24 05:03 Konard
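A rough back-of-the-envelope check (an editorial sketch, not an official figure): inference needs at least as much accelerator memory as the checkpoint occupies on disk, plus headroom for activations and the KV cache, so the ~296 GB of weights already push past 300 GB of combined VRAM.

```python
# Back-of-the-envelope memory estimate (illustrative only; actual usage depends
# on batch size, context length, and runtime overhead).
CHECKPOINT_GB = 296.38   # reported size of the released grok-1 weights on disk
OVERHEAD = 1.2           # assumed ~20% headroom for activations / KV cache

needed_gb = CHECKPOINT_GB * OVERHEAD
cluster_gb = 8 * 80      # e.g. 8 x A100 80 GB, the setup reported later in this thread

print(f"~{needed_gb:.0f} GB of accelerator memory needed (rough lower bound)")
print(f"8 x 80 GB GPUs provide {cluster_gb} GB -> {'enough' if cluster_gb >= needed_gb else 'not enough'}")
```

Treat this as a lower bound; the actual usage reported further down in the thread (about 524 GB on 8x A100 80 GB) is noticeably higher.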

many many gpus.

bot66 avatar Mar 18 '24 05:03 bot66

A GH200 datacenter rig, which costs millions ;)

yhyu13 avatar Mar 18 '24 07:03 yhyu13

If you don't know what 300 GB of VRAM is, you have a lot to learn before trying to run this model.

You need 8 of these.... https://www.amazon.com/NVIDIA-Ampere-Passive-Double-Height/dp/B09N95N3PW

dabeckham avatar Mar 18 '24 07:03 dabeckham

Is the Jetson AGX Orin Developer Kit capable of running this monster model?

Martinho0330 avatar Mar 18 '24 08:03 Martinho0330

The question is whether the hardware requirements are an issue that can be fixed. Otherwise, in my eyes, making it "open source" only means making it available to businesses or, in rare cases, to individuals with the hardware to run it. Or was it just a publicity move in relation to the OpenAI lawsuit...

Nick-G1984 avatar Mar 18 '24 09:03 Nick-G1984

Looks like the magnet download is soooooo big: 256 GB, and only 2.2% downloaded so far.

hunter-xue avatar Mar 18 '24 11:03 hunter-xue

@dabeckham I don't know why people are starring it; no one has tested it, it just went viral. This is not a release for us, but only for Google, Microsoft, AWS, etc. Who can provide 300+ GB of GPU memory???

MuhammadShifa avatar Mar 18 '24 12:03 MuhammadShifa

since the RTX 4090 only has 24 GB of VRAM...

xiaosagemisery avatar Mar 18 '24 12:03 xiaosagemisery

@MuhammadShifa It will be possible to run this on the CPU once support is added to llama.cpp and someone releases 4-bit (or lower) quantized weights. You will need around 256 GB RAM, which is a lot more reasonable for a normal user than needing this much VRAM.

david-jk avatar Mar 18 '24 13:03 david-jk
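To put rough numbers on the quantization options mentioned above, the usual first-order estimate is parameter count times bytes per weight, ignoring context and runtime overhead (which is why the 256 GB figure quoted above is higher than the raw 4-bit size). A minimal sketch, assuming the full 314B parameter count:

```python
# First-order RAM/VRAM estimate for a 314B-parameter model at different
# quantization levels. Ignores KV cache and runtime overhead, so these are
# lower bounds, not exact requirements.
N_PARAMS = 314e9

for name, bits in [("fp16/bf16", 16), ("int8", 8), ("int4", 4)]:
    gb = N_PARAMS * bits / 8 / 1e9
    print(f"{name:10s} ~{gb:,.0f} GB")
```

At 4-bit that comes to roughly 157 GB of weights alone, which is consistent with the ~256 GB of RAM suggested above once overhead is included.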

This looks interesting: https://github.com/xai-org/grok-1/issues/42 Speculation is that it could come down to about 96 GB of VRAM if the model can be made to work with 4-bit quantization for the ggml library. Not sure how nicely ggml plays with JAX, though.

davidearlyoung avatar Mar 18 '24 16:03 davidearlyoung

Looks like 8 * A100 GPUs with 80 GB VRAM each are not enough by themselves either: https://github.com/xai-org/grok-1/issues/125

Konard avatar Mar 18 '24 17:03 Konard

Looks like the magnet download is soooooo big: 256 GB, and only 2.2% downloaded so far.

@hunter-xue, did you mean 296 GB?

Konard avatar Mar 18 '24 21:03 Konard

Ran it on 8x A100 80 GB with the code in this repo (no modification, I just added a loop to get input from the terminal). It used 524 GB of VRAM during single-batch inference with almost no context (10-100 input tokens), and the speed was only 7 tokens per second. #168

jussker avatar Mar 19 '24 07:03 jussker
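If you do have access to a multi-GPU machine, a quick sanity check before loading anything is to confirm that JAX actually sees all the accelerators, since the unmodified example code expects an 8-way device mesh. A minimal sketch:

```python
# Sanity check: list the accelerators JAX can see before attempting to load
# the checkpoint (the unmodified example code expects 8 devices).
import jax

devices = jax.devices()
print(f"JAX backend: {jax.default_backend()}, devices found: {len(devices)}")
for d in devices:
    print(" ", d)

if len(devices) < 8:
    print("Fewer than 8 accelerators visible -- the example code will likely "
          "fail to shard the weights without modification.")
```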

Can it be run on cloud hardware?

SavvyClique avatar Mar 20 '24 03:03 SavvyClique

This looks interesting: #42 Speculation is that it could come down to about 96 GB of VRAM if the model can be made to work with 4-bit quantization for the ggml library. Not sure how nicely ggml plays with JAX, though.

Just found this in relation to my last post in this thread: https://huggingface.co/eastwind/grok-1-hf-4bit

It looks to be about 90.2 GB on disk if you add up the safetensors file shards from the eastwind repo on Hugging Face. There may be additional overhead that requires a bit more memory for inference, but it is promising all the same. I hope grok-1 quantizes to 4-bit well. Fingers crossed.

davidearlyoung avatar Mar 23 '24 23:03 davidearlyoung
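Adding up the shard sizes of a local download is an easy way to repeat that check. A small sketch, where the directory path and file pattern are placeholders for wherever the weights actually live:

```python
# Sum the on-disk size of downloaded weight shards to gauge the minimum memory
# needed just to hold the model. Path and glob pattern are placeholders.
from pathlib import Path

weights_dir = Path("./grok-1-4bit")                  # hypothetical download location
shards = sorted(weights_dir.glob("*.safetensors"))   # adjust for *.bin / *.gguf shards

total_bytes = sum(f.stat().st_size for f in shards)
print(f"{len(shards)} shard(s), {total_bytes / 1e9:.1f} GB on disk")
```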

Following the calculation from https://www.substratus.ai/blog/calculating-gpu-memory-for-llm/ you would need

  • 94.2 GB for the 4-bit model
  • 188.4 GB for the 8-bit model

I've just stumbled upon this article from VMware, where you can run open-source models in the cloud(s): https://www.vmware.com/products/vsphere/ai-ml.html#democratize

stoic-analytics avatar Mar 28 '24 10:03 stoic-analytics
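For reference, the formula behind that blog post is roughly M = (P × 4 bytes) / (32 / Q) × 1.2, with P the parameter count in billions, Q the bit width, and 1.2 a ~20% overhead factor. A small sketch of it below; note that plugging in the full 314B parameter count gives larger figures than the ones quoted above, which match the formula only for roughly half that parameter count:

```python
# GPU memory estimate following the formula cited above:
#   M = (P * 4 bytes) / (32 / Q) * 1.2
# P = parameters in billions, Q = bits per weight, 1.2 = ~20% overhead.
def gpu_mem_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    return params_billion * 4 / (32 / bits) * overhead

for bits in (4, 8, 16):
    print(f"{bits:2d}-bit: ~{gpu_mem_gb(314, bits):.1f} GB")
```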

Following the calculation from https://www.substratus.ai/blog/calculating-gpu-memory-for-llm/ you would need

  • 94.2 GB for the 4-bit model
  • 188.4 GB for the 8-bit model

I've just stumbled upon this article from VMware, where you can run open-source models in the cloud(s): https://www.vmware.com/products/vsphere/ai-ml.html#democratize

This formula is for rough calculations, as far as I can see, which is great for theory and rough planning.

But in real life, actual use of the model will involve many small nuances across different situations, and these can add up enough to change the picture to the point where it matters, whether you run it quantized or straight at any of the common float precisions.

It's a huge model, and I think most people paying attention are curious as spectators, which I admit includes me. This is exciting and interesting stuff!

From what I've seen from others since my last post, the most reachable and performant option for low memory use looks to be roughly 110 to 120+ GB for a quantized version. That's just for disk space and for loading the quantized model into memory (see https://huggingface.co/Arki05/Grok-1-GGUF for example); it will likely balloon a bit further in memory for basic forward passes.

That might be a tight fit for Apple CPU inference with 128 GB of RAM. Still asking a lot.

davidearlyoung avatar Mar 31 '24 18:03 davidearlyoung
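Before pulling down 100+ GB of GGUF shards, it's easy to check whether the machine's RAM is even in the right ballpark. A minimal sketch, assuming psutil is installed and using the rough 120 GB estimate discussed above (not an official number):

```python
# Compare the estimated in-memory model size against the RAM actually
# available on this machine. Requires: pip install psutil.
import psutil

MODEL_GB = 120   # rough estimate for a low-bit GGUF quant, per the thread above
avail_gb = psutil.virtual_memory().available / 1e9

print(f"Available RAM: {avail_gb:.0f} GB, estimated model size: {MODEL_GB} GB")
if avail_gb < MODEL_GB * 1.1:   # small margin for KV cache and the OS
    print("Likely too tight for CPU inference without heavy swapping.")
```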

@MuhammadShifa It will be possible to run this on the CPU once support is added to llama.cpp and someone releases 4-bit (or lower) quantized weights. You will need around 256 GB RAM, which is a lot more reasonable for a normal user than needing this much VRAM.

The maximum amount of RAM I can squeeze into my AM5 board at the moment is 192 GB. Do you think it is feasible to get it running with that?

Shensen1 avatar May 15 '24 19:05 Shensen1