grok-1
Was anyone able to run this on their PC on Windows?
Trying to run this on Windows, getting tons of errors... Also, what GPU is the minimum requirement for this?
Apparently it needs 8 GPUs (80GB VRAM each) to run.
You'll need to download the ~300GB of weight files using a torrent client; if you have the space and time, have fun. I would recommend qBittorrent since it's open-source.
Apparently it needs 8 GPUs (80GB VRAM) to run.
Say less. That's 3x 4090s.
I have two 3090s, a 4060, and a 3060... I don't see this happening, lol. And that's 4x 4090s you would need, FYI :P
Apparently it needs 8 GPUs (80GB VRAM) to run.
Say less. That's 3x 4090s.
No, it's not 80GB of VRAM total; it's 8 GPUs with 80GB of VRAM each (typically A100s). 4x 4090s would barely be enough to hold the model's weights in VRAM at 4-bit quantization, not to run it (that is, without offloading to/from the CPU).
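For the curious, the arithmetic is simple. A minimal sketch in Python (assuming the announced ~314B parameter count; activations and KV cache are ignored, so real VRAM needs are higher):

# Back-of-the-envelope memory for the weights alone.
PARAMS = 314e9  # announced parameter count

def weights_gb(bits_per_param: int) -> float:
    return PARAMS * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weights_gb(bits):.0f} GB")
# 16-bit: ~628 GB, 8-bit: ~314 GB (roughly the ~300GB download), 4-bit: ~157 GB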
80GB each, well, fml
So we need to start a project to get this thing broken down and running on Petals.
Apparently it needs 8 GPUs (80GB VRAM) to run.
Say less. That's 3x 4090s.
No, it's not 80GB of VRAM total; it's 8 GPUs with 80GB of VRAM each (typically A100s). 4x 4090s would barely be enough to hold the model's weights in VRAM at 4-bit quantization, not to run it (that is, without offloading to/from the CPU).
Thanks for the correction.
Apparently not.
Nice to meet all the active people
Apparently it needs 8 GPUs (80GB VRAM) to run.
Say less. That's 3x 4090s.
No, it's not 80GB of VRAM total; it's 8 GPUs with 80GB of VRAM each (typically A100s). 4x 4090s would barely be enough to hold the model's weights in VRAM at 4-bit quantization, not to run it (that is, without offloading to/from the CPU).
The hardware to run this is nuts. An A100 costs around $12k each, if you can find one. Just thinking about the PCIe expanders to connect all of them up on a motherboard gives me goosebumps. An NVIDIA DGX Station with 4x A100 GPUs starts around $120k. Please post if anyone is actually trying this, but this is way beyond most people.
I think for a language model of this size, GPU workstations are needed to make it run smoothly. So yeah, you have to pay a lot to run this.
INFO:rank:(1, 256, 6144)
INFO:rank:(1, 256, 131072)
INFO:rank:State sharding type: <class 'model.TrainingState'>
INFO:rank:(1, 256, 6144)
INFO:rank:(1, 256, 131072)
INFO:rank:Loading checkpoint at ./checkpoints/ckpt-0
Then it says some temp file is in use and crashes... that temp folder is another 300GB+ in size.
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'F:\dev\shm\tmpp53ohpcl'
TypeError: dynamic_update_slice update shape must be smaller than operand shape, got update shape (1,) for operand shape (0,).
How about running it on TPU (v4 or v5)?
8x H100 (80GB)
How about running it on TPU (v4 or v5)?
How would most people get a TPU?
How about running it on TPU (v4 or v5)?
How would most people get a TPU?
Google Colab Pro+
Apparently it needs 8 GPUs (80GB VRAM) to run.
Say less. That's 3x 4090s.
No, it's not 80GB of VRAM total; it's 8 GPUs with 80GB of VRAM each (typically A100s). 4x 4090s would barely be enough to hold the model's weights in VRAM at 4-bit quantization, not to run it (that is, without offloading to/from the CPU).
The hardware to run this is nuts. An A100 costs around $12k each, if you can find one. Just thinking about the PCIe expanders to connect all of them up on a motherboard gives me goosebumps. An NVIDIA DGX Station with 4x A100 GPUs starts around $120k. Please post if anyone is actually trying this, but this is way beyond most people.
Yeah, I mean, 314B is a crazy number of parameters. This might be useful for big companies that have the resources to use it. It's clearly not aimed at individuals (at least not with current consumer hardware).
I gave up
I saw in a tweet that it uses just 2 out of 8 experts simultaneously, so it probably will not require 300+ GB of VRAM, just 75 or so. Also, if it gets quantized to a lower precision, it might even fit in 24GB.
I saw in a tweet that it uses just 2 out of 8 experts simultaneously, so it probably will not require 300+ GB of VRAM, just 75 or so. Also, if it gets quantized to a lower precision, it might even fit in 24GB.
You mean these lines in run.py?
# MoE.
num_experts=8,
num_selected_experts=2,
I saw in a tweet that it uses just 2 out of 8 experts simultaneously, so it probably will not require 300+ GB of VRAM, just 75 or so. Also, if it gets quantized to a lower precision, it might even fit in 24GB.
I think the whole model still needs to be loaded into memory, even if routing at inference only uses 2 of the 8 experts per forward pass.
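To make that concrete, here is a minimal top-2 routing sketch in jax.numpy (toy sizes and made-up names, not the repo's implementation): every expert's weights stay resident, even though only two experts are applied per token.

import jax
import jax.numpy as jnp

# Toy MoE layer: 8 experts, top-2 routing per token.
# All 8 expert weight matrices are held in memory; only 2 run per token.
num_experts, d_model = 8, 16
key = jax.random.PRNGKey(0)
k_router, k_experts, k_x = jax.random.split(key, 3)

router_w = jax.random.normal(k_router, (d_model, num_experts))
expert_w = jax.random.normal(k_experts, (num_experts, d_model, d_model))  # resident for all experts
x = jax.random.normal(k_x, (4, d_model))  # 4 tokens

logits = x @ router_w                            # (tokens, experts)
top2_vals, top2_idx = jax.lax.top_k(logits, 2)   # choose 2 of the 8 experts per token
gates = jax.nn.softmax(top2_vals, axis=-1)

selected_w = expert_w[top2_idx]                  # gather only the selected experts' weights
expert_out = jnp.einsum("td,tkde->tke", x, selected_w)
y = jnp.einsum("tk,tke->te", gates, expert_out)  # (tokens, d_model)
print(y.shape)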
I gave up
Same, it's unrealistic in a home environment... and even if we do run it, it's not much use, since it's not fine-tuned or anything.
Grok uses jaxlib. There are no Windows wheels for JAX + CUDA, so it is better to stick with Linux, unless you want to develop and manually compile JAX.
Grok uses jaxlib. There are no Windows wheels for JAX + CUDA, so it is better to stick with Linux, unless you want to develop and manually compile JAX.
I was curious about how JAX plays with everything else in the open-source community. My knowledge of JAX in particular is limited. Good to know this little bit.
I've been watching things unfold here out of curiosity and amusement. Might as well, since I can't run it. I knew this was the case the moment I saw the parameter count around the announcement.
One side note/thought/musing from me on all this: I was a bit surprised to see JAX being used with Grok-1. I'd assumed it would likely be in a format/framework that works with PyTorch. I think many may have better luck running Grok once it's ported to PyTorch, which is a more familiar system for the open community.
Nice to meet you all. Have you used it successfully?
I was a bit surprised to see JAX being used with Grok-1.
If you're going to rebuild everything from scratch, JAX is an excellent choice. JAX was made to be fast and distributed across big compute. PyTorch has only recently been catching up with compile() and functorch.
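As a tiny illustration of that point (a toy example, not anything from this repo): jit compiles a function to XLA, and pmap replicates it across however many devices are visible.

import jax
import jax.numpy as jnp

@jax.jit  # traced once, compiled by XLA
def layer(w, x):
    return jnp.tanh(x @ w)

w = jnp.ones((128, 128))
x = jnp.ones((8, 128))
print(layer(w, x).shape)        # (8, 128)

# Same function replicated across all visible devices (just 1 on a plain CPU box).
n = jax.local_device_count()
xs = jnp.ones((n, 8, 128))
print(jax.pmap(layer, in_axes=(None, 0))(w, xs).shape)   # (n, 8, 128)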
I was a bit surprised to see JAX being used with Grok-1.
If you're going to rebuild everything from scratch, JAX is an excellent choice. JAX was made to be fast and distributed across big compute. PyTorch has only recently been catching up with compile() and functorch.
Those are interesting points. Thanks for sharing.
It sounds like someone may have Grok running, as-is from this repo, on multiple A100s (https://github.com/xai-org/grok-1/discussions/168). The output rate looks a bit slow, but that will likely improve as the model becomes better understood publicly.
@yarodevuci
You can enable the Windows Subsystem for Linux and use WSL to run it. It's Linux inside Windows; you can do everything you want, and CUDA can use the GPUs installed in Windows.
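A quick sanity check from inside WSL (assuming a CUDA-enabled jaxlib is installed there):

import jax

# Should report "gpu" and list your cards if CUDA passthrough from Windows works;
# if it falls back to "cpu", the CUDA wheels or driver setup need another look.
print(jax.default_backend())
print(jax.devices())
print(jax.device_count())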
I just installed the requirements successfully. I cannot load the model because I do not have enough VRAM (just 2x 24GB). When I run "python run.py" I get the error "ValueError: Number of devices 2 must equal the product of mesh_shape (1, 8)", so I assume that by default you need 8 GPUs to load the model.
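For reference, that error is just a device-count check: the product of the mesh shape has to match the number of visible accelerators. A minimal sketch of the same constraint (variable names here are illustrative, not the repo's):

import math
import jax

mesh_shape = (1, 8)              # the 2-D local mesh the error message refers to
needed = math.prod(mesh_shape)   # 8 devices expected
have = jax.device_count()
if have != needed:
    raise ValueError(
        f"Number of devices {have} must equal the product of mesh_shape {mesh_shape}"
    )

So with 2 GPUs the check fails before any weights are even loaded.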