grok-1
Was anyone able to run this on their PC on Windows?
Trying to run this on Windows, getting tons of errors... Also, what GPU is the minimum requirement for this?
Apparently it needs 8 GPUs (80GB VRAM each) to run.
You'll need to download the ~300GB of weight files using a torrent client; if you have the space and time, have fun. I would recommend qBittorrent since it's open-source.
Apparently it needs 8 GPUs (80GB VRAM) to run.
Say less. That's 3x 4090s.
I have two 3090s, a 4060, and a 3060... I don't see this happening, lol. And that's 4x 4090s you would need, FYI :P
Apparently it needs 8 GPUs (80GB VRAM) to run.
Say less. That's 3x 4090s.
No, it's not 80GB of VRAM total; it's 8 GPUs with 80GB of VRAM each (typically A100s). 4x 4090s would barely be enough to hold the model's weights in VRAM at 4-bit quantization, not to run it (that is, without offloading to/from the CPU).
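For the curious, the arithmetic is simple. A minimal sketch in Python (assuming the announced ~314B parameter count; activations and KV cache are ignored, so real VRAM needs are higher):

# Back-of-the-envelope memory for the weights alone.
PARAMS = 314e9  # announced parameter count

def weights_gb(bits_per_param: int) -> float:
    return PARAMS * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weights_gb(bits):.0f} GB")
# 16-bit: ~628 GB, 8-bit: ~314 GB (roughly the ~300GB download), 4-bit: ~157 GB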
80GB each, well, fml
So we need to start a project to get this thing broken down and running on Petals.
Apparently it needs 8 GPUs (80GB VRAM) to run.
Say less. That's 3x 4090s.
No, it's not 80GB of VRAM total; it's 8 GPUs with 80GB of VRAM each (typically A100s). 4x 4090s would barely be enough to hold the model's weights in VRAM at 4-bit quantization, not to run it (that is, without offloading to/from the CPU).
Thanks for the correction.
Apparently not.
Nice to meet all the active people
Apparently it needs 8 GPUs (80GB VRAM) to run.
Say less. That's 3x 4090s.
No, it's not 80GB of VRAM total; it's 8 GPUs with 80GB of VRAM each (typically A100s). 4x 4090s would barely be enough to hold the model's weights in VRAM at 4-bit quantization, not to run it (that is, without offloading to/from the CPU).
The hardware to run this is nuts. An A100 costs around $12k each, if you can find one. Just thinking about the PCIe expanders to connect all of them up on a motherboard gives me goosebumps. An NVIDIA DGX Station with 4x A100 GPUs starts around $120k. Please post if anyone is actually trying this, but this is way beyond most people.
I think for a language model of this size, GPU workstations are needed to make it run smoothly. So yeah, you have to pay a lot to run this.
INFO:rank:(1, 256, 6144)
INFO:rank:(1, 256, 131072)
INFO:rank:State sharding type: <class 'model.TrainingState'>
INFO:rank:(1, 256, 6144)
INFO:rank:(1, 256, 131072)
INFO:rank:Loading checkpoint at ./checkpoints/ckpt-0
Then it says some temp file is in use and crashes... that temp folder is another 300GB+ in size.
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'F:\dev\shm\tmpp53ohpcl'
TypeError: dynamic_update_slice update shape must be smaller than operand shape, got update shape (1,) for operand shape (0,).
How about running it on TPU (v4 or v5)?
8x H100 (80GB)
How about running it on TPU (v4 or v5)?
How would most people get a TPU?
How about running it on TPU (v4 or v5)?
How would most people get a TPU?
Google Colab Pro+
Apparently it needs 8 GPUs (80GB VRAM) to run.
Say less. That's 3x 4090s.
No, it's not 80GB of VRAM total; it's 8 GPUs with 80GB of VRAM each (typically A100s). 4x 4090s would barely be enough to hold the model's weights in VRAM at 4-bit quantization, not to run it (that is, without offloading to/from the CPU).
The hardware to run this is nuts. An A100 costs around $12k each, if you can find one. Just thinking about the PCIe expanders to connect all of them up on a motherboard gives me goosebumps. An NVIDIA DGX Station with 4x A100 GPUs starts around $120k. Please post if anyone is actually trying this, but this is way beyond most people.
Yeah, I mean, 314B is a crazy number of parameters. This might be useful for big companies that have the resources to use it. It's clearly not aimed at individuals (at least not with current consumer hardware).
I gave up
I saw in a tweet that it uses just 2 out of 8 experts simultaneously, so it probably will not require 300+ GB of VRAM, just 75 or so. Also, if it gets quantized to a lower precision, it might even fit in 24GB.
I saw in a tweet that it uses just 2 out of 8 experts simultaneously, so it probably will not require 300+ GB of VRAM, just 75 or so. Also, if it gets quantized to a lower precision, it might even fit in 24GB.
You mean these lines in run.py?
# MoE.
num_experts=8,
num_selected_experts=2,
I saw in a tweet that it uses just 2 out of 8 experts simultaneously, so it probably will not require 300+ GB of VRAM, just 75 or so. Also, if it gets quantized to a lower precision, it might even fit in 24GB.
I think the whole model still needs to be loaded into memory, even if routing at inference only uses 2 of the 8 experts per forward pass.
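To make that concrete, here is a minimal top-2 routing sketch in jax.numpy (toy sizes and made-up names, not the repo's implementation): every expert's weights stay resident, even though only two experts are applied per token.

import jax
import jax.numpy as jnp

# Toy MoE layer: 8 experts, top-2 routing per token.
# All 8 expert weight matrices are held in memory; only 2 run per token.
num_experts, d_model = 8, 16
key = jax.random.PRNGKey(0)
k_router, k_experts, k_x = jax.random.split(key, 3)

router_w = jax.random.normal(k_router, (d_model, num_experts))
expert_w = jax.random.normal(k_experts, (num_experts, d_model, d_model))  # resident for all experts
x = jax.random.normal(k_x, (4, d_model))  # 4 tokens

logits = x @ router_w                            # (tokens, experts)
top2_vals, top2_idx = jax.lax.top_k(logits, 2)   # choose 2 of the 8 experts per token
gates = jax.nn.softmax(top2_vals, axis=-1)

selected_w = expert_w[top2_idx]                  # gather only the selected experts' weights
expert_out = jnp.einsum("td,tkde->tke", x, selected_w)
y = jnp.einsum("tk,tke->te", gates, expert_out)  # (tokens, d_model)
print(y.shape)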
I gave up
Same, it's unrealistic in a home environment... and even if we do run it, it's not much use, since it's not fine-tuned or anything.
Grok uses jaxlib. There are no Windows wheels for JAX + CUDA, so it is better to stick with Linux, unless you want to develop and manually compile JAX.
Grok uses jaxlib. There are no Windows wheels for JAX + CUDA, so it is better to stick with Linux, unless you want to develop and manually compile JAX.
I was curious about how JAX plays with everything else in the open-source community. My knowledge of JAX in particular is limited. Good to know this little bit.
I've been watching things unfold here out of curiosity and amusement. Might as well, since I can't run it. I knew this was the case the moment I saw the parameter count around the announcement.
One side note/thought/musing from me on all this: I was a bit surprised to see JAX being used with Grok-1. I'd assumed it would likely be in a format/framework that works with PyTorch. I think many may have better luck running Grok once it's ported to PyTorch, which is a more familiar system for the open community.
Nice to meet you all. Have you used it successfully?
I was a bit surprised to see JAX being used with Grok-1.
If you're going to rebuild everything from scratch, JAX is an excellent choice. JAX was made to be fast and distributed across big compute. PyTorch has only recently been catching up with compile() and functorch.
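As a tiny illustration of that point (a toy example, not anything from this repo): jit compiles a function to XLA, and pmap replicates it across however many devices are visible.

import jax
import jax.numpy as jnp

@jax.jit  # traced once, compiled by XLA
def layer(w, x):
    return jnp.tanh(x @ w)

w = jnp.ones((128, 128))
x = jnp.ones((8, 128))
print(layer(w, x).shape)        # (8, 128)

# Same function replicated across all visible devices (just 1 on a plain CPU box).
n = jax.local_device_count()
xs = jnp.ones((n, 8, 128))
print(jax.pmap(layer, in_axes=(None, 0))(w, xs).shape)   # (n, 8, 128)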
I was a bit surprised to see JAX being used with Grok-1.
If you're going to rebuild everything from scratch, JAX is an excellent choice. JAX was made to be fast and distributed across big compute. PyTorch has only recently been catching up with compile() and functorch.
Those are interesting points. Thanks for sharing.
It sounds like someone may have Grok running, as-is from this repo, on multiple A100s (https://github.com/xai-org/grok-1/discussions/168). The output rate looks a bit slow, but that will likely improve as the model becomes better understood publicly.
@yarodevuci
You can enable the Windows Subsystem for Linux and use WSL to run it. It's Linux inside Windows; you can do everything you want, and CUDA can use the GPUs installed in Windows.
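A quick sanity check from inside WSL (assuming a CUDA-enabled jaxlib is installed there):

import jax

# Should report "gpu" and list your cards if CUDA passthrough from Windows works;
# if it falls back to "cpu", the CUDA wheels or driver setup need another look.
print(jax.default_backend())
print(jax.devices())
print(jax.device_count())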
I just installed the requirements successfully. I cannot load the model because I do not have enough VRAM (just 2x 24GB). When I run "python run.py" I get the error "ValueError: Number of devices 2 must equal the product of mesh_shape (1, 8)", so I assume that by default you need 8 GPUs to load the model.
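For reference, that error is just a device-count check: the product of the mesh shape has to match the number of visible accelerators. A minimal sketch of the same constraint (variable names here are illustrative, not the repo's):

import math
import jax

mesh_shape = (1, 8)              # the 2-D local mesh the error message refers to
needed = math.prod(mesh_shape)   # 8 devices expected
have = jax.device_count()
if have != needed:
    raise ValueError(
        f"Number of devices {have} must equal the product of mesh_shape {mesh_shape}"
    )

So with 2 GPUs the check fails before any weights are even loaded.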