
Was anyone able to run this on their PC on Windows?

Open yarodevuci opened this issue 11 months ago • 34 comments

Trying to run this on Windows, I'm getting tons of errors... Also, what is the minimum GPU requirement for this?

yarodevuci avatar Mar 17 '24 21:03 yarodevuci

Apparently it needs 8 GPUs (80 GB VRAM each) to run.

nichind avatar Mar 17 '24 21:03 nichind

You'll need to download the ~300 GB of weight files using a torrent client, if you have the space and time. Have fun. I would recommend qBittorrent since it's open-source.

Bordo2000 avatar Mar 17 '24 22:03 Bordo2000

Apparently it needs 8 GPUs (80 GB VRAM) to run.

Say less. That's 3x 4090s.

LZL0 avatar Mar 17 '24 23:03 LZL0

I have two 3090s, a 4060, and a 3060... I don't see this happening, lol. And it's 4x 4090s you would need, FYI :P

cybershrapnel avatar Mar 18 '24 00:03 cybershrapnel

Apparently it needs 8 GPUs (80 GB VRAM) to run.

Say less. That's 3x 4090s.

No, it's not 80 GB of VRAM total, it's 8 GPUs with 80 GB of VRAM each (typically A100s). 4x 4090s would barely be enough to hold the model's weights in VRAM at 4-bit quantization, not to run it (that is, without loading/unloading from the CPU).

Qu3tzal avatar Mar 18 '24 00:03 Qu3tzal

80 GB each, well, fml.

cybershrapnel avatar Mar 18 '24 00:03 cybershrapnel

So we need to start a project to get this thing broken down and running on Petals.
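
For reference, and purely as a hypothetical sketch (Petals does not support Grok-1 today, and the model name below is a placeholder for a port that would have to be written first), the Petals side would look roughly like this:

    # Hypothetical sketch only: "xai-org/grok-1" below is a placeholder; a
    # real Petals port of Grok-1 would have to exist before this could work.
    from transformers import AutoTokenizer
    from petals import AutoDistributedModelForCausalLM

    model_name = "xai-org/grok-1"  # placeholder model id
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

    # Each forward pass would be split across volunteer GPUs in the swarm.
    inputs = tokenizer("Grok, are you there?", return_tensors="pt")
    outputs = model.generate(inputs["input_ids"], max_new_tokens=16)
    print(tokenizer.decode(outputs[0]))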

cybershrapnel avatar Mar 18 '24 00:03 cybershrapnel

Apparently it needs 8 GPUs (80 GB VRAM) to run.

Say less. That's 3x 4090s.

No, it's not 80 GB of VRAM total, it's 8 GPUs with 80 GB of VRAM each (typically A100s). 4x 4090s would barely be enough to hold the model's weights in VRAM at 4-bit quantization, not to run it (that is, without loading/unloading from the CPU).

Thanks for the correction.

LZL0 avatar Mar 18 '24 00:03 LZL0

Apparently not.

Hakureirm avatar Mar 18 '24 00:03 Hakureirm

Nice to meet all the active people

metatron1973 avatar Mar 18 '24 01:03 metatron1973

Apparently it needs 8 GPUs (80 GB VRAM) to run.

Say less. That's 3x 4090s.

No, it's not 80 GB of VRAM total, it's 8 GPUs with 80 GB of VRAM each (typically A100s). 4x 4090s would barely be enough to hold the model's weights in VRAM at 4-bit quantization, not to run it (that is, without loading/unloading from the CPU).

The hardware to run this is nuts. An A100 costs around $12k, if you can find one. Just thinking about the PCIe expanders needed to connect them all to a motherboard gives me goosebumps. An NVIDIA DGX Station with 4x A100 GPUs starts around $120k. Please post if anyone is actually trying this, but it's way beyond most people.

yongatgithub avatar Mar 18 '24 01:03 yongatgithub

I think for a language model of this size, GPU workstations are needed to make it run smoothly. So yeah, you have to pay a lot to run this.

tommyming avatar Mar 18 '24 02:03 tommyming

    INFO:rank:(1, 256, 6144)
    INFO:rank:(1, 256, 131072)
    INFO:rank:State sharding type: <class 'model.TrainingState'>
    INFO:rank:(1, 256, 6144)
    INFO:rank:(1, 256, 131072)
    INFO:rank:Loading checkpoint at ./checkpoints/ckpt-0

Then it says some temp file is in use and crashes... that temp folder is another 300+ GB in size:

    PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'F:\dev\shm\tmpp53ohpcl'
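
If I'm reading that traceback right, the loader stages the checkpoint through a temp file under /dev/shm (hence the F:\dev\shm path when run from an F: drive), and on Windows a named temporary file that is still open can't be reopened by name, which is exactly what WinError 32 means. A minimal sketch of the Windows-only behaviour (not grok-1 code):

    import tempfile

    # On Windows the file behind NamedTemporaryFile is held with exclusive
    # access while it is open, so a second open() by name raises
    # PermissionError: [WinError 32]. On Linux (including WSL) this runs fine.
    with tempfile.NamedTemporaryFile(dir=".", delete=True) as tmp:
        tmp.write(b"fake checkpoint shard")
        tmp.flush()
        with open(tmp.name, "rb") as f:  # fails on Windows
            print(f.read())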

yarodevuci avatar Mar 18 '24 03:03 yarodevuci

TypeError: dynamic_update_slice update shape must be smaller than operand shape, got update shape (1,) for operand shape (0,).

yarodevuci avatar Mar 18 '24 04:03 yarodevuci

How about running it on TPU (v4 or v5)?

innat avatar Mar 18 '24 05:03 innat

8x H100 (80 GB)

amao12580 avatar Mar 18 '24 05:03 amao12580

How about running it on TPU (v4 or v5)?

How would most people get a TPU?

Imsovegetable avatar Mar 18 '24 05:03 Imsovegetable

How about running it on TPU (v4 or v5)?

How would most people get a TPU?

Google Colab Pro+

Qu3tzal avatar Mar 18 '24 06:03 Qu3tzal

Apparently it needs 8 GPUs (80 GB VRAM) to run.

Say less. That's 3x 4090s.

No, it's not 80 GB of VRAM total, it's 8 GPUs with 80 GB of VRAM each (typically A100s). 4x 4090s would barely be enough to hold the model's weights in VRAM at 4-bit quantization, not to run it (that is, without loading/unloading from the CPU).

The hardware to run this is nuts. An A100 costs around $12k, if you can find one. Just thinking about the PCIe expanders needed to connect them all to a motherboard gives me goosebumps. An NVIDIA DGX Station with 4x A100 GPUs starts around $120k. Please post if anyone is actually trying this, but it's way beyond most people.

Yeah, I mean, 314B is a crazy number of parameters. This might be useful for big companies that have the resources to use it, but it's clearly not aimed at individuals (at least not with current consumer hardware).
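
For a rough sense of scale, a back-of-the-envelope estimate of the weight memory alone (ignoring activations, KV cache, and quantization overhead):

    # Rough weight-memory estimate for a 314B-parameter model.
    params = 314e9

    for label, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
        gb = params * bits / 8 / 1e9
        print(f"{label}: ~{gb:.0f} GB just for the weights")

    # bf16: ~628 GB  -> roughly why 8 x 80 GB cards keep being quoted
    # int8: ~314 GB
    # int4: ~157 GB  -> still far beyond any single consumer GPU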

Qu3tzal avatar Mar 18 '24 06:03 Qu3tzal

I gave up

jaysunxiao avatar Mar 18 '24 08:03 jaysunxiao

I saw in a tweet that they said it uses just 2 out of 8 experts simultaneously, so it probably won't require 300+ GB of VRAM, just 75 or so. Also, if it's quantized to a lower size, it might even fit in 24 GB.

DarkInsider avatar Mar 18 '24 09:03 DarkInsider

I saw in a tweet that they said it uses just 2 out of 8 experts simultaneously, so it probably won't require 300+ GB of VRAM, just 75 or so. Also, if it's quantized to a lower size, it might even fit in 24 GB.

You mean these lines in run.py?

            # MoE.
            num_experts=8,
            num_selected_experts=2,

bluevisor avatar Mar 18 '24 15:03 bluevisor

I saw in a tweet that they said it uses just 2 out of 8 experts simultaneously, so it probably won't require 300+ GB of VRAM, just 75 or so. Also, if it's quantized to a lower size, it might even fit in 24 GB.

I think the whole model still needs to be loaded into memory, even if routing at inference only uses 2 of the 8 experts for a forward pass.
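
A toy sketch of why (top-2 routing in miniature, not grok-1's actual code): the router can send each token to any pair of experts, so every expert's weights have to be resident even though only two are used per token.

    import jax
    import jax.numpy as jnp

    # Toy top-2 gating: each token picks 2 of 8 experts, but which 2 varies
    # per token, so all 8 expert weight sets must stay loaded.
    num_experts, num_selected = 8, 2
    tokens = jax.random.normal(jax.random.PRNGKey(0), (5, 16))   # 5 toy tokens
    router = jax.random.normal(jax.random.PRNGKey(1), (16, num_experts))

    logits = jnp.matmul(tokens, router)                 # (5, 8) routing scores
    _, expert_ids = jax.lax.top_k(logits, num_selected)
    print(expert_ids)  # different expert pairs per token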

davidearlyoung avatar Mar 18 '24 16:03 davidearlyoung

I gave up

Same, it's unrealistic in a home environment... and even if we do run it, it's useless, since it isn't fine-tuned or anything.

yarodevuci avatar Mar 18 '24 17:03 yarodevuci

Grok uses jaxlib. There are no Windows wheels for JAX + CUDA, so it's better to stick with Linux, unless you want to compile JAX manually yourself.

StavrosD avatar Mar 18 '24 21:03 StavrosD

Grok uses jaxlib. There are no Windows wheels for JAX + CUDA, so it's better to stick with Linux, unless you want to compile JAX manually yourself.

I was curious about how JAX plays with everything else in the open-source community. My knowledge of JAX in particular is limited, so it's good to know this little bit.

I've been watching things unfold here out of curiosity and amusement. Might as well, since I can't run it. I knew that would be the case the moment I saw the parameter count in the announcement.

One side note/thought from me on all this: I was a bit surprised to see JAX being used for Grok-1. I'd assumed it would likely be in a format/framework that works with PyTorch. I think many may have better luck running Grok once it's ported to PyTorch, which is a more familiar ecosystem for the open community.

davidearlyoung avatar Mar 19 '24 00:03 davidearlyoung

Nice to meet you all. Have you used it successfully?

youngmmmqing avatar Mar 19 '24 01:03 youngmmmqing

I was a bit surprised to see JAX being used for Grok-1.

If you're going to rebuild everything from scratch, JAX is an excellent choice. JAX was made to be fast and to scale across large distributed compute. PyTorch has only recently been catching up, with torch.compile() and functorch.
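
As a tiny illustration (toy example, nothing to do with grok-1 itself) of what JAX makes cheap: compiling a function and replicating it across every local accelerator is a single decorator.

    import jax
    import jax.numpy as jnp

    # Compile once, then run one copy per local device
    # (1 on a laptop CPU, 8 on an A100/H100 node).
    @jax.pmap
    def scaled_matmul(x, y):
        return jnp.matmul(x, y) * 0.5

    n = jax.local_device_count()
    x = jnp.ones((n, 4, 4))
    y = jnp.ones((n, 4, 4))
    print(scaled_matmul(x, y).shape)  # (n, 4, 4): one shard per device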

Qu3tzal avatar Mar 19 '24 03:03 Qu3tzal

I was a bit surprised to see JAX being used for Grok-1.

If you're going to rebuild everything from scratch, JAX is an excellent choice. JAX was made to be fast and to scale across large distributed compute. PyTorch has only recently been catching up, with torch.compile() and functorch.

Those are interesting points. Thanks for sharing.

It sounds like someone may have Grok running, as-is from this repo, on multiple A100s (https://github.com/xai-org/grok-1/discussions/168). The output rate looks a bit slow, but that's likely to improve as the model becomes better understood publicly.

davidearlyoung avatar Mar 19 '24 23:03 davidearlyoung

@yarodevuci

You can enable the Windows Subsystem for Linux and use WSL to run it. It's Linux inside Windows; you can do everything you want, and CUDA can use the GPUs installed in Windows.

I just installed the requirements successfully, but I cannot load the model because I don't have enough VRAM (just 2x 24 GB). When I run "python run.py" I get the error "ValueError: Number of devices 2 must equal the product of mesh_shape (1, 8)", so I assume that by default you need 8 GPUs to load the model.
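
For anyone else hitting that: the check is just that the product of the mesh shape has to equal the number of accelerators JAX can see, and the default config apparently builds a (1, 8) mesh, i.e. it expects 8 devices. A quick way to see what JAX thinks you have (assuming JAX is already installed for your backend):

    import jax

    # "Number of devices 2 must equal the product of mesh_shape (1, 8)" means
    # the 1 x 8 device mesh being built needs exactly 8 visible accelerators.
    print(jax.devices())             # which GPUs/TPUs JAX can see
    print(jax.local_device_count())  # must be 8 for a (1, 8) mesh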

StavrosD avatar Mar 20 '24 17:03 StavrosD