
Allows CPU-based execution

Open louiehelm opened this issue 1 year ago • 5 comments

Adds CPU execution to grok-1 model demo

VERY SLOW!

No one should process real-world workloads this way.

This is only meant for early dev work by those who don't have 8 x 40GB GPUs.

pip install -r requirements-cpu.txt
sed -i 's/USE_CPU_ONLY = False/USE_CPU_ONLY = True/' run.py
python run.py
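
For reference, here's a rough sketch of what the USE_CPU_ONLY switch can boil down to (the flag name comes from this PR, but the mechanism below is my assumption, using standard JAX/XLA environment settings rather than the PR's actual code):

# Sketch (assumption): one way a USE_CPU_ONLY flag in run.py can force the CPU backend.
# JAX_PLATFORMS and the XLA flag are standard JAX/XLA settings; they must be set before
# jax is imported anywhere in the process.
import os

USE_CPU_ONLY = True  # flipped by the sed one-liner above

if USE_CPU_ONLY:
    os.environ["JAX_PLATFORMS"] = "cpu"
    # Emulate 8 host devices so the 8-way sharded checkpoint still maps cleanly.
    os.environ["XLA_FLAGS"] = os.environ.get("XLA_FLAGS", "") + " --xla_force_host_platform_device_count=8"

import jax  # imported after the env vars on purpose
print(jax.devices())  # expect 8 CpuDevice entries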

Still requires:

  • 384GB RAM
  • 1.5 minutes to load into memory
  • 1.1 hours to "compile" grok-1 model
  • 4.2 hours to sample first inference request

Even on a 72-core Xeon server, these runtimes can require monk-like patience.

So the point isn't to run this end-to-end all day.

It's for developers with high-memory workstations who would rather get this code running slowly than not at all.

Hopefully someone uses this CPU-only workaround early on to bootstrap grok-1 into a more performant model that eventually becomes accessible to a larger pool of devs.

Note: Executing this on most CPUs will emit a series of false warnings about the 8 CPU sub-processes being "stuck". These messages come from a hardcoded warning inside TensorFlow that doesn't appear to be tunable or suppressible.

Note 2: If memory usage swells too high, comment out the single copy_to_shm line in checkpoint.py, as shown below. This reduces peak memory usage from >600GB to roughly 320GB; the downside is a slightly slower initial load. The copy_to_shm load strategy is likely a good time-to-memory trade-off on xAI's servers, but it may not be on your workstation if it triggers OOM.

def fast_unpickle(path: str) -> Any:
    # with copy_to_shm(path) as tmp_path:
    with open(path, "rb") as f:
        return pickle.load(f)
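
For context, here's a rough, illustrative sketch of what a copy_to_shm helper like this typically does (an assumption; the real implementation in checkpoint.py may differ): it stages the checkpoint file in RAM-backed /dev/shm before unpickling, which speeds up reads but holds an extra copy of the file in memory, hence the higher peak.

# Illustrative sketch only -- not necessarily the actual checkpoint.py implementation.
import contextlib
import os
import shutil
import tempfile
from typing import Iterator

@contextlib.contextmanager
def copy_to_shm(path: str) -> Iterator[str]:
    # Stage the file in /dev/shm (RAM-backed tmpfs) and yield the temporary path.
    # Reads come back faster, but the staged copy adds to peak memory during loading.
    shm_dir = tempfile.mkdtemp(dir="/dev/shm")
    try:
        tmp_path = os.path.join(shm_dir, os.path.basename(path))
        shutil.copyfile(path, tmp_path)
        yield tmp_path
    finally:
        shutil.rmtree(shm_dir, ignore_errors=True)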

louiehelm · Mar 20, 2024

Still requires:

  • 384GB RAM
  • 1.5 minutes to load into memory
  • 1.1 hours to "compile" grok-1 model
  • 4.2 hours to sample first inference request

Could you add your system specs here?

I'll add it to: https://github.com/xai-org/grok-1/issues/42 and https://github.com/xai-org/grok-1/discussions/183

trholding · Mar 24, 2024

Could you add your system specs here?

CPU: 2 x Intel Xeon E5-2697 v4, RAM: 1.5TB

louiehelm · Mar 24, 2024

I'm not sure why I got this error?

INFO:rank:(1, 256, 6144)
INFO:rank:(1, 256, 131072)
INFO:rank:State sharding type: <class 'model.TrainingState'>
INFO:rank:(1, 256, 6144)
INFO:rank:(1, 256, 131072)
INFO:rank:Loading checkpoint at ./checkpoints/ckpt-0
INFO:rank:(1, 8192, 6144)
INFO:rank:(1, 8192, 131072)
Output for prompt: The answer to life the universe and everything is of course
INFO:runners:Precompile 1024
INFO:rank:(1, 1, 6144)
INFO:rank:(1, 1, 131072)
INFO:runners:Compiling...
INFO:rank:(1, 1, 6144)
INFO:rank:(1, 1, 131072)
jax.errors.SimplifiedTraceback: For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.

jaxlib.xla_extension.XlaRuntimeError: UNIMPLEMENTED: unsupported operand type BF16 in op dot

I'm using a Xeon 5320 + 1TB RAM. I installed the software using requirements-cpu.txt.

inkoil · Apr 3, 2024

I'm not sure why I got this error?

...

jaxlib.xla_extension.XlaRuntimeError: UNIMPLEMENTED: unsupported operand type BF16 in op dot

I'm using a Xeon 5320 + 1TB RAM. I installed the software using requirements-cpu.txt.

I assume you included my changes in run.py too? And changed "USE_CPU_ONLY = False" to "USE_CPU_ONLY = True"?
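
If you want to narrow it down, here's a minimal check you can run outside the full model (a sketch; whether a bf16 dot works on CPU depends on your jaxlib build):

# Minimal sanity check (sketch; behaviour depends on your jaxlib version).
import jax
import jax.numpy as jnp

print(jax.devices())  # with USE_CPU_ONLY = True you should only see CPU devices

x = jnp.ones((4, 4), dtype=jnp.bfloat16)
print(jnp.dot(x, x))  # older CPU builds of jaxlib fail here with the same BF16 "op dot" error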

Hopefully this repository isn't abandoned but it doesn't seem like anyone is maintaining it anymore.

You might be better off running grok-1 in llama.cpp if JAX is crashing for you.

louiehelm · Apr 4, 2024

For all those who read this and are struggling but still want to run this model once, here is an article on how I managed to get it running for less than $10.

If you want to test things, you might be better off using the more expensive GCP version because it offers the possibility of being stopped, and then you only pay for storage.

I hope someone finds it helpful.

Article: https://twitter.com/PascalBauerDE/status/1776792056452546822
Fork: https://github.com/pafend/grok-1-brev

pafend · Apr 7, 2024