
inference error

Open jiangix-paper opened this issue 1 year ago • 4 comments

Hello, when I run "python run.py" (with jaxlib-0.4.25 and CUDA 11.8), I get this error:

File "/grok/grok-1/grok-1-main/runners.py", line 597, in sample_from_model
    next(server)
File "/grok/grok-1/grok-1-main/runners.py", line 481, in run
    rngs, last_output, memory, settings = self.prefill_memory

jaxlib.xla_extension.XlaRuntimeError: INTERNAL: external/xla/xla/service/gpu/nccl_api.cc:395: NCCL operation ncclCommInitRankConfig(&comm_handle, nranks, AsNcclUniqueId(clique_id), ranks[i].rank, &comm_config) failed: invalid argument. Last NCCL warning(error) log entry (may be unrelated) 'Invalid config blocking attribute value -2147483648'.: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).

For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.

jiangix-paper, Mar 19 '24 02:03

Try this:

pip install -U "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
pip install -r requirements.txt
python run.py

Info From: https://github.com/xai-org/grok-1/pull/170#issuecomment-2005687344 https://github.com/xai-org/grok-1/issues/24#issuecomment-2002687351

Reasoning: you have jaxlib-0.4.25 built for CUDA 11.8, but the required version is jaxlib==0.4.25 built for CUDA 12.
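
To confirm which CUDA build of jaxlib is actually installed before reinstalling, a quick check like the one below may help. It is only a sketch: it assumes the installed wheel carries a local version tag such as "0.4.25+cuda11.cudnn86" or "0.4.25+cuda12.cudnn89" (how the CUDA wheels on the jax_cuda_releases page were labeled around this release, as far as I know), and the helper names are made up for illustration. If the tag is missing, the script says so rather than guessing.

```python
import re
from importlib import metadata


def jaxlib_cuda_major(version_string):
    """Return the CUDA major version embedded in a jaxlib version string.

    Assumes a local version tag of the form '+cuda12.cudnn89'.
    Returns None when no such tag is present (e.g. a CPU-only wheel).
    """
    match = re.search(r"\+cuda(\d+)", version_string)
    return int(match.group(1)) if match else None


def installed_jaxlib_version():
    """Return the installed jaxlib version string, or None if not installed."""
    try:
        return metadata.version("jaxlib")
    except metadata.PackageNotFoundError:
        return None


def report(required_cuda_major=12):
    """Print whether the installed jaxlib looks like a CUDA 12 build."""
    version = installed_jaxlib_version()
    if version is None:
        print("jaxlib is not installed in this environment")
        return
    cuda_major = jaxlib_cuda_major(version)
    if cuda_major is None:
        print(f"jaxlib {version}: no CUDA tag found in the version string")
    elif cuda_major == required_cuda_major:
        print(f"jaxlib {version}: built for CUDA {cuda_major}, as required")
    else:
        print(
            f"jaxlib {version}: built for CUDA {cuda_major}, "
            f"but CUDA {required_cuda_major} is required"
        )


if __name__ == "__main__":
    report()
```

If this reports a CUDA 11 build, reinstalling with the pip command above should replace it with the CUDA 12 wheel.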

trholding, Mar 19 '24 04:03

I have the same problem. My versions are CUDA 11.8 and jaxlib-0.4.25. Does it have to be CUDA 12? The requirements for reproducing this are too strict!

chenyzh28, Mar 21 '24 03:03