Grok deployment on Akash Network
Grok on Akash Network
Grok repository: https://github.com/xai-org/grok-1
This deployment uses 4 CPUs and 8 GPUs (H100s). Trying to use only 1 GPU will result in an error. Currently, this deployment requires /dev/shm
to be enabled by the provider, or this error will occur:
OSError: [Errno 28] No space left on device: './checkpoints/ckpt-0/tensor00000_000' -> '/dev/shm/tmp238nenvh'
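A quick way to see how much shared memory the provider actually gives the container (a minimal check, run from a shell inside the deployment):
df -h /dev/shm    # in a default Kubernetes pod this shows only 64M, far too small for the checkpoints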
Some modifications:
- Uses jax[cuda12_pip]==0.4.23 instead of jax[cuda12_pip]==0.4.25
- Model weights are downloaded from Hugging Face instead of via torrent, for a faster download
I'm testing this out
/dev/shm is part of the checkpoint.py code. I think it should be pretty straightforward to sed /dev/shm for /root/fake_shm in the SDL:
https://github.com/xai-org/grok-1/blob/e50578b5f50e4c10c6e7cff31af1ef2bedb3beb8/checkpoint.py#L43-L49
Alternatively, one can probably try mounting a persistent volume over the /dev/shm directory.
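For reference, a rough sketch of what mounting a persistent volume over /dev/shm could look like in the SDL (the volume name shm, the 100Gi size, and the beta3 storage class are assumptions; as noted further down in this thread, this approach turned out not to work because NCCL needs real shared memory):
services:
  app:
    params:
      storage:
        shm:
          mount: /dev/shm
profiles:
  compute:
    app:
      resources:
        storage:
          - name: shm
            size: 100Gi
            attributes:
              persistent: true
              class: beta3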
Added a workaround for /dev/shm (=> /root/shm):
mkdir /root/shm;
sed -i "s;/dev/shm/;/root/shm/;g" /grok-1/checkpoint.py;
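To verify the substitution before starting run.py, something like this can be appended to the startup commands:
grep -n "/root/shm" /grok-1/checkpoint.py    # the rewritten temp-file paths should show up here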
[WIP] testing updated SDL right now [WIP]
---
version: "2.0"
services:
  app:
    image: nvidia/cuda:12.3.2-cudnn9-runtime-ubuntu22.04
    expose:
      - port: 8080
        as: 80
        proto: tcp
        to:
          - global: true
    command:
      - bash
      - "-c"
    args:
      - >-
        apt-get update ; apt-get upgrade -y ;
        apt-get install pip wget git -y;
        pip install dm_haiku==0.0.12;
        pip install "jax[cuda12_pip]==0.4.25" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html;
        pip install numpy==1.26.4;
        pip install sentencepiece==0.2.0;
        pip install -U "huggingface_hub[cli]";
        git clone https://github.com/xai-org/grok-1;
        wget https://github.com/yudai/gotty/releases/download/v2.0.0-alpha.3/gotty_2.0.0-alpha.3_linux_amd64.tar.gz;
        tar -zxvf gotty_2.0.0-alpha.3_linux_amd64.tar.gz ; chmod +x gotty ; rm -rf gotty_2.0.0-alpha.3_linux_amd64.tar.gz ; mv gotty /usr/local/bin/;
        huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/tensor* --local-dir /grok-1/checkpoints --local-dir-use-symlinks False;
        mv /grok-1/checkpoints/ckpt /grok-1/checkpoints/ckpt-0;
        mkdir /root/shm;
        sed -i "s;/dev/shm/;/root/shm/;g" /grok-1/checkpoint.py;
        cd /grok-1 && gotty -w python3 ./run.py;
        sleep infinity
profiles:
  compute:
    app:
      resources:
        cpu:
          units: 58
        memory:
          size: 1280Gi
        storage:
          size: 1024Gi
        gpu:
          units: 8
          attributes:
            vendor:
              nvidia:
                - model: h100
  placement:
    akash:
      pricing:
        app:
          denom: uakt
          amount: 10000000
deployment:
  app:
    akash:
      profile: app
      count: 1
Update 1:
- It appears it needs RAM of [amount of GPU VRAM] x [GPU count], which makes it 80 x 8 = 640 GiB of RAM at least with 8x H100s.
- It requires at least 300+ GiB of disk space for /grok-1 (including the checkpoints) and at least 15 GiB under /root for the Hugging Face cache.
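A quick sanity check of those requirements from inside the container (simple, generic commands):
free -g    # total RAM in GiB; at least ~640 with 8x H100
df -h /    # needs room for /grok-1 plus checkpoints (300+ GiB) and the Hugging Face cache under /root (~15 GiB)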
Would it make more sense to replace ; with && in args, to make the entrypoint fail when a step fails?
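For illustration, the start of args would look like this with && chaining (only the first few steps shown; the remaining steps would be chained the same way):
args:
  - >-
    apt-get update &&
    apt-get upgrade -y &&
    apt-get install pip wget git -y &&
    pip install dm_haiku==0.0.12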
Could you also add it to the root README under AI - GPU, like this, please? It will make it importable automatically in Cloudmos/Akash Console.
- Grok
> Would it make more sense to replace ; with && in args, to make the entrypoint fail when a step fails?
Thanks! This is just a PoC, so the code is going to look very dirty; the goal is to make it run first ;) As I am testing it right now, it has failed multiple times, either due to a networking/DNS issue or because Hugging Face was unable to serve the checkpoints for some period of time.
Looks like it's a no-go without a proper /dev/shm mounted as tmpfs (i.e. mounting a persistent volume won't do) :/
E0318 16:47:55.478626 4452 pjrt_stream_executor_client.cc:2804] Execution of replica 0 failed: INTERNAL: external/xla/xla/service/gpu/nccl_api.cc:501: NCCL operation ncclGroupEnd() failed: unhandled system error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Error while creating shared memory segment /dev/shm/nccl-qp7mSW (size 9637888)'.
jax.errors.SimplifiedTraceback: For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/grok-1/./run.py", line 72, in <module>
main()
File "/grok-1/./run.py", line 67, in main
print(f"Output for prompt: {inp}", sample_from_model(gen, inp, max_len=100, temperature=0.01))
File "/grok-1/runners.py", line 597, in sample_from_model
next(server)
File "/grok-1/runners.py", line 481, in run
rngs, last_output, memory, settings = self.prefill_memory(
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: external/xla/xla/service/gpu/nccl_api.cc:501: NCCL operation ncclGroupEnd() failed: unhandled system error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Error while creating shared memory segment /dev/shm/nccl-qp7mSW (size 9637888)'.: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).
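As the error message itself suggests, rerunning with NCCL debug logging enabled shows more detail about the failed shared-memory segment, e.g. from inside the pod:
cd /grok-1 && NCCL_DEBUG=INFO python3 ./run.py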
Update 3
It seems to be getting past the /dev/shm-related error when I update the deployment on the host to support a larger /dev/shm.
Not sure if that's the final phase, but it "hangs" at the point shown in the screenshot (uses 800% CPU, i.e. about 8 busy threads):
FWIW: it needed about 50+ GiB of /dev/shm, and /dev/shm had to be actual shared memory (tmpfs), not just a filesystem mounted over /dev/shm.
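One way to confirm /dev/shm really is shared memory (tmpfs) of sufficient size, rather than an ordinary filesystem mounted over it:
findmnt -T /dev/shm    # FSTYPE should be tmpfs (shm); SIZE should be 50+ GiB in this case
df -h /dev/shm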
Explanation on /dev/shm
In Kubernetes, the default size of /dev/shm is set to 64MiB, and currently there's no direct support to change this.
As a workaround, Kubernetes users can set Memory as the medium for mounting /dev/shm from the host. Additionally, they should adjust the sizeLimit to 50% of the Pod's requested memory limit (the memory directive in Akash's SDL, or resources.limits.memory in Kubernetes).
However, this workaround isn't applicable in Akash (yet!), as it doesn't yet support custom storage settings (emptyDir -> medium: Memory).
In this case, only Akash Providers have the ability to manually adjust these settings. They can do so if they have access to specific deployment details like the correct Deployment Sequence (DSEQ) and the owner's address.
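Roughly, a provider with kubectl access could locate the lease's namespace via the DSEQ/owner and then add the emptyDir volume from the Example further below to the workload (a sketch only; the exact labels and workload name depend on the provider setup):
kubectl get ns --show-labels | grep <DSEQ>          # find the namespace created for this lease
kubectl -n <namespace> edit deployment <service>    # add the volumes/volumeMounts shown in the Example below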
There are also certain drawbacks to using the emptyDir, medium: Memory workaround; see "Disadvantages of using empty dir" in https://www.sobyte.net/post/2022-04/k8s-pod-shared-memory/ for more details.
Refs.
- https://github.com/kubernetes/kubernetes/issues/28272#issuecomment-1001441171
- https://www.sobyte.net/post/2022-04/k8s-pod-shared-memory/
Example
volumeMounts:
  - mountPath: /dev/shm
    name: shm
volumes:
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: 512Mi # if the Pod requests 1024 MiB of memory; i.e. `sizeLimit` should be adjusted to 50% of the Pod's requested memory limit (the `memory` directive in Akash's SDL or `resources.limits.memory` in Kubernetes)
In Grok's case I've used a 640Gi sizeLimit because my deployment requested 1280Gi of RAM in the SDL.
Update 4 - python processes exited eventually
Not much in the logs:
I can quickly restart the process now, as I am in the pod:
pkill gotty
cd /grok-1
gotty -w python3 ./run.py
Opened an issue there seeking help: https://github.com/xai-org/grok-1/issues/164
Upd1: so it segfaults: https://github.com/xai-org/grok-1/issues/164#issuecomment-2004922821
The PyTorch-based version (~590 GiB) is working on Akash. No need to tweak /dev/shm at all.
Details here: https://github.com/xai-org/grok-1/issues/164#issuecomment-2015507877
How-to here: https://github.com/xai-org/grok-1/issues/274#issuecomment-2015415953
SDL I've used:
Make sure to change the SSH public key to yours. You can also probably reduce the CPU down to 8 CPUs and the RAM down to 32 or 64 GiB (I've seen it spiking only up to about 26 GiB with the PyTorch version); see the sketch after the SDL below.
---
version: "2.0"
services:
  grok-1:
    image: ubuntu:22.04
    env:
      - 'SSH_PUBKEY=ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCyJDv8e1KMytZI+tQTxvEqrAm5TTvNx8E3VM499Yh1vU13F11z5FabgDiYb4n6hIY2tfTf1Wi+6wwd7/xO0cmIaQ9lRXftbR8Bx9sw+tc9oomRulZZx8pxKYFp7m7ETwtPlR4GY7dHboxKu+6yxaBsTyXu4GkSAW/Q9fN3BLZnZavQMQiUPtJ2w65dIScx/OrxY2Ua203wYzTqy2tKGnz9iGK2RZusb/1/JmSoqVRKMuynAp9iB99TL2uqUbQzTqsqRtoplA6DyiFGRkv1cUKNHZFucnmFEEqgwg56tCg+6KC84e3RTOaKh+hWcms3ossJCG1N4n4D6MKLx2zcnjakLDUKwCXH4FsTzv/CMygH2YEEdGlgSQMLkABqyl6J3j0yEOa+F7y+Tqq9wllipGw/SlPf2wLnpN2V6vR/ZVVRXLuWKZ1Crg7y/pYLID5GOwr8Qg/PhOQyfjJCQE0HK/9aKsqPZ4wze0Hp66P3q1LL1d7S221DodYE6PJfnVcogp8= andrey@stealth'
    command:
      - "sh"
      - "-c"
    args:
      - 'apt-get update;
        apt-get install -y --no-install-recommends -- ssh speedtest-cli netcat-openbsd curl wget ca-certificates jq less iproute2 iputils-ping vim bind9-dnsutils nginx;
        mkdir -p -m0755 /run/sshd;
        mkdir -m700 ~/.ssh;
        echo "$SSH_PUBKEY" | tee ~/.ssh/authorized_keys;
        chmod 0600 ~/.ssh/authorized_keys;
        ls -lad ~ ~/.ssh ~/.ssh/authorized_keys;
        md5sum ~/.ssh/authorized_keys;
        exec /usr/sbin/sshd -D'
    expose:
      - port: 8080
        as: 80
        to:
          - global: true
      - port: 22
        as: 22
        to:
          - global: true
profiles:
  compute:
    grok-1:
      resources:
        cpu:
          units: 128
        memory:
          size: 1280Gi
        storage:
          size: 2048Gi
        gpu:
          units: 8
          attributes:
            vendor:
              nvidia:
                - model: h100
  placement:
    akash:
      attributes:
        #host: akash
        #organization: overclock
      pricing:
        grok-1:
          denom: uakt
          amount: 1000000
deployment:
  grok-1:
    akash:
      profile: grok-1
      count: 1
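A sketch of the reduced compute profile mentioned above, if the lower CPU/RAM numbers pan out (untested values; GPU and storage unchanged):
profiles:
  compute:
    grok-1:
      resources:
        cpu:
          units: 8
        memory:
          size: 64Gi
        storage:
          size: 2048Gi
        gpu:
          units: 8
          attributes:
            vendor:
              nvidia:
                - model: h100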
That's good. I will also try using this: https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/grok-1
I have updated the SDL but haven't tested it yet, because I can't get any bids when trying to deploy it; I will try later. I have also uploaded the Dockerfile.
Here is the SDL:
---
version: "2.0"
services:
  app:
    image: cvpfus/grok-akash:0.6
    expose:
      - port: 8080
        as: 80
        proto: tcp
        to:
          - global: true
profiles:
  compute:
    app:
      resources:
        cpu:
          units: 64
        memory:
          size: 640Gi
        storage:
          size: 2048Gi
        gpu:
          units: 8
          attributes:
            vendor:
              nvidia:
                - model: h100
  placement:
    akash:
      attributes:
        host: akash
      pricing:
        app:
          denom: uakt
          amount: 1000000
deployment:
  app:
    akash:
      profile: app
      count: 1
I'm testing this out
Hi, just to confirm that my Twitter handle is cvpfus_id. This is my Akash wallet. Thanks!
akash19qhrxhz275t9trslwsp95nz33ry6tlgt8lpgwk
Update
Here is the most recent SDL I used; sometimes it works and sometimes it doesn't (it gets stuck when loading the model). According to this, loading the model takes twice the size of the model in RAM. The model size is about 590 GB, so maybe increasing the RAM to about 1536Gi would solve that (not tried yet, because I'm not getting bids when deploying on Akash; I will try when it's back to normal). See the fragment after the SDL below.
---
version: "2.0"
services:
  app:
    image: cvpfus/grok-akash:0.19
    env:
      - MAX_NEW_TOKENS=100
    expose:
      - port: 8080
        as: 80
        proto: tcp
        to:
          - global: true
profiles:
  compute:
    app:
      resources:
        cpu:
          units: 64
        memory:
          size: 640Gi
        storage:
          size: 2048Gi
        gpu:
          units: 8
          attributes:
            vendor:
              nvidia:
                - model: h100
  placement:
    akash:
      attributes:
        host: akash
      pricing:
        app:
          denom: uakt
          amount: 1000000
deployment:
  app:
    akash:
      profile: app
      count: 1
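If increasing the RAM as discussed above, only the memory block of the compute profile would change (untested):
memory:
  size: 1536Gi    # the increase suggested above; not yet tested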
I recorded it when it worked:
https://github.com/akash-network/awesome-akash/assets/47532266/faae1fcf-389d-4275-978e-900dad3200af
Please do not use this image (or any xai-org grok-1 image) on H100s!
It still locks up the latest NVIDIA driver (550.54.15), which then forces us to reboot these nodes.
Details: https://github.com/xai-org/grok-1/issues/164#issuecomment-2022572399
> I'm testing this out
> Hi, just to confirm that my Twitter handle is cvpfus_id. This is my Akash wallet. Thanks!
> akash19qhrxhz275t9trslwsp95nz33ry6tlgt8lpgwk
Thank you @yusufpraditya and congrats! Here's your 1,000 AKT