grok-1
Segmentation fault in K8s Pod (8x H100s)
Hi, I am trying to run it, but the python3 ./run.py process eventually exits after it's been running for about 10 minutes at 800% CPU usage.
I am running it in a K8s pod (with a /dev/shm of 640Gi; 58 CPU threads [AMD EPYC 9554]; 1280 GiB RAM) with 8x H100 GPUs.
There isn't much in the logs.
I can quickly restart the process now, as I am in the pod:

```bash
pkill gotty
cd /grok-1
gotty -w python3 ./run.py
```
Ideas?
Commands used to deploy it
```bash
apt-get update ; apt-get upgrade -y ;
apt-get install pip wget git -y;
pip install dm_haiku==0.0.12;
pip install jax[cuda12_pip]==0.4.25 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html;
pip install numpy==1.26.4;
pip install sentencepiece==0.2.0;
pip install -U "huggingface_hub[cli]";
git clone https://github.com/xai-org/grok-1;
wget https://github.com/yudai/gotty/releases/download/v2.0.0-alpha.3/gotty_2.0.0-alpha.3_linux_amd64.tar.gz;
tar -zxvf gotty_2.0.0-alpha.3_linux_amd64.tar.gz ; chmod +x gotty ; rm -rf gotty_2.0.0-alpha.3_linux_amd64.tar.gz ; mv gotty /usr/local/bin/;
huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/tensor* --local-dir /grok-1/checkpoints --local-dir-use-symlinks False;
mv /grok-1/checkpoints/ckpt /grok-1/checkpoints/ckpt-0;
mkdir /root/shm;
sed -i "s;/dev/shm/;/root/shm/;g" /grok-1/checkpoint.py;
cd /grok-1 && gotty -w python3 ./run.py;
```
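For reference, a quick sanity check after the sed step (not part of the original commands), to confirm the checkpoint path was rewritten and that there's space where it now points:

```bash
grep -n "/root/shm/" /grok-1/checkpoint.py  # should show the rewritten path(s)
df -h / /dev/shm                            # enough free space on both?
```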
Update 1
I'm trying to run it directly with python3 ./run.py (without gotty right now).
I can't understand any of this.
Did you really sign up on GitHub just to write this?
Just an observation, and I'm probably stating the obvious here, but you should really deploy from the requirements.txt, because it will get updated.
And your link with the traceback is super obvious; it's in the last line: "No space left on device". The disk is full.
But reading it again, it looks like something to do with /dev/shm.
> but you should really deploy from the requirements.txt because it will get updated.
Thanks, I am well aware of that and did use it eventually. But I had to use the commands someone else pasted, as I'd been asked to review them first; hence those pip installs.
# pip install -r requirements.txt
Requirement already satisfied: dm_haiku==0.0.12 in /usr/local/lib/python3.12/site-packages (from -r requirements.txt (line 1)) (0.0.12)
Requirement already satisfied: jax==0.4.25 in /usr/local/lib/python3.12/site-packages (from -r requirements.txt (line 2)) (0.4.25)
Requirement already satisfied: numpy==1.26.4 in /usr/local/lib/python3.12/site-packages (from -r requirements.txt (line 3)) (1.26.4)
Requirement already satisfied: sentencepiece==0.2.0 in /usr/local/lib/python3.12/site-packages (from -r requirements.txt (line 4)) (0.2.0)
Requirement already satisfied: absl-py>=0.7.1 in /usr/local/lib/python3.12/site-packages (from dm_haiku==0.0.12->-r requirements.txt (line 1)) (2.1.0)
Requirement already satisfied: jmp>=0.0.2 in /usr/local/lib/python3.12/site-packages (from dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.0.4)
Requirement already satisfied: tabulate>=0.8.9 in /usr/local/lib/python3.12/site-packages (from dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.9.0)
Requirement already satisfied: flax>=0.7.1 in /usr/local/lib/python3.12/site-packages (from dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.8.2)
Requirement already satisfied: ml-dtypes>=0.2.0 in /usr/local/lib/python3.12/site-packages (from jax==0.4.25->-r requirements.txt (line 2)) (0.3.2)
Requirement already satisfied: opt-einsum in /usr/local/lib/python3.12/site-packages (from jax==0.4.25->-r requirements.txt (line 2)) (3.3.0)
Requirement already satisfied: scipy>=1.9 in /usr/local/lib/python3.12/site-packages (from jax==0.4.25->-r requirements.txt (line 2)) (1.12.0)
Requirement already satisfied: msgpack in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (1.0.8)
Requirement already satisfied: optax in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.2.1)
Requirement already satisfied: orbax-checkpoint in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.5.6)
Requirement already satisfied: tensorstore in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.1.56)
Requirement already satisfied: rich>=11.1 in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (13.7.1)
Requirement already satisfied: typing-extensions>=4.2 in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (4.10.0)
Requirement already satisfied: PyYAML>=5.4.1 in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (6.0.1)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.12/site-packages (from rich>=11.1->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/site-packages (from rich>=11.1->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (2.17.2)
Requirement already satisfied: chex>=0.1.7 in /usr/local/lib/python3.12/site-packages (from optax->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.1.85)
Requirement already satisfied: jaxlib>=0.1.37 in /root/.local/lib/python3.12/site-packages (from optax->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.4.25+cuda12.cudnn89)
Requirement already satisfied: etils[epath,epy] in /usr/local/lib/python3.12/site-packages (from orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (1.7.0)
Requirement already satisfied: nest_asyncio in /usr/local/lib/python3.12/site-packages (from orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (1.6.0)
Requirement already satisfied: protobuf in /usr/local/lib/python3.12/site-packages (from orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (5.26.0)
Requirement already satisfied: toolz>=0.9.0 in /usr/local/lib/python3.12/site-packages (from chex>=0.1.7->optax->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.12.1)
Requirement already satisfied: setuptools in /usr/local/lib/python3.12/site-packages (from chex>=0.1.7->optax->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (69.1.1)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.12/site-packages (from markdown-it-py>=2.2.0->rich>=11.1->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.1.2)
Requirement already satisfied: fsspec in /usr/local/lib/python3.12/site-packages (from etils[epath,epy]->orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (2024.3.0)
Requirement already satisfied: importlib_resources in /usr/local/lib/python3.12/site-packages (from etils[epath,epy]->orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (6.3.1)
Requirement already satisfied: zipp in /usr/local/lib/python3.12/site-packages (from etils[epath,epy]->orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (3.18.1)
> And your link with the traceback is super obvious, it's in the last line: "No space left on device".
The / disk has 1 TiB of space and only about 300 GiB are used; /dev/shm is set to 640 GiB (tmpfs).
Also, I'm not sure where you found the "link with the traceback" you referred to?
Sorry, I just realised the other issue was actually from a linked repository. It just showed up on this PR, but it's a different repo.
Segfault
I've tried it again, and the behavior is the same: it segfaults :/
I guess it's a similar issue to https://github.com/xai-org/grok-1/issues/152 now.
Additional info
root@grok-1-596d68d5c7-5cq9f:/app# nvidia-smi -L
GPU 0: NVIDIA H100 PCIe (UUID: GPU-50f0ee14-b7a1-f0af-616a-f3bb0825ee7d)
GPU 1: NVIDIA H100 PCIe (UUID: GPU-17201481-5148-0983-539d-10ff0e2cf07f)
GPU 2: NVIDIA H100 PCIe (UUID: GPU-ce315b98-20ff-34fd-307b-fe05646f5913)
GPU 3: NVIDIA H100 PCIe (UUID: GPU-3c330414-9d82-ef1b-65c1-3dad9f294dd1)
GPU 4: NVIDIA H100 PCIe (UUID: GPU-81c9e219-4831-4d68-ccef-badb7f2bc599)
GPU 5: NVIDIA H100 PCIe (UUID: GPU-102d94be-31e5-5809-da4b-a1eeb5fee45b)
GPU 6: NVIDIA H100 PCIe (UUID: GPU-bc4095a5-f436-2dec-af84-44fe954a7e6c)
GPU 7: NVIDIA H100 PCIe (UUID: GPU-b6b73324-7d54-f3c3-a4d9-27fb98f564e9)
root@grok-1-596d68d5c7-5cq9f:/app# nvidia-smi
Mon Mar 18 20:28:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 PCIe Off | 00000000:00:05.0 Off | 0 |
| N/A 33C P0 48W / 350W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 PCIe Off | 00000000:00:06.0 Off | 0 |
| N/A 31C P0 46W / 350W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 PCIe Off | 00000000:00:07.0 Off | 0 |
| N/A 40C P0 51W / 350W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 PCIe Off | 00000000:00:08.0 Off | 0 |
| N/A 34C P0 47W / 350W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 PCIe Off | 00000000:00:09.0 Off | 0 |
| N/A 31C P0 47W / 350W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 PCIe Off | 00000000:00:0A.0 Off | 0 |
| N/A 36C P0 47W / 350W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 PCIe Off | 00000000:00:0B.0 Off | 0 |
| N/A 30C P0 47W / 350W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 PCIe Off | 00000000:00:0C.0 Off | 0 |
| N/A 30C P0 49W / 350W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
root@grok-1-596d68d5c7-5cq9f:/app#
root@grok-1-596d68d5c7-5cq9f:/app# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.5.0-21-generic root=UUID=74dd9370-9caa-470b-a711-0d385161522f ro console=tty1 console=ttyS0
root@grok-1-596d68d5c7-5cq9f:/app# uname -a
Linux grok-1-596d68d5c7-5cq9f 6.5.0-21-generic #21~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb 9 13:32:52 UTC 2 x86_64 GNU/Linux
root@grok-1-596d68d5c7-5cq9f:/app#
wow, so even people who have the right hardware can't run this normally? haha, that is a FAIL!
Zblocker64 (on Discord) suggested increasing the stack size (ulimit -s), which is set to 8192 by default. I'll try doubling it to 16384 to see if that helps with Python's segmentation fault.
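Roughly what I'm running, for anyone following along (ulimit -s takes the size in KiB and only affects the current shell and its children):

```bash
ulimit -s          # current limit: 8192 KiB = 8 MiB by default
ulimit -s 16384    # double it to 16 MiB for this shell
cd /grok-1 && python3 ./run.py
```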
FWIW: on Windows, the default stack size is 1 MB for 32-bit applications and 4 MB for 64-bit applications. macOS typically defaults to 8 MB (though this depends on the macOS version and how the app was compiled).
> I'll try doubling it to 16384 to see if that helps with Python's segmentation fault.
Doubling the stack size limit to 16384 didn't fix the issue.
On the contrary, python run.py stopped writing any output and locked up immediately; I could not kill it.
It appears the issues with running grok-1 arise mostly when the overlay FS is used in the Pod (the default FS containers use). The issues are:
- a Segmentation fault when running python run.py;
- sometimes the process locks up and cannot be killed without a reboot; nvidia-smi, or anything else that touches the nvidia driver, gets locked up too (we are using the latest official nvidia driver & Linux kernel provided for Ubuntu 22.04); see the kernel-log check below.
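When the driver gets wedged like this, the kernel log usually has more detail. A check along these lines (the grep pattern is just a suggestion) is worth running on the host:

```bash
# Look for nvidia (NVRM/Xid) errors and segfault records in the kernel log
dmesg -T | grep -iE 'NVRM|Xid|segfault' | tail -n 20
```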
However, even with the ext4 FS directly mounted in the Pod, or even when running on the host directly:
- the python run.py output doesn't seem to be complete, as you can see in the screenshots & recordings below (despite the exit code being 0).
1. On the host directly:
2. In a Pod (image: ubuntu:22.04), grok-1 mounted over the ext4 FS:
The /root/grok-1 directory was mounted directly from the host (ext4 FS) instead of the overlay FS (!). I'm going to test with overlayfs, as I have a hunch it might be the cause of the issues :bulb:
```yaml
volumeMounts:
  - mountPath: /root/grok-1
    name: grok-volume
  - mountPath: /dev/shm
    name: shm
volumes:
  - name: grok-volume
    hostPath:
      path: /root/grok-1
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: "640Gi"
```
3. In a Pod (image: ubuntu:22.04), overlay FS:
It appears that this issue shows up mostly when the overlay FS is used (a quick way to check which FS is in use is shown after the snippet below).
```yaml
volumeMounts:
  - mountPath: /dev/shm
    name: shm
volumes:
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: "640Gi"
```
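To confirm which filesystem the grok-1 directory actually sits on inside the container (adjust the path to wherever grok-1 lives), something like:

```bash
df -T /root/grok-1           # Type column: 'overlay' vs 'ext4'
stat -f -c %T /root/grok-1   # prints the filesystem type name directly
```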
The next time, I ran gdb python and then (gdb) run run.py; the process locked up again at 100% CPU usage. I could Ctrl+C gdb and the process would be gone.
However, running gdb /root/grok-1/venv/bin/python + (gdb) run run.py, the process locked up again at 100% CPU usage, and this time I could not Ctrl+C it nor kill -9 <PID>; nvidia-smi -L would print the 8 GPUs available on the host, then just hang instead of exiting as normal. Only a host reboot releases the nvidia driver.
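For reference, the same gdb run can be done non-interactively, so a backtrace is printed automatically if it crashes:

```bash
# Runs run.py under gdb; on a segfault, gdb prints a backtrace and exits.
# (If the process hangs instead of crashing, this will hang too.)
gdb -batch -ex run -ex bt --args /root/grok-1/venv/bin/python run.py
```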
Versions
Nvidia driver: 550.54.15. Linux: Ubuntu 22.04.4 LTS with the 6.5.0-26-generic kernel.
We are using the nvidia runtime: https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.14.5
Update 1: I've tried k8s-device-plugin version 0.15.0-rc.2; same issues, except that it doesn't seem to lock the process up. The process can be killed, and nvidia-smi works well, i.e. it isn't locking up anymore. Maybe just luck. Will keep monitoring this.
K8s manifest for the 3rd case
Pod with overlay FS
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-pod
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      runtimeClassName: nvidia
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - "node1"
      containers:
        - name: app
          image: ubuntu:22.04
          command: ["sleep", "infinity"]
          resources:
            requests:
              cpu: "58"
              ephemeral-storage: "1099511627776"
              memory: "1374389534720"
              nvidia.com/gpu: "8"
            limits:
              cpu: "58"
              ephemeral-storage: "1099511627776"
              memory: "1374389534720"
              nvidia.com/gpu: "8"
          volumeMounts:
            - mountPath: /dev/shm
              name: shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "640Gi"
```
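After applying the manifest, the memory-backed /dev/shm can be verified from inside the pod (assuming the manifest was saved as gpu-pod.yaml; names are from the manifest above):

```bash
kubectl apply -f gpu-pod.yaml
kubectl get pods -l app=gpu-app                    # wait for Running
kubectl exec -it deploy/gpu-pod -- df -h /dev/shm  # should show a ~640G tmpfs
```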
2. In a Pod (image: ubuntu:22.04), grok-1 mounted over the ext4 FS:
This time the python process hung, even with the ext4 FS:
and here's the backtrace: https://gist.githubusercontent.com/andy108369/b42f07265928ac11a161165f82ce026d/raw/878f644277180efcc1a681217c7dc58230b67c67/backtrace.md
This model should be called SIGINT, because that's what will happen when you try to run it.
> This model should be called SIGINT, because that's what will happen when you try to run it

I'm just following the original readme. There is no mention of the model needing to be called SIGINT, and why do you think it would need to be interrupted anyway?
~~The first issue is that it exits prematurely, before it finishes printing the complete output. (Even when running directly on the host, not in the K8s container.)~~ Update: figured out that's what max_len is for... increasing it increases the output.
The second issue is that it can't seem to run well in a K8s pod; sometimes it runs, sometimes it won't. And it seems to always fail when the grok-1 (and checkpoints) directory is on the overlay FS.
The user is pulling our leg when they say the model "should be called SIGINT"; they are just making fun of it crashing for them, not adding anything of value to the ticket.
For whoever needs this: the PyTorch version works well in a K8s pod (over the container's overlay FS, and without the /dev/shm requirement); no issues!
How to deploy the PyTorch version is described here: https://github.com/xai-org/grok-1/issues/274#issuecomment-2015415953
It looks like the culprit for the lockups (of the python process using the nvidia GPUs, and of the nvidia-smi CLI) was the nvidia driver.
If you have H100 GPUs and are running with an nvidia driver of version 550.X, make sure you have upgraded it to at least version 550.54.15, which fixes the nvidia driver lockup problem (where a process using the nvidia driver would permanently lock up and the nvidia-smi command would permanently hang until a server reboot).

> Fixed a potential corruption when launching kernels on H100 GPUs, which is more likely to occur when the GPU is shared between multiple processes. This may manifest in XID 13 errors such as Graphics Exception: SKEDCHECK11_TOTAL_THREADS. This issue has no user-controllable workaround and is fixable by updating to driver 550.54.15 or higher. 4537349

Refs. https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-54-15/index.html
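A quick way to confirm which driver version is actually loaded:

```bash
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1
cat /proc/driver/nvidia/version   # version as reported by the kernel module
```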
Todo
- [x] re-test xai-org's grok-1 in K8s pod (with overlay fs)
In a K8s pod (overlay FS) and with the newest:
- nvidia driver 550.54.15
- linux kernel 6.5.0-26-generic

xai-org/grok-1 still:
- sometimes won't print anything, just uses 200% CPU and 84 GiB of RAM until I Ctrl+C it;
- when it prints stuff (but not the result), it exits with exit code 139, i.e. a Segmentation fault (see the note just below).
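(For the record, exit code 139 from the shell means 128 + 11, i.e. the process was killed by SIGSEGV; a quick way to see it:)

```bash
python3 ./run.py
echo $?   # prints 139 when the process died from SIGSEGV (128 + 11)
```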
At least the newest nvidia drivers (550.54.15) don't crash/lock up the processes.
Until this is fixed, I suggest using the PyTorch-based grok-1 version as described here: https://github.com/xai-org/grok-1/issues/274#issuecomment-2015415953
> At least the newest nvidia drivers (550.54.15) don't crash/lock up the processes.

Unfortunately, that's still not the case with xai-org's grok-1 :/ It still crashes the nvidia driver, and only a node reboot fixes this.
Stack trace (different node, but the same problem); this is the PID of the python process (xai-org's grok-1):
```
root@obl-node2:~# cat /proc/1483740/stack
[<0>] uvm_spin_loop+0xf0/0x180 [nvidia_uvm]
[<0>] wait_for_entry_with_spin+0x4d/0x1c0 [nvidia_uvm]
[<0>] uvm_tracker_wait_for_entry+0x94/0xd0 [nvidia_uvm]
[<0>] uvm_push_end_and_wait+0x3e/0x60 [nvidia_uvm]
[<0>] channel_pool_add.constprop.0+0xa29/0x11c0 [nvidia_uvm]
[<0>] uvm_channel_manager_create+0x3c1/0xb50 [nvidia_uvm]
[<0>] uvm_gpu_retain_by_uuid+0xf45/0x2b30 [nvidia_uvm]
[<0>] uvm_va_space_register_gpu+0x4a/0x7f0 [nvidia_uvm]
[<0>] uvm_api_register_gpu+0x77/0xc0 [nvidia_uvm]
[<0>] uvm_ioctl+0xdfb/0x1cd0 [nvidia_uvm]
[<0>] uvm_unlocked_ioctl_entry.part.0+0x7b/0xf0 [nvidia_uvm]
[<0>] uvm_unlocked_ioctl_entry+0x6b/0x90 [nvidia_uvm]
[<0>] __x64_sys_ioctl+0xa3/0xf0
[<0>] do_syscall_64+0x5b/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
```
And here is the stack trace for PID 8013 of the nvidia-device-plugin process, which was kill -9'ed but doesn't disappear:
```
root@obl-node2:~# cat /proc/8013/stack
[<0>] uvm_va_space_destroy+0x482/0x710 [nvidia_uvm]
[<0>] uvm_release.constprop.0+0xa5/0x140 [nvidia_uvm]
[<0>] uvm_release_entry.part.0.isra.0+0x54/0xa0 [nvidia_uvm]
[<0>] uvm_release_entry+0x2e/0x40 [nvidia_uvm]
[<0>] __fput+0xfc/0x2c0
[<0>] ____fput+0xe/0x20
[<0>] task_work_run+0x61/0xa0
[<0>] do_exit+0x2ac/0x6f0
[<0>] do_group_exit+0x35/0x90
[<0>] get_signal+0x8dc/0x940
[<0>] arch_do_signal_or_restart+0x39/0x120
[<0>] exit_to_user_mode_loop+0x9a/0x130
[<0>] exit_to_user_mode_prepare+0xa5/0xb0
[<0>] syscall_exit_to_user_mode+0x29/0x60
[<0>] do_syscall_64+0x67/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
```
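For anyone who wants to collect the same data, a small sketch (run as root on the host; the pgrep pattern is just an example) that dumps the kernel-side stack of each matching process:

```bash
for pid in $(pgrep -f 'run.py'); do
    echo "== PID $pid =="
    cat "/proc/$pid/stack"   # kernel-side stack of a stuck/unkillable process
done
```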