distributed-llama
(Crashing on Low Memory SBC) main invoked oom-killer: gfp_mask=0x1100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
Is there any way to separate main and worker, so that I can use a cluster of 8 RPi 3B+ boards for the compute while the scheduling is offloaded to another device with more memory? I understand this is most likely not a priority. Perhaps a smaller model, such as https://github.com/jzhang38/TinyLlama, would fit?
main:
ubuntu@ubuntu:~/distributed-llama$ sudo main chat --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --model ~/dllama_meta-llama-3-8b_q40.bin --tokenizer ~/dllama-llama3-tokenizer.t --workers 192.168.2.212:9998 192.168.2.213:9998 192.168.2.214:9998 192.168.2.215:
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 8
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
Killed
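For a rough sense of scale, the header values printed above can be turned into a weight-size estimate. This is only a sketch: it assumes the standard Llama layer layout (four attention matrices plus a three-matrix feed-forward block), untied embedding and output matrices, and a q40 block format of 18 bytes per 32 weights. None of these assumptions are confirmed by the output above, and the sketch says nothing about how distributed-llama actually divides memory between the root node and the workers.

```python
# Values copied from the header printed by `main` above.
dim, hidden_dim, n_layers = 4096, 14336, 32
n_heads, n_kv_heads, vocab_size, n_slices = 32, 8, 128256, 8

# ASSUMPTION: standard Llama layout (wq, wo: dim x dim; wk, wv: dim x kv_dim;
# feed-forward: three dim x hidden_dim matrices). Norm weights are ignored.
kv_dim = dim * n_kv_heads // n_heads
attn_params = 2 * dim * dim + 2 * dim * kv_dim
ffn_params = 3 * dim * hidden_dim
layer_params = attn_params + ffn_params

# ASSUMPTION: separate (untied) token embedding and output matrices.
embed_params = 2 * vocab_size * dim
total_params = n_layers * layer_params + embed_params

# ASSUMPTION: q40 stores 32 weights in 18 bytes (16 bytes of 4-bit values
# plus a 2-byte scale).
def q40_bytes(params):
    return params * 18 // 32

MIB = 1024 * 1024
all_layers_mib = q40_bytes(n_layers * layer_params) / MIB
# Hypothetical even split of the layer weights across nSlices nodes.
per_slice_mib = all_layers_mib / n_slices

print(f"total parameters:       {total_params:,}")
print(f"layer weights (q40):    {all_layers_mib:.0f} MiB")
print(f"per slice (even split): {per_slice_mib:.0f} MiB")
```

Under these assumptions the layer weights alone come to roughly 468 MiB per slice, before the embedding matrices, the KV cache, and any buffers are counted, which leaves little headroom on a board with about 1 GB of RAM.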
worker:
ubuntu@ubuntu:~$ sudo nice -n -20 main worker --port 9998 --nthreads 4
Listening on 0.0.0.0:9998...
Client connected
terminate called after throwing an instance of 'ReadSocketException'
what(): std::exception
Aborted
May 19 08:46:24 ubuntu kernel: [107061.602328] main invoked oom-killer: gfp_mask=0x1100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
May 19 08:46:24 ubuntu kernel: [107061.602392] CPU: 0 PID: 4676 Comm: main Tainted: G C E 5.15.0-1055-raspi #58-Ubuntu
May 19 08:46:24 ubuntu kernel: [107061.602412] Hardware name: Raspberry Pi 3 Model B Plus Rev 1.3 (DT)
May 19 08:46:24 ubuntu kernel: [107061.602423] Call trace:
May 19 08:46:24 ubuntu kernel: [107061.602430] dump_backtrace+0x0/0x200
May 19 08:46:24 ubuntu kernel: [107061.602455] show_stack+0x20/0x30
May 19 08:46:24 ubuntu kernel: [107061.602470] dump_stack_lvl+0x8c/0xb8
May 19 08:46:24 ubuntu kernel: [107061.602490] dump_stack+0x18/0x34
May 19 08:46:24 ubuntu kernel: [107061.602506] dump_header+0x54/0x21c
May 19 08:46:24 ubuntu kernel: [107061.602520] oom_kill_process+0x22c/0x230
May 19 08:46:24 ubuntu kernel: [107061.602539] out_of_memory+0xf4/0x370
May 19 08:46:24 ubuntu kernel: [107061.602554] __alloc_pages_slowpath.constprop.0+0x604/0x8e0
May 19 08:46:24 ubuntu kernel: [107061.602574] __alloc_pages+0x29c/0x320
May 19 08:46:24 ubuntu kernel: [107061.602590] alloc_zeroed_user_highpage_movable+0x40/0x50
May 19 08:46:24 ubuntu kernel: [107061.602607] do_anonymous_page+0x88/0x4ec
May 19 08:46:24 ubuntu kernel: [107061.602628] handle_pte_fault+0x170/0x1c0
May 19 08:46:24 ubuntu kernel: [107061.602642] __handle_mm_fault+0x1d0/0x350
May 19 08:46:24 ubuntu kernel: [107061.602655] handle_mm_fault+0x108/0x294
May 19 08:46:24 ubuntu kernel: [107061.602669] faultin_page+0x84/0x150
May 19 08:46:24 ubuntu kernel: [107061.602685] __get_user_pages+0x194/0x2c0
May 19 08:46:24 ubuntu kernel: [107061.602701] populate_vma_page_range+0x64/0x70
May 19 08:46:24 ubuntu kernel: [107061.602719] __mm_populate+0xc4/0x1d0
May 19 08:46:24 ubuntu kernel: [107061.602735] do_mlock+0xdc/0x26c
May 19 08:46:24 ubuntu kernel: [107061.602750] __arm64_sys_mlock+0x20/0x30
May 19 08:46:24 ubuntu kernel: [107061.602765] invoke_syscall+0x50/0x120
May 19 08:46:24 ubuntu kernel: [107061.602784] el0_svc_common.constprop.0+0x6c/0x1a0
May 19 08:46:24 ubuntu kernel: [107061.602803] do_el0_svc+0x30/0xb0
May 19 08:46:24 ubuntu kernel: [107061.602820] el0_svc+0x4c/0x170
May 19 08:46:24 ubuntu kernel: [107061.602837] el0t_64_sync_handler+0xa4/0x130
May 19 08:46:24 ubuntu kernel: [107061.602854] el0t_64_sync+0x1a4/0x1a8
May 19 08:46:24 ubuntu kernel: [107061.602888] Mem-Info:
May 19 08:46:24 ubuntu kernel: [107061.602905] active_anon:735 inactive_anon:16569 isolated_anon:0
May 19 08:46:24 ubuntu kernel: [107061.602905] active_file:36 inactive_file:28 isolated_file:0
May 19 08:46:24 ubuntu kernel: [107061.602905] unevictable:185356 dirty:0 writeback:0
May 19 08:46:24 ubuntu kernel: [107061.602905] slab_reclaimable:6070 slab_unreclaimable:10550
May 19 08:46:24 ubuntu kernel: [107061.602905] mapped:1869 shmem:749 pagetables:923 bounce:0
May 19 08:46:24 ubuntu kernel: [107061.602905] kernel_misc_reclaimable:0
May 19 08:46:24 ubuntu kernel: [107061.602905] free:5609 free_pcp:0 free_cma:0
May 19 08:46:24 ubuntu kernel: [107061.602949] Node 0 active_anon:2940kB inactive_anon:66276kB active_file:144kB inactive_file:112kB unevictable:741424kB isolated(anon):0kB isolated(file):0kB mapped:7476kB dirty:0kB writeback:0kB shmem:2996kB >
May 19 08:46:24 ubuntu kernel: [107061.602992] DMA free:22436kB min:24576kB low:30208kB high:35840kB reserved_highatomic:0KB active_anon:2940kB inactive_anon:66276kB active_file:196kB inactive_file:292kB unevictable:741332kB writepending:0kB p>
May 19 08:46:24 ubuntu kernel: [107061.603035] lowmem_reserve[]: 0 0 0 0
May 19 08:46:24 ubuntu kernel: [107061.603114] DMA: 1113*4kB (UME) 633*8kB (UME) 296*16kB (UME) 129*32kB (UME) 48*64kB (UME) 11*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22860kB
May 19 08:46:24 ubuntu kernel: [107061.603406] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
May 19 08:46:24 ubuntu kernel: [107061.603428] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
May 19 08:46:24 ubuntu kernel: [107061.603449] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
May 19 08:46:24 ubuntu kernel: [107061.603469] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
May 19 08:46:24 ubuntu kernel: [107061.603489] 2704 total pagecache pages
May 19 08:46:24 ubuntu kernel: [107061.603504] 0 pages in swap cache
May 19 08:46:24 ubuntu kernel: [107061.603518] Swap cache stats: add 0, delete 0, find 0/0
May 19 08:46:24 ubuntu kernel: [107061.603536] Free swap = 0kB
May 19 08:46:24 ubuntu kernel: [107061.603550] Total swap = 0kB
May 19 08:46:24 ubuntu kernel: [107061.603565] 242688 pages RAM
May 19 08:46:24 ubuntu kernel: [107061.603580] 0 pages HighMem/MovableOnly
May 19 08:46:24 ubuntu kernel: [107061.603594] 10931 pages reserved
May 19 08:46:24 ubuntu kernel: [107061.603609] 16384 pages cma reserved
May 19 08:46:24 ubuntu kernel: [107061.603624] Tasks state (memory values in pages):
May 19 08:46:24 ubuntu kernel: [107061.603638] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
May 19 08:46:24 ubuntu kernel: [107061.603685] [ 379] 0 379 12038 852 94208 0 -250 systemd-journal
May 19 08:46:24 ubuntu kernel: [107061.603716] [ 406] 0 406 72414 6415 118784 0 -1000 multipathd
May 19 08:46:24 ubuntu kernel: [107061.603745] [ 420] 0 420 5982 942 69632 0 -1000 systemd-udevd
May 19 08:46:24 ubuntu kernel: [107061.603789] [ 553] 103 553 22163 732 77824 0 0 systemd-timesyn
May 19 08:46:24 ubuntu kernel: [107061.603819] [ 612] 100 612 4068 777 73728 0 0 systemd-network
May 19 08:46:24 ubuntu kernel: [107061.603847] [ 614] 101 614 6339 1633 90112 0 0 systemd-resolve
May 19 08:46:24 ubuntu kernel: [107061.603875] [ 625] 102 625 2267 838 57344 0 -900 dbus-daemon
May 19 08:46:24 ubuntu kernel: [107061.603904] [ 629] 0 629 20487 611 65536 0 0 irqbalance
May 19 08:46:24 ubuntu kernel: [107061.603933] [ 634] 0 634 8236 2733 114688 0 0 networkd-dispat
May 19 08:46:24 ubuntu kernel: [107061.603961] [ 640] 104 640 55504 826 81920 0 0 rsyslogd
May 19 08:46:24 ubuntu kernel: [107061.603989] [ 644] 0 644 366640 2855 249856 0 -900 snapd
May 19 08:46:24 ubuntu kernel: [107061.604017] [ 653] 0 653 3887 791 69632 0 0 systemd-logind
May 19 08:46:24 ubuntu kernel: [107061.604045] [ 655] 0 655 3809 626 73728 0 0 wpa_supplicant
May 19 08:46:24 ubuntu kernel: [107061.604073] [ 683] 0 683 1727 501 45056 0 0 cron
May 19 08:46:24 ubuntu kernel: [107061.604100] [ 703] 0 703 27482 2589 110592 0 0 unattended-upgr
May 19 08:46:24 ubuntu kernel: [107061.604128] [ 710] 0 710 1408 126 53248 0 0 agetty
May 19 08:46:24 ubuntu kernel: [107061.604155] [ 712] 0 712 1397 139 49152 0 0 agetty
May 19 08:46:24 ubuntu kernel: [107061.604183] [ 720] 0 720 3788 1039 69632 0 -1000 sshd
May 19 08:46:24 ubuntu kernel: [107061.604211] [ 844] 0 844 559 44 36864 0 0 hciattach
May 19 08:46:24 ubuntu kernel: [107061.604239] [ 856] 0 856 2384 602 61440 0 0 bluetoothd
May 19 08:46:24 ubuntu kernel: [107061.604266] [ 1172] 0 1172 74368 1369 167936 0 0 packagekitd
May 19 08:46:24 ubuntu kernel: [107061.604305] [ 1178] 0 1178 58582 814 94208 0 0 polkitd
May 19 08:46:24 ubuntu kernel: [107061.604336] [ 4481] 0 4481 4596 1078 81920 0 0 sshd
May 19 08:46:24 ubuntu kernel: [107061.604364] [ 4484] 1000 4484 4559 1187 73728 0 0 systemd
May 19 08:46:24 ubuntu kernel: [107061.604391] [ 4485] 1000 4485 42829 1235 110592 0 0 (sd-pam)
May 19 08:46:24 ubuntu kernel: [107061.604421] [ 4571] 1000 4571 4631 881 81920 0 0 sshd
May 19 08:46:24 ubuntu kernel: [107061.604448] [ 4572] 1000 4572 2147 846 53248 0 0 bash
May 19 08:46:24 ubuntu kernel: [107061.604481] [ 4674] 1000 4674 3345 616 61440 0 0 sudo
May 19 08:46:24 ubuntu kernel: [107061.604509] [ 4675] 1000 4675 3345 172 61440 0 0 sudo
May 19 08:46:24 ubuntu kernel: [107061.604536] [ 4676] 0 4676 1725546 180701 1495040 0 0 main
May 19 08:46:24 ubuntu kernel: [107061.604563] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-39.scope,task=main,pid=4676,uid=0
May 19 08:46:24 ubuntu kernel: [107061.604827] Out of memory: Killed process 4676 (main) total-vm:6902184kB, anon-rss:721280kB, file-rss:1524kB, shmem-rss:0kB, UID:0 pgtables:1460kB oom_score_adj:0
May 19 08:46:25 ubuntu systemd[1]: session-39.scope: A process of this unit has been killed by the OOM killer.
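The page counts in the report above are easier to read once converted to MiB. The sketch below does that for a few of the figures, assuming 4 KiB pages (an assumption, though it is consistent with the log, where unevictable:185356 pages is also reported as 741424kB). The helper name `pages_to_mib` is made up for illustration.

```python
PAGE_KB = 4  # ASSUMPTION: 4 KiB pages, consistent with 185356 pages = 741424kB in the log

def pages_to_mib(pages, page_kb=PAGE_KB):
    """Convert a page count from the OOM report to MiB."""
    return pages * page_kb / 1024

# Page counts copied from the kernel report above.
report = {
    "total RAM": 242688,
    "reserved": 10931,
    "cma reserved": 16384,
    "unevictable": 185356,
    "free": 5609,
    "main rss (pid 4676)": 180701,
}

for name, pages in report.items():
    print(f"{name:>20}: {pages_to_mib(pages):7.1f} MiB")

# Share of all RAM that was unevictable when the OOM killer ran.
locked_share = report["unevictable"] / report["total RAM"]
print(f"unevictable share of RAM: {locked_share:.0%}")
```

Read this way, the board has about 948 MiB of RAM in total, of which about 724 MiB was unevictable at the time of the kill. Together with the `do_mlock` frames in the call trace and the absence of swap (Total swap = 0kB), this suggests the process ran out of memory while locking its allocation into RAM.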