(Crashing on Low Memory SBC) main invoked oom-killer: gfp_mask=0x1100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0

Open unclemusclez opened this issue 1 year ago • 0 comments

Is there any way that main and worker could be separated, so that I can use a cluster of 8 RPi 3B+ boards for the compute while the scheduling is offloaded to another device with more memory? I understand this is most likely not a priority. Perhaps a smaller model would work, such as https://github.com/jzhang38/TinyLlama?

main:

ubuntu@ubuntu:~/distributed-llama$ sudo main chat --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --model ~/dllama_meta-llama-3-8b_q40.bin --tokenizer ~/dllama-llama3-tokenizer.t --workers 192.168.2.212:9998 192.168.2.213:9998 192.168.2.214:9998 192.168.2.215:
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 8
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
Killed
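
For a sense of scale, a rough back-of-the-envelope estimate of the weight size can be derived from the dimensions printed above. This is only a sketch: it assumes the standard Llama layer layout (four attention matrices plus three feed-forward matrices per layer, untied embedding and classifier) and that q40 stores 32 weights in an 18-byte block. It does not model how distributed-llama actually slices the weights or what extra state the root node keeps, and the function names are made up for illustration.

```python
# Rough estimate only; the layer layout and Q40 block size are assumptions,
# not taken from the distributed-llama source.

Q40_BLOCK_WEIGHTS = 32  # assumed: weights per Q40 block
Q40_BLOCK_BYTES = 18    # assumed: 16 bytes of 4-bit values + 2-byte scale


def llama_param_count(dim, hidden_dim, n_layers, n_heads, n_kv_heads, vocab_size):
    """Count parameters for a Llama-style model with grouped-query attention."""
    kv_dim = dim * n_kv_heads // n_heads
    attention = 2 * dim * dim + 2 * dim * kv_dim  # wq, wo + wk, wv
    feed_forward = 3 * dim * hidden_dim           # w1, w2, w3
    norms = 2 * dim                               # two RMS norms per layer
    per_layer = attention + feed_forward + norms
    embedding = vocab_size * dim
    classifier = vocab_size * dim
    final_norm = dim
    return embedding + n_layers * per_layer + final_norm + classifier


def q40_bytes(n_params):
    """Bytes needed if every parameter were stored as Q40 (a simplification)."""
    return n_params * Q40_BLOCK_BYTES // Q40_BLOCK_WEIGHTS


if __name__ == "__main__":
    # Values copied from the model header printed by main.
    params = llama_param_count(
        dim=4096, hidden_dim=14336, n_layers=32,
        n_heads=32, n_kv_heads=8, vocab_size=128256,
    )
    total = q40_bytes(params)
    n_slices = 8
    print(f"parameters: {params:,}")
    print(f"Q40 weights: {total / 2**20:.0f} MiB total, "
          f"{total / n_slices / 2**20:.0f} MiB per slice if split evenly")
```

Under these assumptions the model has about 8.03 billion parameters and roughly 4.2 GiB of Q40 weights, or a little over 500 MiB per node if the split were perfectly even, before the KV cache and buffers are counted. That is already a large share of the 1 GB on an RPi 3B+.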

worker:

ubuntu@ubuntu:~$ sudo nice -n -20 main worker --port 9998 --nthreads 4
Listening on 0.0.0.0:9998...
Client connected
terminate called after throwing an instance of 'ReadSocketException'
  what():  std::exception
Aborted
May 19 08:46:24 ubuntu kernel: [107061.602328] main invoked oom-killer: gfp_mask=0x1100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
May 19 08:46:24 ubuntu kernel: [107061.602392] CPU: 0 PID: 4676 Comm: main Tainted: G         C  E     5.15.0-1055-raspi #58-Ubuntu
May 19 08:46:24 ubuntu kernel: [107061.602412] Hardware name: Raspberry Pi 3 Model B Plus Rev 1.3 (DT)
May 19 08:46:24 ubuntu kernel: [107061.602423] Call trace:
May 19 08:46:24 ubuntu kernel: [107061.602430]  dump_backtrace+0x0/0x200
May 19 08:46:24 ubuntu kernel: [107061.602455]  show_stack+0x20/0x30
May 19 08:46:24 ubuntu kernel: [107061.602470]  dump_stack_lvl+0x8c/0xb8
May 19 08:46:24 ubuntu kernel: [107061.602490]  dump_stack+0x18/0x34
May 19 08:46:24 ubuntu kernel: [107061.602506]  dump_header+0x54/0x21c
May 19 08:46:24 ubuntu kernel: [107061.602520]  oom_kill_process+0x22c/0x230
May 19 08:46:24 ubuntu kernel: [107061.602539]  out_of_memory+0xf4/0x370
May 19 08:46:24 ubuntu kernel: [107061.602554]  __alloc_pages_slowpath.constprop.0+0x604/0x8e0
May 19 08:46:24 ubuntu kernel: [107061.602574]  __alloc_pages+0x29c/0x320
May 19 08:46:24 ubuntu kernel: [107061.602590]  alloc_zeroed_user_highpage_movable+0x40/0x50
May 19 08:46:24 ubuntu kernel: [107061.602607]  do_anonymous_page+0x88/0x4ec
May 19 08:46:24 ubuntu kernel: [107061.602628]  handle_pte_fault+0x170/0x1c0
May 19 08:46:24 ubuntu kernel: [107061.602642]  __handle_mm_fault+0x1d0/0x350
May 19 08:46:24 ubuntu kernel: [107061.602655]  handle_mm_fault+0x108/0x294
May 19 08:46:24 ubuntu kernel: [107061.602669]  faultin_page+0x84/0x150
May 19 08:46:24 ubuntu kernel: [107061.602685]  __get_user_pages+0x194/0x2c0
May 19 08:46:24 ubuntu kernel: [107061.602701]  populate_vma_page_range+0x64/0x70
May 19 08:46:24 ubuntu kernel: [107061.602719]  __mm_populate+0xc4/0x1d0
May 19 08:46:24 ubuntu kernel: [107061.602735]  do_mlock+0xdc/0x26c
May 19 08:46:24 ubuntu kernel: [107061.602750]  __arm64_sys_mlock+0x20/0x30
May 19 08:46:24 ubuntu kernel: [107061.602765]  invoke_syscall+0x50/0x120
May 19 08:46:24 ubuntu kernel: [107061.602784]  el0_svc_common.constprop.0+0x6c/0x1a0
May 19 08:46:24 ubuntu kernel: [107061.602803]  do_el0_svc+0x30/0xb0
May 19 08:46:24 ubuntu kernel: [107061.602820]  el0_svc+0x4c/0x170
May 19 08:46:24 ubuntu kernel: [107061.602837]  el0t_64_sync_handler+0xa4/0x130
May 19 08:46:24 ubuntu kernel: [107061.602854]  el0t_64_sync+0x1a4/0x1a8
May 19 08:46:24 ubuntu kernel: [107061.602888] Mem-Info:
May 19 08:46:24 ubuntu kernel: [107061.602905] active_anon:735 inactive_anon:16569 isolated_anon:0
May 19 08:46:24 ubuntu kernel: [107061.602905]  active_file:36 inactive_file:28 isolated_file:0
May 19 08:46:24 ubuntu kernel: [107061.602905]  unevictable:185356 dirty:0 writeback:0
May 19 08:46:24 ubuntu kernel: [107061.602905]  slab_reclaimable:6070 slab_unreclaimable:10550
May 19 08:46:24 ubuntu kernel: [107061.602905]  mapped:1869 shmem:749 pagetables:923 bounce:0
May 19 08:46:24 ubuntu kernel: [107061.602905]  kernel_misc_reclaimable:0
May 19 08:46:24 ubuntu kernel: [107061.602905]  free:5609 free_pcp:0 free_cma:0
May 19 08:46:24 ubuntu kernel: [107061.602949] Node 0 active_anon:2940kB inactive_anon:66276kB active_file:144kB inactive_file:112kB unevictable:741424kB isolated(anon):0kB isolated(file):0kB mapped:7476kB dirty:0kB writeback:0kB shmem:2996kB >
May 19 08:46:24 ubuntu kernel: [107061.602992] DMA free:22436kB min:24576kB low:30208kB high:35840kB reserved_highatomic:0KB active_anon:2940kB inactive_anon:66276kB active_file:196kB inactive_file:292kB unevictable:741332kB writepending:0kB p>
May 19 08:46:24 ubuntu kernel: [107061.603035] lowmem_reserve[]: 0 0 0 0
May 19 08:46:24 ubuntu kernel: [107061.603114] DMA: 1113*4kB (UME) 633*8kB (UME) 296*16kB (UME) 129*32kB (UME) 48*64kB (UME) 11*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22860kB
May 19 08:46:24 ubuntu kernel: [107061.603406] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
May 19 08:46:24 ubuntu kernel: [107061.603428] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
May 19 08:46:24 ubuntu kernel: [107061.603449] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
May 19 08:46:24 ubuntu kernel: [107061.603469] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
May 19 08:46:24 ubuntu kernel: [107061.603489] 2704 total pagecache pages
May 19 08:46:24 ubuntu kernel: [107061.603504] 0 pages in swap cache
May 19 08:46:24 ubuntu kernel: [107061.603518] Swap cache stats: add 0, delete 0, find 0/0
May 19 08:46:24 ubuntu kernel: [107061.603536] Free swap  = 0kB
May 19 08:46:24 ubuntu kernel: [107061.603550] Total swap = 0kB
May 19 08:46:24 ubuntu kernel: [107061.603565] 242688 pages RAM
May 19 08:46:24 ubuntu kernel: [107061.603580] 0 pages HighMem/MovableOnly
May 19 08:46:24 ubuntu kernel: [107061.603594] 10931 pages reserved
May 19 08:46:24 ubuntu kernel: [107061.603609] 16384 pages cma reserved
May 19 08:46:24 ubuntu kernel: [107061.603624] Tasks state (memory values in pages):
May 19 08:46:24 ubuntu kernel: [107061.603638] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
May 19 08:46:24 ubuntu kernel: [107061.603685] [    379]     0   379    12038      852    94208        0          -250 systemd-journal
May 19 08:46:24 ubuntu kernel: [107061.603716] [    406]     0   406    72414     6415   118784        0         -1000 multipathd
May 19 08:46:24 ubuntu kernel: [107061.603745] [    420]     0   420     5982      942    69632        0         -1000 systemd-udevd
May 19 08:46:24 ubuntu kernel: [107061.603789] [    553]   103   553    22163      732    77824        0             0 systemd-timesyn
May 19 08:46:24 ubuntu kernel: [107061.603819] [    612]   100   612     4068      777    73728        0             0 systemd-network
May 19 08:46:24 ubuntu kernel: [107061.603847] [    614]   101   614     6339     1633    90112        0             0 systemd-resolve
May 19 08:46:24 ubuntu kernel: [107061.603875] [    625]   102   625     2267      838    57344        0          -900 dbus-daemon
May 19 08:46:24 ubuntu kernel: [107061.603904] [    629]     0   629    20487      611    65536        0             0 irqbalance
May 19 08:46:24 ubuntu kernel: [107061.603933] [    634]     0   634     8236     2733   114688        0             0 networkd-dispat
May 19 08:46:24 ubuntu kernel: [107061.603961] [    640]   104   640    55504      826    81920        0             0 rsyslogd
May 19 08:46:24 ubuntu kernel: [107061.603989] [    644]     0   644   366640     2855   249856        0          -900 snapd
May 19 08:46:24 ubuntu kernel: [107061.604017] [    653]     0   653     3887      791    69632        0             0 systemd-logind
May 19 08:46:24 ubuntu kernel: [107061.604045] [    655]     0   655     3809      626    73728        0             0 wpa_supplicant
May 19 08:46:24 ubuntu kernel: [107061.604073] [    683]     0   683     1727      501    45056        0             0 cron
May 19 08:46:24 ubuntu kernel: [107061.604100] [    703]     0   703    27482     2589   110592        0             0 unattended-upgr
May 19 08:46:24 ubuntu kernel: [107061.604128] [    710]     0   710     1408      126    53248        0             0 agetty
May 19 08:46:24 ubuntu kernel: [107061.604155] [    712]     0   712     1397      139    49152        0             0 agetty
May 19 08:46:24 ubuntu kernel: [107061.604183] [    720]     0   720     3788     1039    69632        0         -1000 sshd
May 19 08:46:24 ubuntu kernel: [107061.604211] [    844]     0   844      559       44    36864        0             0 hciattach
May 19 08:46:24 ubuntu kernel: [107061.604239] [    856]     0   856     2384      602    61440        0             0 bluetoothd
May 19 08:46:24 ubuntu kernel: [107061.604266] [   1172]     0  1172    74368     1369   167936        0             0 packagekitd
May 19 08:46:24 ubuntu kernel: [107061.604305] [   1178]     0  1178    58582      814    94208        0             0 polkitd
May 19 08:46:24 ubuntu kernel: [107061.604336] [   4481]     0  4481     4596     1078    81920        0             0 sshd
May 19 08:46:24 ubuntu kernel: [107061.604364] [   4484]  1000  4484     4559     1187    73728        0             0 systemd
May 19 08:46:24 ubuntu kernel: [107061.604391] [   4485]  1000  4485    42829     1235   110592        0             0 (sd-pam)
May 19 08:46:24 ubuntu kernel: [107061.604421] [   4571]  1000  4571     4631      881    81920        0             0 sshd
May 19 08:46:24 ubuntu kernel: [107061.604448] [   4572]  1000  4572     2147      846    53248        0             0 bash
May 19 08:46:24 ubuntu kernel: [107061.604481] [   4674]  1000  4674     3345      616    61440        0             0 sudo
May 19 08:46:24 ubuntu kernel: [107061.604509] [   4675]  1000  4675     3345      172    61440        0             0 sudo
May 19 08:46:24 ubuntu kernel: [107061.604536] [   4676]     0  4676  1725546   180701  1495040        0             0 main
May 19 08:46:24 ubuntu kernel: [107061.604563] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-39.scope,task=main,pid=4676,uid=0
May 19 08:46:24 ubuntu kernel: [107061.604827] Out of memory: Killed process 4676 (main) total-vm:6902184kB, anon-rss:721280kB, file-rss:1524kB, shmem-rss:0kB, UID:0 pgtables:1460kB oom_score_adj:0
May 19 08:46:25 ubuntu systemd[1]: session-39.scope: A process of this unit has been killed by the OOM killer.
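
The figures in the report above are easier to compare once they are in the same unit. Here is a small sketch for doing that, assuming 4 KiB pages (consistent with unevictable:185356 pages being reported as 741424kB). The regular expression and helper names are illustrative and only cover the "Out of memory: Killed process" line format seen in this log.

```python
import re

PAGE_KIB = 4  # assumption: 4 KiB pages, matching 185356 pages == 741424kB above

# Matches the kernel's final OOM summary line as it appears in this log.
OOM_KILL_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB, shmem-rss:(?P<shmem_rss>\d+)kB"
)


def pages_to_kib(pages):
    """Convert a page count from the OOM report to KiB."""
    return pages * PAGE_KIB


def pages_to_mib(pages):
    """Convert a page count from the OOM report to MiB."""
    return pages * PAGE_KIB / 1024


def parse_oom_kill(line):
    """Extract pid, name, and memory figures (KiB) from an OOM kill line.

    Returns None if the line is not an OOM kill summary.
    """
    match = OOM_KILL_RE.search(line)
    if match is None:
        return None
    return {
        "pid": int(match.group("pid")),
        "name": match.group("name"),
        "total_vm_kib": int(match.group("total_vm")),
        "anon_rss_kib": int(match.group("anon_rss")),
        "file_rss_kib": int(match.group("file_rss")),
        "shmem_rss_kib": int(match.group("shmem_rss")),
    }


if __name__ == "__main__":
    print(f"RAM: {pages_to_mib(242688):.0f} MiB")       # "242688 pages RAM"
    print(f"main RSS: {pages_to_mib(180701):.0f} MiB")  # rss column for pid 4676
```

Applied to this log, 242688 pages of RAM is 948 MiB, and the killed main process had an RSS of 180701 pages, roughly 706 MiB, in line with the 741424kB reported as unevictable. The call trace shows the failing allocation happening while pages were being populated for an mlock call (do_mlock, __mm_populate).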

unclemusclez • May 19 '24 09:05