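2 GB of RAM, no GPU, and the process gets Killed. How can I track down the cause?
I have already run "export JT_SAVE_MEM=1", but it still fails.
The terminal output is as follows: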
(AI) root@:/home/code/JittorLLMs# python cli_demo.py chatglm
[i 0526 22:51:31.788926 16 compiler.py:955] Jittor(1.3.7.16) src: /opt/conda/envs/AI/lib/python3.9/site-packages/jittor
[i 0526 22:51:31.793616 16 compiler.py:956] g++ at /usr/bin/g++(8.4.0)
[i 0526 22:51:31.794216 16 compiler.py:957] cache_path: /root/.cache/jittor/jt1.3.7/g++8.4.0/py3.9.1/Linux-5.4.0-88xa9/IntelRXeonRCPUx4f/default
[i 0526 22:51:31.808813 16 init.py:411] Found addr2line(2.31.1) at /usr/bin/addr2line.
[i 0526 22:51:32.260858 16 init.py:227] Total mem: 1.94GB, using 1 procs for compiling.
Compiling jittor_core(28/150) used: 2.189s
... (progress updates for 29/150 through 149/150 omitted)
Compiling jittor_core(150/150) used: 16.218s eta: 0.000s
[i 0526 22:51:49.020603 16 jit_compiler.cc:28] Load cc_path: /usr/bin/g++
2023-05-26 22:51:50.796404: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-05-26 22:51:51.334188: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-05-26 22:51:51.336342: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-26 22:51:53.191343: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Loading checkpoint shards: 0%| | 0/8 [00:00<Killed
(AI) root@:/home/code/JittorLLMs#
How do I find the cause? Thanks!
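A bare "Killed" with no Python traceback, on a machine reporting "Total mem: 1.94GB", usually means the Linux kernel's OOM killer ended the process, although the log above does not prove it. One way to check is to look for the kernel's report in the output of `dmesg`. The sketch below is only an illustration: the helper names `find_oom_kills` and `read_kernel_log` are made up here, and reading the kernel log may need root or may be blocked inside a container.

```python
import re
import subprocess

# The kernel's OOM report contains a fragment such as
# "Killed process 1234 (python) total-vm:...". Only that part is matched,
# because the text around it differs between kernel versions.
OOM_LINE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log: str):
    """Return a list of (pid, process_name) for each OOM kill found in the text."""
    return [(int(m.group(1)), m.group(2)) for m in OOM_LINE.finditer(kernel_log)]


def read_kernel_log() -> str:
    """Return the kernel ring buffer as text (hypothetical helper; may need root)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True)
    return result.stdout
```

Calling `find_oom_kills(read_kernel_log())` right after the crash and seeing the `python` process in the result would confirm that the machine ran out of memory while loading the checkpoint shards. An empty list means the kill came from somewhere else, or that the kernel log is not readable from where the script runs.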
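Check your Jittor version (1.3.7.16). On my machine, before I added the limit, asking a single question got the process Killed automatically. After adding it, it works, but it is slow: about 20 minutes per question, even for a simple one (8 cores, 16 GB RAM, 1 TB SSD).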
Same question: how did you solve it? I am running a virtual machine with 12 GB of RAM and no GPU, and it also keeps getting Killed.
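For a rough sense of why 2 GB and 12 GB machines both get Killed, it can help to compare the memory the system can actually provide with the size of the model weights. The sketch below parses `/proc/meminfo`-style text and computes the headroom. The parameter count (about 6.2 billion for ChatGLM-6B) and 2 bytes per parameter (fp16) are assumptions, not numbers taken from this thread, and the function names are hypothetical. JittorLLMs may also offload weights to disk, so treat the result as a coarse estimate only.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into a {field_name: bytes} dict."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if not parts:
            continue  # skip blank or malformed lines
        value = int(parts[0])
        # Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        info[name.strip()] = value
    return info


def weights_bytes(n_params: float, bytes_per_param: int = 2) -> int:
    """Approximate size of the model weights (fp16 assumed by default)."""
    return int(n_params * bytes_per_param)


def headroom(meminfo: dict, required: int) -> int:
    """Bytes left over (negative if short) after loading `required` bytes."""
    available = meminfo.get("MemAvailable", 0) + meminfo.get("SwapFree", 0)
    return available - required
```

On a real machine the input would come from `open("/proc/meminfo").read()`. A negative `headroom(info, weights_bytes(6.2e9))` suggests the load can only succeed if weights are spilled to disk or swap is added, which would also fit the very slow answers reported above.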