Running the official chatglm3-6b finetune command fails with an error saying the kernel needs an update
I pulled both the official chatglm3-6b model and code from https://modelscope.cn/models/ZhipuAI/chatglm3-6b/summary twice, but running finetune fails with an error saying the kernel needs an update:
!CUDA_VISIBLE_DEVICES=0 NCCL_P2P_DISABLE="1" NCCL_IB_DISABLE="1" python finetune_hf.py data/AdvertiseGen_fix /mnt/workspace/chatglm3-6b configs/lora.yaml
Detected kernel version 4.19.91, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. The full log is as follows:
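(Note: this warning is emitted by the transformers Trainer when the Linux kernel release is older than 5.5.0; it flags a potential hang, and is separate from the fatal CUDA error further down. A minimal sketch of such a version check, with illustrative names rather than the exact transformers implementation:)

```python
import platform

def kernel_below(release: str, minimum=(5, 5, 0)) -> bool:
    """Return True if a Linux kernel release string such as '4.19.91'
    or '5.15.0-91-generic' is older than `minimum`.
    Only the leading dotted-numeric part is compared."""
    numeric = release.split("-")[0]
    parts = tuple(int(p) for p in numeric.split(".")[:3])
    return parts < minimum

if __name__ == "__main__":
    # On the machine from this log, platform.release() starts with '4.19.91',
    # so the check fires and the Trainer prints the warning.
    print(kernel_below(platform.release()))
```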
2024-04-25 14:56:59.070864: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-04-25 14:56:59.073797: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-25 14:56:59.105362: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-25 14:56:59.105394: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-25 14:56:59.105413: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-25 14:56:59.111071: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-25 14:56:59.111270: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-25 14:57:00.457944: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Loading checkpoint shards: 100%|██████████████████| 7/7 [00:35<00:00, 5.10s/it]
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model
--> model has 1.949696M params
Map (num_proc=16): 100%|██████| 114599/114599 [00:04<00:00, 24420.30 examples/s]
train_dataset: Dataset({
features: ['input_ids', 'labels'],
num_rows: 114599
})
Map (num_proc=16): 100%|███████████| 1070/1070 [00:00<00:00, 1333.83 examples/s]
val_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
Map (num_proc=16): 100%|███████████| 1070/1070 [00:00<00:00, 1387.08 examples/s]
test_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
--> Sanity check
'[gMASK]': 64790 -> -100
'sop': 64792 -> -100
'<|user|>': 64795 -> -100
'': 30910 -> -100
'\n': 13 -> -100
'': 30910 -> -100
'类型': 33467 -> -100
'#': 31010 -> -100
'裤': 56532 -> -100
'': 30998 -> -100
'版': 55090 -> -100
'型': 54888 -> -100
'#': 31010 -> -100
'宽松': 40833 -> -100
'': 30998 -> -100
'风格': 32799 -> -100
'#': 31010 -> -100
'性感': 40589 -> -100
'': 30998 -> -100
'图案': 37505 -> -100
'#': 31010 -> -100
'线条': 37216 -> -100
'': 30998 -> -100
'裤': 56532 -> -100
'型': 54888 -> -100
'#': 31010 -> -100
'阔': 56529 -> -100
'腿': 56158 -> -100
'裤': 56532 -> -100
'<|assistant|>': 64796 -> -100
'': 30910 -> 30910
'\n': 13 -> 13
'': 30910 -> 30910
'宽松': 40833 -> 40833
'的': 54530 -> 54530
'阔': 56529 -> 56529
'腿': 56158 -> 56158
'裤': 56532 -> 56532
'这': 54551 -> 54551
'两年': 33808 -> 33808
'真的': 32041 -> 32041
'吸': 55360 -> 55360
'粉': 55486 -> 55486
'不少': 32138 -> 32138
',': 31123 -> 31123
'明星': 32943 -> 32943
'时尚': 33481 -> 33481
'达': 54880 -> 54880
'人的': 31664 -> 31664
'心头': 46565 -> 46565
'爱': 54799 -> 54799
'。': 31155 -> 31155
'毕竟': 33051 -> 33051
'好': 54591 -> 54591
'穿': 55432 -> 55432
'时尚': 33481 -> 33481
',': 31123 -> 31123
'谁': 55622 -> 55622
'都能': 32904 -> 32904
'穿': 55432 -> 55432
'出': 54557 -> 54557
'腿': 56158 -> 56158
'长': 54625 -> 54625
'2': 30943 -> 30943
'米': 55055 -> 55055
'的效果': 35590 -> 35590
'宽松': 40833 -> 40833
'的': 54530 -> 54530
'裤': 56532 -> 56532
'腿': 56158 -> 56158
',': 31123 -> 31123
'当然是': 48466 -> 48466
'遮': 57148 -> 57148
'肉': 55343 -> 55343
'小': 54603 -> 54603
'能手': 49355 -> 49355
'啊': 55674 -> 55674
'。': 31155 -> 31155
'上身': 51605 -> 51605
'随': 55119 -> 55119
'性': 54642 -> 54642
'自然': 31799 -> 31799
'不': 54535 -> 54535
'拘': 57036 -> 57036
'束': 55625 -> 55625
',': 31123 -> 31123
'面料': 46839 -> 46839
'亲': 55113 -> 55113
'肤': 56089 -> 56089
'舒适': 33894 -> 33894
'贴': 55778 -> 55778
'身体': 31902 -> 31902
'验': 55017 -> 55017
'感': 54706 -> 54706
'棒': 56382 -> 56382
'棒': 56382 -> 56382
'哒': 59230 -> 59230
'。': 31155 -> 31155
'系': 54712 -> 54712
'带': 54882 -> 54882
'部分': 31726 -> 31726
'增加': 31917 -> 31917
'设计': 31735 -> 31735
'看点': 45032 -> 45032
',': 31123 -> 31123
'还': 54656 -> 54656
'让': 54772 -> 54772
'单品': 46539 -> 46539
'的设计': 34481 -> 34481
'感': 54706 -> 54706
'更强': 43084 -> 43084
'。': 31155 -> 31155
'腿部': 46799 -> 46799
'线条': 37216 -> 37216
'若': 55351 -> 55351
'隐': 55733 -> 55733
'若': 55351 -> 55351
'现': 54600 -> 54600
'的': 54530 -> 54530
',': 31123 -> 31123
'性感': 40589 -> 40589
'撩': 58521 -> 58521
'人': 54533 -> 54533
'。': 31155 -> 31155
'颜色': 33692 -> 33692
'敲': 57004 -> 57004
'温柔': 34678 -> 34678
'的': 54530 -> 54530
',': 31123 -> 31123
'与': 54619 -> 54619
'裤子': 44722 -> 44722
'本身': 32754 -> 32754
'所': 54626 -> 54626
'呈现': 33169 -> 33169
'的风格': 48084 -> 48084
'有点': 33149 -> 33149
'反': 54955 -> 54955
'差': 55342 -> 55342
'萌': 56842 -> 56842
'。': 31155 -> 31155
'': 2 -> 2
Detected kernel version 4.19.91, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /mnt/workspace/finetune_hf.py:517 in main │
│ │
│ 514 │ model.gradient_checkpointing_enable() │
│ 515 │ model.enable_input_require_grads() │
│ 516 │ │
│ ❱ 517 │ trainer = Seq2SeqTrainer( │
│ 518 │ │ model=model, │
│ 519 │ │ args=ft_config.training_args, │
│ 520 │ │ data_collator=DataCollatorForSeq2Seq( │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer_seq2seq.py:57 │
│ in __init__ │
│ │
│ 54 │ │ optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_schedu │
│ 55 │ │ preprocess_logits_for_metrics: Optional[Callable[[torch.Tensor │
│ 56 │ ): │
│ ❱ 57 │ │ super().__init__( │
│ 58 │ │ │ model=model, │
│ 59 │ │ │ args=args, │
│ 60 │ │ │ data_collator=data_collator, │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:514 in │
│ __init__ │
│ │
│ 511 │ │ │ self.place_model_on_device │
│ 512 │ │ │ and not getattr(model, "quantization_method", None) == Qu │
│ 513 │ │ ): │
│ ❱ 514 │ │ │ self._move_model_to_device(model, args.device) │
│ 515 │ │ │
│ 516 │ │ # Force n_gpu to 1 to avoid DataParallel as MP will manage th │
│ 517 │ │ if self.is_model_parallel: │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:757 in │
│ _move_model_to_device │
│ │
│ 754 │ │ self.callback_handler.remove_callback(callback) │
│ 755 │ │
│ 756 │ def _move_model_to_device(self, model, device): │
│ ❱ 757 │ │ model = model.to(device) │
│ 758 │ │ # Moving a model to an XLA device disconnects the tied weight │
│ 759 │ │ if self.args.parallel_mode == ParallelMode.TPU and hasattr(mo │
│ 760 │ │ │ model.tie_weights() │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1160 in │
│ to │
│ │
│ 1157 │ │ │ │ │ │ │ non_blocking, memory_format=convert_to_fo │
│ 1158 │ │ │ return t.to(device, dtype if t.is_floating_point() or t.i │
│ 1159 │ │ │
│ ❱ 1160 │ │ return self._apply(convert) │
│ 1161 │ │
│ 1162 │ def register_full_backward_pre_hook( │
│ 1163 │ │ self, │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810 in │
│ _apply │
│ │
│ 807 │ def _apply(self, fn, recurse=True): │
│ 808 │ │ if recurse: │
│ 809 │ │ │ for module in self.children(): │
│ ❱ 810 │ │ │ │ module._apply(fn) │
│ 811 │ │ │
│ 812 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 813 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810 in │
│ _apply │
│ │
│ 807 │ def _apply(self, fn, recurse=True): │
│ 808 │ │ if recurse: │
│ 809 │ │ │ for module in self.children(): │
│ ❱ 810 │ │ │ │ module._apply(fn) │
│ 811 │ │ │
│ 812 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 813 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810 in │
│ _apply │
│ │
│ 807 │ def _apply(self, fn, recurse=True): │
│ 808 │ │ if recurse: │
│ 809 │ │ │ for module in self.children(): │
│ ❱ 810 │ │ │ │ module._apply(fn) │
│ 811 │ │ │
│ 812 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 813 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810 in │
│ _apply │
│ │
│ 807 │ def _apply(self, fn, recurse=True): │
│ 808 │ │ if recurse: │
│ 809 │ │ │ for module in self.children(): │
│ ❱ 810 │ │ │ │ module._apply(fn) │
│ 811 │ │ │
│ 812 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 813 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810 in │
│ _apply │
│ │
│ 807 │ def _apply(self, fn, recurse=True): │
│ 808 │ │ if recurse: │
│ 809 │ │ │ for module in self.children(): │
│ ❱ 810 │ │ │ │ module._apply(fn) │
│ 811 │ │ │
│ 812 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 813 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:833 in │
│ _apply │
│ │
│ 830 │ │ │ # track autograd history of param_applied, so we have t │
│ 831 │ │ │ # with torch.no_grad(): │
│ 832 │ │ │ with torch.no_grad(): │
│ ❱ 833 │ │ │ │ param_applied = fn(param) │
│ 834 │ │ │ should_use_set_data = compute_should_use_set_data(param, │
│ 835 │ │ │ if should_use_set_data: │
│ 836 │ │ │ │ param.data = param_applied │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1158 in │
│ convert │
│ │
│ 1155 │ │ │ if convert_to_format is not None and t.dim() in (4, 5): │
│ 1156 │ │ │ │ return t.to(device, dtype if t.is_floating_point() or │
│ 1157 │ │ │ │ │ │ │ non_blocking, memory_format=convert_to_fo │
│ ❱ 1158 │ │ │ return t.to(device, dtype if t.is_floating_point() or t.i │
│ 1159 │ │ │
│ 1160 │ │ return self._apply(convert) │
│ 1161 │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so
the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
I followed this tutorial: https://github.com/THUDM/ChatGLM3/blob/main/finetune_demo/lora_finetune.ipynb
Could not find cuda drivers on your machine, GPU will not be used. — CUDA was not found in your environment.
Detected kernel version 4.19.91, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
It is not a problem with the CUDA environment. I am using your GPU CUDA image directly, so how could CUDA not be found? Otherwise it must be that your 3090s go down frequently.
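(Note: the actual RuntimeError above, "CUDA-capable device(s) is/are busy or unavailable", usually means the GPU selected via CUDA_VISIBLE_DEVICES is held by another process or set to exclusive compute mode, rather than that drivers are missing; the TensorFlow "Could not find cuda drivers" lines come from an unrelated TensorFlow import. Since the command pins CUDA_VISIBLE_DEVICES=0, a torch-free sketch of how the runtime interprets that mask may help narrow things down — the function name and exact edge-case semantics here are illustrative assumptions:)

```python
import os

def visible_gpu_ids(mask):
    """Interpret a CUDA_VISIBLE_DEVICES value roughly the way the CUDA
    runtime does: None -> all devices visible; "" -> no devices visible;
    otherwise a comma-separated list of indices, parsed until the first
    malformed entry."""
    if mask is None:
        return None  # unset: every device is visible
    ids = []
    for part in mask.split(","):
        part = part.strip()
        if not part.lstrip("-").isdigit():
            break  # parsing stops at the first invalid entry
        ids.append(int(part))
    return ids

if __name__ == "__main__":
    # For the failing command, this yields [0]; if device 0 is occupied by
    # another process or in exclusive mode, model.to(device) raises the
    # "busy or unavailable" error seen in the traceback.
    print(visible_gpu_ids(os.environ.get("CUDA_VISIBLE_DEVICES")))
```

Checking `nvidia-smi` on the host for processes already bound to GPU 0 (and its compute mode) would distinguish a busy device from a genuinely missing driver.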
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.