
Running the official chatglm3-6b finetune command errors out: kernel needs update

alexhmyang opened this issue on Apr 25 '24 · 3 comments

I pulled both the official chatglm3-6b model and code (https://modelscope.cn/models/ZhipuAI/chatglm3-6b/summary) twice, but running finetune still errors out saying the kernel needs an update.

!CUDA_VISIBLE_DEVICES=0 NCCL_P2P_DISABLE="1" NCCL_IB_DISABLE="1" python finetune_hf.py data/AdvertiseGen_fix /mnt/workspace/chatglm3-6b configs/lora.yaml

Detected kernel version 4.19.91, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

The log is as follows:

2024-04-25 14:56:59.070864: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-04-25 14:56:59.073797: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-25 14:56:59.105362: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-25 14:56:59.105394: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-25 14:56:59.105413: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-25 14:56:59.111071: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-25 14:56:59.111270: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-25 14:57:00.457944: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Loading checkpoint shards: 100%|██████████████████| 7/7 [00:35<00:00, 5.10s/it]
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model

--> model has 1.949696M params

Map (num_proc=16): 100%|██████| 114599/114599 [00:04<00:00, 24420.30 examples/s]
train_dataset: Dataset({ features: ['input_ids', 'labels'], num_rows: 114599 })
Map (num_proc=16): 100%|███████████| 1070/1070 [00:00<00:00, 1333.83 examples/s]
val_dataset: Dataset({ features: ['input_ids', 'output_ids'], num_rows: 1070 })
Map (num_proc=16): 100%|███████████| 1070/1070 [00:00<00:00, 1387.08 examples/s]
test_dataset: Dataset({ features: ['input_ids', 'output_ids'], num_rows: 1070 })
--> Sanity check

'[gMASK]': 64790 -> -100 'sop': 64792 -> -100 '<|user|>': 64795 -> -100 '': 30910 -> -100 '\n': 13 -> -100 '': 30910 -> -100 '类型': 33467 -> -100 '#': 31010 -> -100 '裤': 56532 -> -100 '': 30998 -> -100 '版': 55090 -> -100 '型': 54888 -> -100 '#': 31010 -> -100 '宽松': 40833 -> -100 '': 30998 -> -100 '风格': 32799 -> -100 '#': 31010 -> -100 '性感': 40589 -> -100 '': 30998 -> -100 '图案': 37505 -> -100 '#': 31010 -> -100 '线条': 37216 -> -100 '': 30998 -> -100 '裤': 56532 -> -100 '型': 54888 -> -100 '#': 31010 -> -100 '阔': 56529 -> -100 '腿': 56158 -> -100 '裤': 56532 -> -100 '<|assistant|>': 64796 -> -100 '': 30910 -> 30910 '\n': 13 -> 13 '': 30910 -> 30910 '宽松': 40833 -> 40833 '的': 54530 -> 54530 '阔': 56529 -> 56529 '腿': 56158 -> 56158 '裤': 56532 -> 56532 '这': 54551 -> 54551 '两年': 33808 -> 33808 '真的': 32041 -> 32041 '吸': 55360 -> 55360 '粉': 55486 -> 55486 '不少': 32138 -> 32138 ',': 31123 -> 31123 '明星': 32943 -> 32943 '时尚': 33481 -> 33481 '达': 54880 -> 54880 '人的': 31664 -> 31664 '心头': 46565 -> 46565 '爱': 54799 -> 54799 '。': 31155 -> 31155 '毕竟': 33051 -> 33051 '好': 54591 -> 54591 '穿': 55432 -> 55432 '时尚': 33481 -> 33481 ',': 31123 -> 31123 '谁': 55622 -> 55622 '都能': 32904 -> 32904 '穿': 55432 -> 55432 '出': 54557 -> 54557 '腿': 56158 -> 56158 '长': 54625 -> 54625 '2': 30943 -> 30943 '米': 55055 -> 55055 '的效果': 35590 -> 35590 '宽松': 40833 -> 40833 '的': 54530 -> 54530 '裤': 56532 -> 56532 '腿': 56158 -> 56158 ',': 31123 -> 31123 '当然是': 48466 -> 48466 '遮': 57148 -> 57148 '肉': 55343 -> 55343 '小': 54603 -> 54603 '能手': 49355 -> 49355 '啊': 55674 -> 55674 '。': 31155 -> 31155 '上身': 51605 -> 51605 '随': 55119 -> 55119 '性': 54642 -> 54642 '自然': 31799 -> 31799 '不': 54535 -> 54535 '拘': 57036 -> 57036 '束': 55625 -> 55625 ',': 31123 -> 31123 '面料': 46839 -> 46839 '亲': 55113 -> 55113 '肤': 56089 -> 56089 '舒适': 33894 -> 33894 '贴': 55778 -> 55778 '身体': 31902 -> 31902 '验': 55017 -> 55017 '感': 54706 -> 54706 '棒': 56382 -> 56382 '棒': 56382 -> 56382 '哒': 59230 -> 59230 '。': 31155 -> 31155 '系': 54712 -> 54712 '带': 54882 -> 54882 '部分': 31726 -> 31726 '增加': 31917 -> 31917 '设计': 31735 -> 31735 '看点': 45032 -> 45032 ',': 31123 -> 31123 '还': 54656 -> 54656 '让': 54772 -> 54772 '单品': 46539 -> 46539 '的设计': 34481 -> 34481 '感': 54706 -> 54706 '更强': 43084 -> 43084 '。': 31155 -> 31155 '腿部': 46799 -> 46799 '线条': 37216 -> 37216 '若': 55351 -> 55351 '隐': 55733 -> 55733 '若': 55351 -> 55351 '现': 54600 -> 54600 '的': 54530 -> 54530 ',': 31123 -> 31123 '性感': 40589 -> 40589 '撩': 58521 -> 58521 '人': 54533 -> 54533 '。': 31155 -> 31155 '颜色': 33692 -> 33692 '敲': 57004 -> 57004 '温柔': 34678 -> 34678 '的': 54530 -> 54530 ',': 31123 -> 31123 '与': 54619 -> 54619 '裤子': 44722 -> 44722 '本身': 32754 -> 32754 '所': 54626 -> 54626 '呈现': 33169 -> 33169 '的风格': 48084 -> 48084 '有点': 33149 -> 33149 '反': 54955 -> 54955 '差': 55342 -> 55342 '萌': 56842 -> 56842 '。': 31155 -> 31155 '': 2 -> 2

Detected kernel version 4.19.91, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
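A note on the sanity check above, for readers unfamiliar with the convention: prompt tokens are mapped to a label of -100 while response tokens keep their own ids, because -100 is the default ignore_index of PyTorch's cross-entropy loss, so the prompt contributes nothing to the fine-tuning loss. A minimal sketch (the vocabulary size here is illustrative, not the exact chatglm3-6b value):

    import torch
    import torch.nn.functional as F

    vocab_size = 65024  # illustrative only
    logits = torch.randn(4, vocab_size)             # 4 token positions
    labels = torch.tensor([-100, -100, 30910, 13])  # prompt masked, response kept

    # cross_entropy skips positions whose label equals ignore_index (-100 by default),
    # so only the last two positions contribute to the loss
    loss = F.cross_entropy(logits, labels)
    print(loss)

The startup then aborts with the traceback below.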
Traceback (most recent call last):
  /mnt/workspace/finetune_hf.py:517, in main
    trainer = Seq2SeqTrainer(
  /opt/conda/lib/python3.10/site-packages/transformers/trainer_seq2seq.py:57, in __init__
    super().__init__(
  /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:514, in __init__
    self._move_model_to_device(model, args.device)
  /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:757, in _move_model_to_device
    model = model.to(device)
  /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1160, in to
    return self._apply(convert)
  /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810, in _apply (repeated for each nested submodule)
    module._apply(fn)
  /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:833, in _apply
    param_applied = fn(param)
  /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
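Note that the fatal error is the RuntimeError at the end ("CUDA-capable device(s) is/are busy or unavailable"), raised when the Trainer moves the model to the GPU; the kernel-version message is only a warning. A minimal sketch to reproduce the failure outside finetune_hf.py, assuming the same single-GPU environment as the command above:

    import os
    import torch

    # Same visibility settings as the failing command
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    print("torch.cuda.is_available() =", torch.cuda.is_available())
    print("device count =", torch.cuda.device_count())

    # This is essentially what Trainer._move_model_to_device does; if the GPU
    # is busy or held by another process, this reproduces the same RuntimeError
    try:
        x = torch.zeros(1).to("cuda:0")
        print("moved tensor to", x.device)
    except RuntimeError as e:
        print("CUDA error reproduced:", e)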


I followed this tutorial: https://github.com/THUDM/ChatGLM3/blob/main/finetune_demo/lora_finetune.ipynb


alexhmyang · Apr 25 '24

Regarding "Could not find cuda drivers on your machine, GPU will not be used.": CUDA was not found in your environment.

wenmengzhou · Apr 25 '24

Detected kernel version 4.19.91, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

alexhmyang · Apr 26 '24
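This kernel-version message is emitted by the transformers Trainer at startup and is only a warning; it does not abort the run. A minimal sketch of the same check (the 5.5.0 threshold is the one quoted in the warning; the parsing here is illustrative, not the library's exact code):

    import platform

    # Kernel release string, e.g. "4.19.91-..." on this machine
    release = platform.release()
    version = tuple(int(p) for p in release.split("-")[0].split(".")[:3])

    # 5.5.0 is the minimum quoted in the transformers warning
    if version < (5, 5, 0):
        print(f"kernel {release} is below 5.5.0: training may hang,"
              " but this is a warning, not the fatal error")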

This is not a CUDA environment problem. I'm running your official GPU CUDA image directly, so how could it fail to find CUDA? Either that, or your 3090s are frequently down.

alexhmyang · Apr 26 '24
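For anyone debugging the same "busy or unavailable" error: it typically means the GPU is already held by another process, or is in an exclusive compute mode, which matches the suspicion above that the shared 3090 was occupied. One way to inspect this from Python is through NVML; a sketch assuming the pynvml package is available in the image:

    # Assumption: pynvml is installed (pip install nvidia-ml-py)
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # In EXCLUSIVE_PROCESS compute mode only one process may hold the GPU;
    # a second process gets "CUDA-capable device(s) is/are busy or unavailable"
    print("compute mode:", pynvml.nvmlDeviceGetComputeMode(handle))

    # List processes already occupying the device
    for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
        print("pid", proc.pid, "using", proc.usedGpuMemory, "bytes")

    pynvml.nvmlShutdown()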

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] · May 27 '24

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] · Jun 01 '24