The code is missing a definition for `transpose_matmul_248_kernel`.
The problem is in the library code being used: the error occurs in `quantization.py`, which is part of the Hugging Face model cache.
At `quantization.py`, line 265, it looks like `transpose_matmul_248_kernel` was simply never imported — I checked, and there is no import for it at the top of the file.
This is triggered during DeepSpeed fine-tuning; it fails when execution reaches `model_engine.backward(loss)`:
"""
File "fine_tune.py", line 117, in
model_engine.backward(loss)
File "/home/zeal/pytorch-venv/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/zeal/pytorch-venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1796, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/home/zeal/pytorch-venv/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/zeal/pytorch-venv/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1923, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/home/zeal/pytorch-venv/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 62, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/home/zeal/pytorch-venv/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/zeal/pytorch-venv/lib/python3.8/site-packages/torch/autograd/init.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/zeal/pytorch-venv/lib/python3.8/site-packages/torch/autograd/function.py", line 274, in apply
return user_fn(self, *args)
File "/home/zeal/pytorch-venv/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 123, in decorate_bwd
return bwd(*args, **kwargs)
File "/home/zeal/.cache/huggingface/modules/transformers_modules/fnlp/moss-moon-003-sft-plugin-int4/353c499f7415575ba217704f3f28a1e817eb7487/quantization.py", line 292, in backward
grad_input = transpose_matmul248(grad_output, qweight, scales, qzeros, g_idx, bits, maxq)
File "/home/zeal/.cache/huggingface/modules/transformers_modules/fnlp/moss-moon-003-sft-plugin-int4/353c499f7415575ba217704f3f28a1e817eb7487/quantization.py", line 265, in transpose_matmul248
transpose_matmul_248_kernel[grid](input, qweight, output,
NameError: name 'transpose_matmul_248_kernel' is not defined
Folks, I suspect this is just a typo: `transpose_matmul_248_kernel` should be the `trans_matmul_248_kernel` defined at line 168. Renaming the function call should fix it.
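If that guess is right, the fix at the call site (traceback line 265) is a one-word rename so the call matches the kernel actually defined near line 168 — sketched here against the line quoted in the traceback:

```diff
- transpose_matmul_248_kernel[grid](input, qweight, output,
+ trans_matmul_248_kernel[grid](input, qweight, output,
```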
On to the next bug 😊
@lipengyuer That doesn't work for me. I edited both the file under `.cache` and the corresponding `quantization.py` under `models`, but when I rerun the script, the copy in `.cache` reverts to the original.
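A likely explanation (an assumption, not confirmed here): when the model is loaded with `trust_remote_code=True`, transformers re-copies `quantization.py` from the downloaded snapshot into the `~/.cache/huggingface/modules/transformers_modules/...` tree on each load, overwriting manual edits. A workaround that avoids editing cached files at all is to alias the missing name onto the module at runtime, after the model has been loaded. Below is a minimal, self-contained sketch of the aliasing idea — the `quant` module here is a stand-in; in practice you would fetch the real module from `sys.modules` after `from_pretrained` returns:

```python
import types

# Stand-in for the cached quantization.py module (hypothetical; the real one
# lives under ~/.cache/huggingface/modules/transformers_modules/... and can be
# retrieved from sys.modules after the model is loaded).
quant = types.ModuleType("quantization")

def trans_matmul_248_kernel(*args, **kwargs):
    """Stand-in for the Triton kernel defined around line 168."""
    return "called"

quant.trans_matmul_248_kernel = trans_matmul_248_kernel

# The backward pass looks up transpose_matmul_248_kernel, which is missing,
# producing the NameError from the traceback:
assert not hasattr(quant, "transpose_matmul_248_kernel")

# Alias the existing kernel under the expected name at runtime, so the
# cached file never needs to be edited (it gets re-copied on each run):
quant.transpose_matmul_248_kernel = quant.trans_matmul_248_kernel

print(quant.transpose_matmul_248_kernel())  # prints "called"
```

The same two-line patch (look up the module, assign the alias) can be applied in `fine_tune.py` right after the model is constructed and before `model_engine.backward(loss)` runs.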