RWKV-LM-LoRA

Indexing.cu:1141: indexSelectLargeIndex: block: [202,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

Open cdg1921 opened this issue 1 year ago • 0 comments

While fine-tuning the 3B model, I got the following error:

../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [202,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [202,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [202,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [202,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [202,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [202,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [202,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f56e3a67457 in /opt/conda/envs/rwkv38/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f56e3a313ec in /opt/conda/envs/rwkv38/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f570eadbc64 in /opt/conda/envs/rwkv38/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1e0dc (0x7f570eab30dc in /opt/conda/envs/rwkv38/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f570eab6054 in /opt/conda/envs/rwkv38/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d6e23 (0x7f57399a5e23 in /opt/conda/envs/rwkv38/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::copy_tensor_metadata_except_version_counter(c10::TensorImpl const*, c10::TensorImpl*, bool) + 0x41 (0x7f56e3a491a1 in /opt/conda/envs/rwkv38/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::copy_tensor_metadata(c10::TensorImpl const*, c10::TensorImpl*, c10::VariableVersion const&, bool) + 0x14 (0x7f56e3a49214 in /opt/conda/envs/rwkv38/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: + 0x4404d (0x7f56e3a5304d in /opt/conda/envs/rwkv38/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #9: + 0x489ab33 (0x7f571342db33 in /opt/conda/envs/rwkv38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: THPVariable_set_data(THPVariable*, _object*, void*) + 0x6f (0x7f5739bfa77f in /opt/conda/envs/rwkv38/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #53: __libc_start_main + 0xf3 (0x7f575b7b3083 in /lib/x86_64-linux-gnu/libc.so.6)
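
The log itself warns that the stack trace may be misleading because CUDA errors are reported asynchronously, and it suggests passing CUDA_LAUNCH_BLOCKING=1. A minimal sketch of doing that from Python follows. It assumes train.py is launched as a subprocess from the repository directory; the helper names `build_debug_env` and `run_training` are hypothetical and not part of RWKV-LM-LoRA.

```python
import os
import subprocess
import sys


def build_debug_env(base_env=None):
    """Copy the environment and make CUDA kernel launches synchronous."""
    env = dict(os.environ if base_env is None else base_env)
    # With this set, the failing kernel is reported at the call that launched it.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_training(train_args):
    """Hypothetical wrapper: run train.py with the debug environment."""
    cmd = [sys.executable, "train.py", *train_args]
    return subprocess.run(cmd, env=build_debug_env(), check=False)
```

Calling `run_training([...])` with the same flags as the failing run should give a trace that points closer to the Python line that triggered the assertion, at the cost of slower execution.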

My configuration is:
CUDA_VISIBLE_DEVICES=0 python train.py \
--load_model "/usr/local/RWKV-LM-LoRA/RWKV-4-Raven-3B-v12-Eng49%-Chn49%-Jpn1%-Other1%-20230527-ctx4096.pth" \
--proj_dir "/usr/local/RWKV-LM-LoRA/modelcheckpoint" \
--data_file "/usr/local/RWKV-LM-LoRA/bininx_data/dev_1_text_document" \
--data_type binidx \
--vocab_size 50277 \
--ctx_len 4096 \
--accumulate_grad_batches 4 \
--epoch_steps 32 \
--epoch_count 2 \
--epoch_begin 0 \
--epoch_save 2 \
--micro_bsz 2 \
--n_layer 32 \
--n_embd 2560 \
--pre_ffn 0 \
--head_qk 0 \
--lr_init 1e-5 \
--lr_final 1e-5 \
--warmup_steps 0 \
--beta1 0.9 \
--beta2 0.999 \
--adam_eps 1e-8 \
--accelerator gpu \
--devices 1 \
--precision bf16 \
--strategy deepspeed_stage_2 \
--grad_cp 1 \
--lora \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.01 \
--lora_parts=att,ffn,time,ln

I have gone through a lot of material and nothing has resolved it. Has anyone else run into this problem? Is there a way to fix it?
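
The assertion `srcIndex < srcSelectDimSize` comes from PyTorch's index-select CUDA kernel, which embedding lookups typically go through; it fires when an index is not smaller than the size of the dimension being indexed. One thing that may be worth checking, as an assumption and not a confirmed diagnosis, is whether every token id in the binidx data is below the `--vocab_size 50277` passed on the command line, since data tokenized with a larger vocabulary would produce this failure. The sketch below works on a plain Python list of ids; `find_out_of_range` is a hypothetical helper, the sample ids are made up, and reading the ids out of the binidx file is left out.

```python
def find_out_of_range(token_ids, vocab_size):
    """Return (position, token_id) pairs that a vocab_size-row embedding cannot index."""
    return [
        (pos, tok)
        for pos, tok in enumerate(token_ids)
        if tok < 0 or tok >= vocab_size
    ]


VOCAB_SIZE = 50277  # value passed as --vocab_size in the failing run

# Made-up ids for illustration only; real ids would come from the dataset.
sample = [0, 187, 50276, 50277, 65535]
bad = find_out_of_range(sample, VOCAB_SIZE)
print(bad)  # [(3, 50277), (4, 65535)]
```

If a scan like this reports any ids on the real data, the dataset and the model's vocabulary do not match, and re-tokenizing the data with the tokenizer that belongs to the checkpoint would be the next thing to try.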

Aug 18 '23 03:08 cdg1921
