RWKV-Runner
After setting RWKV_CUDA_ON=1 and running, a warning appears and there is no performance improvement.
export RWKV_CUDA_ON=1
python3 backend-python/main.py --host 0.0.0.0 --port 7070 --webui
curl http://127.0.0.1:7070/switch-model -X POST -H "Content-Type: application/json" -d '{"model":"/models/RWKV-v5-Eagle-World-7B-v2-20240128-ctx4096.pth","strategy":"cuda fp16","deploy":"false", "customCuda": "true"}'
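Note that the curl payload above passes `deploy` and `customCuda` as the strings `"false"` and `"true"` rather than JSON booleans. If the backend expects real booleans (an assumption, not verified against the API), building the body programmatically avoids the ambiguity:

```python
import json

# Hypothetical builder for the /switch-model request body shown above.
# deploy/customCuda are real JSON booleans here instead of the strings
# "false"/"true" used in the curl command (assumption: the server accepts both).
payload = {
    "model": "/models/RWKV-v5-Eagle-World-7B-v2-20240128-ctx4096.pth",
    "strategy": "cuda fp16",
    "deploy": False,
    "customCuda": True,
}
body = json.dumps(payload)
print(body)
```

The resulting `body` can be sent with any HTTP client in place of curl's `-d` argument.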
Partial run output:
Strategy Devices: {'cuda'}
state cache enabled
Using /RWKV-Runner-1.6.8/backend-python as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /RWKV-Runner-1.6.8/backend-python/wkv_cuda/build.ninja...
Building extension module wkv_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv_cuda...
RWKV_JIT_ON 1 RWKV_CUDA_ON 1 RESCALE_LAYER 6
Loading /models/RWKV-v5-Eagle-World-7B-v2-20240128-ctx4096.pth ...
Model detected: v5.2
Strategy: (total 32+1=33 layers)
* cuda [float16, float16], store 33 layers
(intermediate output omitted)
Using /RWKV-Runner-1.6.8/backend-python as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /RWKV-Runner-1.6.8/backend-python/rwkv5/build.ninja...
Building extension module rwkv5...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF rwkv5_op.o.d -DTORCH_EXTENSION_NAME=rwkv5 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /RWKV-Runner-1.6.8/backend-python/rwkv_pip/cuda/rwkv5_op.cpp -o rwkv5_op.o
[2/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=rwkv5 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -res-usage --use_fast_math -O3 -Xptxas -O3 --extra-device-vectorization -D_N_=64 -std=c++14 -c /RWKV-Runner-1.6.8/backend-python/rwkv_pip/cuda/rwkv5.cu -o rwkv5.cuda.o
ptxas info : 3 bytes gmem
ptxas info : Compiling entry function '_Z14kernel_forwardIfEviiiiPfPKT_S3_S3_PKfS3_PS1_' for 'sm_80'
ptxas info : Function properties for _Z14kernel_forwardIfEviiiiPfPKT_S3_S3_PKfS3_PS1_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 102 registers, 1024 bytes smem, 424 bytes cmem[0]
ptxas info : Compiling entry function '_Z14kernel_forwardIN3c104HalfEEviiiiPfPKT_S5_S5_PKfS5_PS3_' for 'sm_80'
ptxas info : Function properties for _Z14kernel_forwardIN3c104HalfEEviiiiPfPKT_S5_S5_PKfS5_PS3_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 102 registers, 1024 bytes smem, 424 bytes cmem[0]
ptxas info : Compiling entry function '_Z14kernel_forwardIN3c108BFloat16EEviiiiPfPKT_S5_S5_PKfS5_PS3_' for 'sm_80'
ptxas info : Function properties for _Z14kernel_forwardIN3c108BFloat16EEviiiiPfPKT_S5_S5_PKfS5_PS3_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 102 registers, 1024 bytes smem, 424 bytes cmem[0]
[3/3] c++ rwkv5_op.o rwkv5.cuda.o -shared -L/usr/local/lib/python3.10/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o rwkv5.so
Loading extension module rwkv5...
/RWKV-Runner-1.6.8/backend-python/rwkv_pip/model.py:1824: UserWarning: operator() profile_node %318 : int = prim::profile_ivalue(%316)
does not have profile information (Triggered internally at ../torch/csrc/jit/codegen/cuda/graph_fuser.cpp:105.)
r, k, v, g, xxx, ss = self.v5_2_before(
Is this "does not have profile information" warning caused by something missing?
Although it is only a warning, I suspect it is the reason the performance has not improved.
That warning is harmless. Compiling the custom operator mainly speeds up int8 inference and prompt preprocessing; if you run the model fully in fp16, inference speed may not improve noticeably.
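The reply above comes down to the `strategy` string passed to /switch-model. A minimal sketch, assuming the strategy syntax of the rwkv pip package (where a trailing `i8` quantizes weights to int8), of the two cases being compared:

```python
# Strategy strings in rwkv's syntax (assumption: the backend forwards these
# unchanged to the rwkv pip package). The compiled CUDA kernel mainly helps
# the int8 path, per the reply above.
FULL_FP16 = "cuda fp16"    # what the question used: weights and compute in fp16
INT8 = "cuda fp16i8"       # weights quantized to int8, computed in fp16


def uses_int8(strategy: str) -> bool:
    """Rough check: does the strategy keep weights in int8?"""
    return "i8" in strategy


print(uses_int8(FULL_FP16), uses_int8(INT8))
```

So with `"strategy": "cuda fp16"` in the request above, little speedup from `customCuda` is expected; an int8 strategy is where it should show.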