Loading extension module slstm_HS128BS8NH4NS4DBfDRbDWbDGbDSbDAfNG4SA1GRCV0GRC0d0FCV0FC0d0...
{'verbose': True, 'with_cuda': True, 'extra_ldflags': ['-L/home/junlong/anaconda3/envs/xlstm/lib', '-lcublas'], 'extra_cflags': ['-DSLSTM_HIDDEN_SIZE=128', '-DSLSTM_BATCH_SIZE=8', '-DSLSTM_NUM_HEADS=4', '-DSLSTM_NUM_STATES=4', '-DSLSTM_DTYPE_B=float', '-DSLSTM_DTYPE_R=nv_bfloat16', '-DSLSTM_DTYPE_W=nv_bfloat16', '-DSLSTM_DTYPE_G=nv_bfloat16', '-DSLSTM_DTYPE_S=nv_bfloat16', '-DSLSTM_DTYPE_A=float', '-DSLSTM_NUM_GATES=4', '-DSLSTM_SIMPLE_AGG=true', '-DSLSTM_GRADIENT_RECURRENT_CLIPVAL_VALID=false', '-DSLSTM_GRADIENT_RECURRENT_CLIPVAL=0.0', '-DSLSTM_FORWARD_CLIPVAL_VALID=false', '-DSLSTM_FORWARD_CLIPVAL=0.0', '-U__CUDA_NO_HALF_OPERATORS', '-U__CUDA_NO_HALF_CONVERSIONS', '-U__CUDA_NO_BFLOAT16_OPERATORS', '-U__CUDA_NO_BFLOAT16_CONVERSIONS', '-U__CUDA_NO_BFLOAT162_OPERATORS__', '-U__CUDA_NO_BFLOAT162_CONVERSIONS__'], 'extra_cuda_cflags': ['-Xptxas="-v"', '-gencode', 'arch=compute_80,code=compute_80', '-res-usage', '--use_fast_math', '-O3', '-Xptxas -O3', '--extra-device-vectorization', '-DSLSTM_HIDDEN_SIZE=128', '-DSLSTM_BATCH_SIZE=8', '-DSLSTM_NUM_HEADS=4', '-DSLSTM_NUM_STATES=4', '-DSLSTM_DTYPE_B=float', '-DSLSTM_DTYPE_R=nv_bfloat16', '-DSLSTM_DTYPE_W=nv_bfloat16', '-DSLSTM_DTYPE_G=nv_bfloat16', '-DSLSTM_DTYPE_S=nv_bfloat16', '-DSLSTM_DTYPE_A=float', '-DSLSTM_NUM_GATES=4', '-DSLSTM_SIMPLE_AGG=true', '-DSLSTM_GRADIENT_RECURRENT_CLIPVAL_VALID=false', '-DSLSTM_GRADIENT_RECURRENT_CLIPVAL=0.0', '-DSLSTM_FORWARD_CLIPVAL_VALID=false', '-DSLSTM_FORWARD_CLIPVAL=0.0', '-U__CUDA_NO_HALF_OPERATORS', '-U__CUDA_NO_HALF_CONVERSIONS', '-U__CUDA_NO_BFLOAT16_OPERATORS', '-U__CUDA_NO_BFLOAT16_CONVERSIONS', '-U__CUDA_NO_BFLOAT162_OPERATORS__', '-U__CUDA_NO_BFLOAT162_CONVERSIONS__']}
Using /home/junlong/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/junlong/.cache/torch_extensions/py311_cu121/slstm_HS128BS8NH4NS4DBfDRbDWbDGbDSbDAfNG4SA1GRCV0GRC0d0FCV0FC0d0/build.ninja...
Building extension module slstm_HS128BS8NH4NS4DBfDRbDWbDGbDSbDAfNG4SA1GRCV0GRC0d0FCV0FC0d0...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module slstm_HS128BS8NH4NS4DBfDRbDWbDGbDSbDAfNG4SA1GRCV0GRC0d0FCV0FC0d0...

How can I solve this problem?
I encountered the same problem.
@kpoeppel Hi, could you advise how to solve this?
Me too. How can I solve it?
I have this problem as well. It only occurs when the sLSTM module is included in the xLSTM stack; using only mLSTM works. I tested this on the Lightning platform with an NVIDIA L4 GPU.
Same here
Same problem here (Linux, Ubuntu)!
A "work around" is to set backend="vanilla" in sLSTMLayerConfig, however this will ofc result in very slow learning
Is it stuck during loading? I see no error in what you shared. If the module fails to load, you can clear your torch_extensions cache (typically $HOME/.cache/torch_extensions). In any case, make sure your GPU has compute capability >= 8.0 (Ampere), which is required for bfloat16.
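If it helps, here is a small sketch of those two checks in Python. The cache path is only the usual default; adjust it if TORCH_EXTENSIONS_DIR points elsewhere on your system:

```python
import os
import shutil
import torch

# 1) Verify the GPU supports bfloat16 (compute capability >= 8.0, i.e. Ampere or newer).
major, minor = torch.cuda.get_device_capability()
print(f"GPU compute capability: {major}.{minor}")
assert (major, minor) >= (8, 0), "sLSTM CUDA kernels need bfloat16 (Ampere or newer)"

# 2) Clear the torch_extensions cache so the sLSTM kernel is rebuilt from scratch.
cache_dir = os.environ.get(
    "TORCH_EXTENSIONS_DIR",
    os.path.expanduser("~/.cache/torch_extensions"),
)
if os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)
    print(f"Removed {cache_dir}")
```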
I made it work on my machine by changing the ninja version.
The following combination of versions currently works for me (though training is much slower on an A100 than transformers/Mamba/a plain torch LSTM); a quick sanity-check script is sketched after the list:
- Ubuntu
- Python 3.10.14
- cuda 11.8
- cudatoolkit=11.8.0
- cudatoolkit-dev=11.7.0
- pytorch 2.4.1
- ninja 1.11.1.1
- gcc 11.2.0
- gxx_impl_linux-64 11.2.0
- gxx_linux-64 11.2.0
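A small, hypothetical sanity-check script for comparing a local setup against the list above (the helper name tool_version is just something I made up for this snippet):

```python
import subprocess
import sys
import torch

def tool_version(cmd):
    """Return the first line of a CLI tool's version output, or a note if it is missing."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return out.stdout.strip().splitlines()[0]
    except (OSError, subprocess.CalledProcessError):
        return "not found"

print("python :", sys.version.split()[0])
print("pytorch:", torch.__version__)
print("cuda   :", torch.version.cuda)  # CUDA version this PyTorch build targets
print("ninja  :", tool_version(["ninja", "--version"]))
print("gcc    :", tool_version(["gcc", "--version"]))
print("bf16 ok:", torch.cuda.is_available() and torch.cuda.is_bf16_supported())
```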