Loading extension module slstm_HS128BS8NH4NS4DBfDRbDWbDGbDSbDAfNG4SA1GRCV0GRC0d0FCV0FC0d0...
{'verbose': True, 'with_cuda': True, 'extra_ldflags': ['-L/home/junlong/anaconda3/envs/xlstm/lib', '-lcublas'], 'extra_cflags': ['-DSLSTM_HIDDEN_SIZE=128', '-DSLSTM_BATCH_SIZE=8', '-DSLSTM_NUM_HEADS=4', '-DSLSTM_NUM_STATES=4', '-DSLSTM_DTYPE_B=float', '-DSLSTM_DTYPE_R=nv_bfloat16', '-DSLSTM_DTYPE_W=nv_bfloat16', '-DSLSTM_DTYPE_G=nv_bfloat16', '-DSLSTM_DTYPE_S=nv_bfloat16', '-DSLSTM_DTYPE_A=float', '-DSLSTM_NUM_GATES=4', '-DSLSTM_SIMPLE_AGG=true', '-DSLSTM_GRADIENT_RECURRENT_CLIPVAL_VALID=false', '-DSLSTM_GRADIENT_RECURRENT_CLIPVAL=0.0', '-DSLSTM_FORWARD_CLIPVAL_VALID=false', '-DSLSTM_FORWARD_CLIPVAL=0.0', '-U__CUDA_NO_HALF_OPERATORS', '-U__CUDA_NO_HALF_CONVERSIONS', '-U__CUDA_NO_BFLOAT16_OPERATORS', '-U__CUDA_NO_BFLOAT16_CONVERSIONS', '-U__CUDA_NO_BFLOAT162_OPERATORS__', '-U__CUDA_NO_BFLOAT162_CONVERSIONS__'], 'extra_cuda_cflags': ['-Xptxas="-v"', '-gencode', 'arch=compute_80,code=compute_80', '-res-usage', '--use_fast_math', '-O3', '-Xptxas -O3', '--extra-device-vectorization', '-DSLSTM_HIDDEN_SIZE=128', '-DSLSTM_BATCH_SIZE=8', '-DSLSTM_NUM_HEADS=4', '-DSLSTM_NUM_STATES=4', '-DSLSTM_DTYPE_B=float', '-DSLSTM_DTYPE_R=nv_bfloat16', '-DSLSTM_DTYPE_W=nv_bfloat16', '-DSLSTM_DTYPE_G=nv_bfloat16', '-DSLSTM_DTYPE_S=nv_bfloat16', '-DSLSTM_DTYPE_A=float', '-DSLSTM_NUM_GATES=4', '-DSLSTM_SIMPLE_AGG=true', '-DSLSTM_GRADIENT_RECURRENT_CLIPVAL_VALID=false', '-DSLSTM_GRADIENT_RECURRENT_CLIPVAL=0.0', '-DSLSTM_FORWARD_CLIPVAL_VALID=false', '-DSLSTM_FORWARD_CLIPVAL=0.0', '-U__CUDA_NO_HALF_OPERATORS', '-U__CUDA_NO_HALF_CONVERSIONS', '-U__CUDA_NO_BFLOAT16_OPERATORS', '-U__CUDA_NO_BFLOAT16_CONVERSIONS', '-U__CUDA_NO_BFLOAT162_OPERATORS__', '-U__CUDA_NO_BFLOAT162_CONVERSIONS__']}
Using /home/junlong/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/junlong/.cache/torch_extensions/py311_cu121/slstm_HS128BS8NH4NS4DBfDRbDWbDGbDSbDAfNG4SA1GRCV0GRC0d0FCV0FC0d0/build.ninja...
Building extension module slstm_HS128BS8NH4NS4DBfDRbDWbDGbDSbDAfNG4SA1GRCV0GRC0d0FCV0FC0d0...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module slstm_HS128BS8NH4NS4DBfDRbDWbDGbDSbDAfNG4SA1GRCV0GRC0d0FCV0FC0d0...

How can I solve this problem?
I encountered the same problem.
@kpoeppel Hi, could you advise how to solve this?
Me too. How can I solve it?
I have this problem as well. It only occurs when the sLSTM module is included in the xLSTM stack; using only mLSTM works. I tested this on the Lightning platform with an NVIDIA L4 GPU.
Same here
Same problem here (Linux, Ubuntu)!
A "work around" is to set backend="vanilla" in sLSTMLayerConfig, however this will ofc result in very slow learning
Is it stuck during loading? I see no error in what you shared. If the module fails to load, you can clear your torch_extensions cache (typically $HOME/.cache/torch_extensions). In any case, make sure your GPU has compute capability >= 8.0 (Ampere), which is required for bfloat16.
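If it helps, here is a small sketch of those two checks in Python. The cache path is only the usual default; adjust it if TORCH_EXTENSIONS_DIR points elsewhere on your system:

```python
import os
import shutil
import torch

# 1) Verify the GPU supports bfloat16 (compute capability >= 8.0, i.e. Ampere or newer).
major, minor = torch.cuda.get_device_capability()
print(f"GPU compute capability: {major}.{minor}")
assert (major, minor) >= (8, 0), "sLSTM CUDA kernels need bfloat16 (Ampere or newer)"

# 2) Clear the torch_extensions cache so the sLSTM kernel is rebuilt from scratch.
cache_dir = os.environ.get(
    "TORCH_EXTENSIONS_DIR",
    os.path.expanduser("~/.cache/torch_extensions"),
)
if os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)
    print(f"Removed {cache_dir}")
```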
I made it work on my machine by changing the ninja version.
The following combination of versions currently works for me (though training is much slower on an A100 than transformers/Mamba/a plain torch LSTM); a quick sanity-check script is sketched after the list:
- Ubuntu
- Python 3.10.14
- cuda 11.8
- cudatoolkit=11.8.0
- cudatoolkit-dev=11.7.0
- pytorch 2.4.1
- ninja 1.11.1.1
- gcc 11.2.0
- gxx_impl_linux-64 11.2.0
- gxx_linux-64 11.2.0
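A small, hypothetical sanity-check script for comparing a local setup against the list above (the helper name tool_version is just something I made up for this snippet):

```python
import subprocess
import sys
import torch

def tool_version(cmd):
    """Return the first line of a CLI tool's version output, or a note if it is missing."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return out.stdout.strip().splitlines()[0]
    except (OSError, subprocess.CalledProcessError):
        return "not found"

print("python :", sys.version.split()[0])
print("pytorch:", torch.__version__)
print("cuda   :", torch.version.cuda)  # CUDA version this PyTorch build targets
print("ninja  :", tool_version(["ninja", "--version"]))
print("gcc    :", tool_version(["gcc", "--version"]))
print("bf16 ok:", torch.cuda.is_available() and torch.cuda.is_bf16_supported())
```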