[BUG]: PaddingMode.NEG_INF not working under float8_e4m3

Open iori2333 opened this issue 1 week ago • 1 comment

Version

1.0.0

Version

13.1 sm_120

Which installation method(s) does this occur on?

Pip

Describe the bug.

I'm writing a softmax kernel using cuTile:

import cuda.tile as ct


@ct.kernel
def softmax(
    input: ct.Array,  # [b, s]
    output: ct.Array,  # [b, s]
    b: ct.Constant[int],
    ts: ct.Constant[int],
):
    bid = ct.bid(0)
    blocks = ct.num_blocks(0)

    for idx in range(bid, b, blocks):
        line = ct.load(
            input,
            index=(idx, 0),
            shape=(1, ts),
            padding_mode=ct.PaddingMode.NEG_INF,
            allow_tma=True,
        ).astype(ct.float32)

        line = line - ct.max(line, axis=-1, keepdims=True)
        e_line = ct.exp(line)
        o_line = e_line / ct.sum(e_line, axis=-1, keepdims=True)
        o_line = o_line.astype(input.dtype)  # type: ignore
        ct.store(output, index=(idx, 0), tile=o_line)

This works fine when the input dtype is bfloat16. However, when the input dtype is float8_e4m3, cuTile throws an error:

Traceback (most recent call last):
  File "/workspace/.venv/lib/python3.12/site-packages/cuda/tile/_compile.py", line 362, in compile_cubin
    subprocess.run(command + flags, env=env, check=True, capture_output=True,
  File "/usr/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/local/cuda/bin/tileiras', '/tmp/tmpxqnzdmg7/softmaxtcbkty38.bytecode', '-o', '/tmp/tmpxqnzdmg7/softmaxtcbkty38.cubin', '--gpu-name', 'sm_120', '-O3', '--lineinfo']' died with <Signals.SIGILL: 4>.

During handling of the above exception, another exception occurred:
   ...
  File "/workspace/.venv/lib/python3.12/site-packages/cuda/tile/_compile.py", line 365, in compile_cubin
    raise TileCompilerExecutionError(e.returncode, e.stderr.decode(), ' '.join(flags),
cuda.tile._exception.TileCompilerExecutionError: Return code -4

Unknown location

After some debugging, I found that the issue is related to the padding_mode passed to ct.load. cuTile works only if padding_mode is set to ZERO, NEG_ZERO, or UNDETERMINED.
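The padding modes that do compile are not drop-in replacements for this kernel, because the padded elements take part in the max and sum reductions. A minimal pure-Python sketch (not cuTile code; the `softmax` helper and the sample values are made up for illustration) of why the padding value matters when the tile is wider than the row:

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


valid = [1.0, 2.0, 3.0]  # the real row (s = 3)
pad = 1                  # the tile is rounded up to 4 elements

reference = softmax(valid)
neg_inf_padded = softmax(valid + [float("-inf")] * pad)[: len(valid)]
zero_padded = softmax(valid + [0.0] * pad)[: len(valid)]

print(reference)
print(neg_inf_padded)  # matches reference: exp(-inf) contributes 0 to the sum
print(zero_padded)     # smaller values: the padded 0.0 adds exp(0 - max) to the sum
```

So switching to ZERO makes the compile succeed, but the results would only be correct when the row length already equals the tile size.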

Minimum reproducible example

import torch


def launch_softmax(
    input: torch.Tensor,
    output: torch.Tensor | None = None,
) -> torch.Tensor:
    if output is None:
        output = torch.empty_like(input)

    b, s = input.shape
    grid = (min(128, b),)
    tile_size = next_power_of_2(s)

    ct.launch(
        torch.cuda.current_stream(),
        grid,
        softmax,
        (input, output, b, tile_size),
    )

    return output
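The `next_power_of_2` helper is not included in the report. A minimal sketch of what it presumably does, assuming it returns the smallest power of two that is greater than or equal to its argument (this definition is a guess, not the reporter's code):

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of two that is >= n (assumed behaviour of the helper)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # (n - 1).bit_length() is the number of bits needed for n - 1,
    # so shifting 1 by that amount rounds n up to a power of two.
    return 1 << (n - 1).bit_length()
```

Under this assumption the tile is wider than the row whenever `s` is not a power of two, which is when the padding mode comes into play.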

Relevant log output

Traceback (most recent call last):
  File "/workspace/.venv/lib/python3.12/site-packages/cuda/tile/_compile.py", line 362, in compile_cubin
    subprocess.run(command + flags, env=env, check=True, capture_output=True,
  File "/usr/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/local/cuda/bin/tileiras', '/tmp/tmpxqnzdmg7/softmaxtcbkty38.bytecode', '-o', '/tmp/tmpxqnzdmg7/softmaxtcbkty38.cubin', '--gpu-name', 'sm_120', '-O3', '--lineinfo']' died with <Signals.SIGILL: 4>.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/scripts/test_softmax.py", line 78, in <module>
    bench()
  File "/workspace/scripts/test_softmax.py", line 34, in bench
    bench_fn(
  File "/workspace/kernels/utils/__init__.py", line 15, in bench_fn
    fn(*args, **kwargs)
  File "/workspace/scripts/test_softmax.py", line 22, in do_test
    launcher(input, output)
  File "/workspace/kernels/cutile/softmax.py", line 48, in launch_softmax
    ct.launch(
  File "/workspace/.venv/lib/python3.12/site-packages/cuda/tile/_compile.py", line 223, in __call__
    lib = compile_tile(self.pyfunc, pyfunc_args, self.compiler_options, tile_context)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.12/site-packages/cuda/tile/_compile.py", line 70, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.12/site-packages/cuda/tile/_compile.py", line 211, in compile_tile
    raise e
  File "/workspace/.venv/lib/python3.12/site-packages/cuda/tile/_compile.py", line 204, in compile_tile
    cubin_file = compile_cubin(f.name, compiler_options, sm_arch,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.12/site-packages/cuda/tile/_compile.py", line 365, in compile_cubin
    raise TileCompilerExecutionError(e.returncode, e.stderr.decode(), ' '.join(flags),
cuda.tile._exception.TileCompilerExecutionError: Return code -4

Unknown location
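The "Return code -4" in the final exception and the SIGILL in the first one describe the same event: on POSIX, `subprocess` reports a child killed by signal N as return code -N. In other words, tileiras itself crashed rather than rejecting the kernel with a diagnostic. A small sketch of that mapping (`describe_returncode` is a hypothetical helper, not part of cuTile):

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # subprocess reports "killed by signal N" as -N on POSIX.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


print(describe_returncode(-4))
```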

Full env printout

<details><summary>Click here to see environment details</summary><pre>
     
     ***git***
     Not inside a git repository
     
     ***OS Information***
     DISTRIB_ID=Ubuntu
     DISTRIB_RELEASE=24.04
     DISTRIB_CODENAME=noble
     DISTRIB_DESCRIPTION="Ubuntu 24.04.3 LTS"
     PRETTY_NAME="Ubuntu 24.04.3 LTS"
     NAME="Ubuntu"
     VERSION_ID="24.04"
     VERSION="24.04.3 LTS (Noble Numbat)"
     VERSION_CODENAME=noble
     ID=ubuntu
     ID_LIKE=debian
     HOME_URL="https://www.ubuntu.com/"
     SUPPORT_URL="https://help.ubuntu.com/"
     BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
     PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
     UBUNTU_CODENAME=noble
     LOGO=ubuntu-logo
     Linux mayu 6.17.11+deb14-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.17.11-1 (2025-12-07) x86_64 x86_64 x86_64 GNU/Linux
     
     ***GPU Information***
     Fri Dec 12 05:09:48 2025
     +-----------------------------------------------------------------------------------------+
     | NVIDIA-SMI 590.44.01              Driver Version: 590.44.01      CUDA Version: 13.1     |
     +-----------------------------------------+------------------------+----------------------+
     | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
     | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
     |                                         |                        |               MIG M. |
     |=========================================+========================+======================|
     |   0  NVIDIA GeForce RTX 5060 Ti     On  |   00000000:01:00.0 Off |                  N/A |
     |  0%   44C    P8              8W /  180W |       2MiB /  16311MiB |      0%      Default |
     |                                         |                        |                  N/A |
     +-----------------------------------------+------------------------+----------------------+
     
     +-----------------------------------------------------------------------------------------+
     | Processes:                                                                              |
     |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
     |        ID   ID                                                               Usage      |
     |=========================================================================================|
     |  No running processes found                                                             |
     +-----------------------------------------------------------------------------------------+
     
     ***CPU***
     Architecture:                            x86_64
     CPU op-mode(s):                          32-bit, 64-bit
     Address sizes:                           48 bits physical, 48 bits virtual
     Byte Order:                              Little Endian
     CPU(s):                                  32
     On-line CPU(s) list:                     0-31
     Vendor ID:                               AuthenticAMD
     Model name:                              AMD Ryzen 9 7940HX with Radeon Graphics
     CPU family:                              25
     Model:                                   97
     Thread(s) per core:                      2
     Core(s) per socket:                      16
     Socket(s):                               1
     Stepping:                                2
     Frequency boost:                         enabled
     CPU(s) scaling MHz:                      43%
     CPU max MHz:                             5314.2139
     CPU min MHz:                             416.1740
     BogoMIPS:                                4790.82
     Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpuid_fault cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d amd_lbr_pmc_freeze
     Virtualization:                          AMD-V
     L1d cache:                               512 KiB (16 instances)
     L1i cache:                               512 KiB (16 instances)
     L2 cache:                                16 MiB (16 instances)
     L3 cache:                                64 MiB (2 instances)
     NUMA node(s):                            1
     NUMA node0 CPU(s):                       0-31
     Vulnerability Gather data sampling:      Not affected
     Vulnerability Ghostwrite:                Not affected
     Vulnerability Indirect target selection: Not affected
     Vulnerability Itlb multihit:             Not affected
     Vulnerability L1tf:                      Not affected
     Vulnerability Mds:                       Not affected
     Vulnerability Meltdown:                  Not affected
     Vulnerability Mmio stale data:           Not affected
     Vulnerability Old microcode:             Not affected
     Vulnerability Reg file data sampling:    Not affected
     Vulnerability Retbleed:                  Not affected
     Vulnerability Spec rstack overflow:      Mitigation; Safe RET
     Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
     Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
     Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
     Vulnerability Srbds:                     Not affected
     Vulnerability Tsa:                       Mitigation; Clear CPU buffers
     Vulnerability Tsx async abort:           Not affected
     Vulnerability Vmscape:                   Mitigation; IBPB before exit to userspace
     
     ***CMake***
     /usr/bin/cmake
     cmake version 3.28.3
     
     CMake suite maintained and supported by Kitware (kitware.com/cmake).
     
     ***g++***
     /usr/bin/g++
     g++ (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
     Copyright (C) 2023 Free Software Foundation, Inc.
     This is free software; see the source for copying conditions.  There is NO
     warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
     
     
     ***nvcc***
     /usr/local/cuda/bin/nvcc
     nvcc: NVIDIA (R) Cuda compiler driver
     Copyright (c) 2005-2025 NVIDIA Corporation
     Built on Fri_Nov__7_07:23:37_PM_PST_2025
     Cuda compilation tools, release 13.1, V13.1.80
     Build cuda_13.1.r13.1/compiler.36836380_0
     
     ***Python***
     /workspace/.venv/bin/python
     Python 3.12.3
     
     ***Environment Variables***
     PATH                            : /workspace/.venv/bin:/vscode/vscode-server/bin/linux-x64/618725e67565b290ba4da6fe2d29f8fa1d4e3622/bin/remote-cli:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/ubuntu/.vscode-server/extensions/ms-python.debugpy-2025.16.0/bundled/scripts/noConfigScripts
     LD_LIBRARY_PATH                 : /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
     NUMBAPRO_NVVM                   :
     NUMBAPRO_LIBDEVICE              :
     CONDA_PREFIX                    :
     PYTHON_PATH                     :
     
     conda not found
     ***pip packages***
     /usr/bin/pip
     Package     Version
     ----------- -------
     dbus-python 1.3.2
     pip         24.0
     Pygments    2.17.2
     PyGObject   3.48.2
     PyYAML      6.0.1
     setuptools  68.1.2
     uv          0.9.16
     wheel       0.42.0
     
</pre></details>

Other/Misc.

No response

Contributing Guidelines

[x] I agree to follow cuTile Python's contributing guidelines
[x] I have searched the open bugs and have found no duplicates for this bug report

Dec 12 '25 04:12 iori2333

@iori2333 thanks for reporting. I have reproduced this bug, and it may need a fix in the CUDA tileir compiler. We will work on a fix.

Dec 12 '25 19:12 haijieg