cutile-python
[BUG]: PaddingMode.NEG_INF not working under float8_e4m3
Version
1.0.0
Version
13.1 sm_120
Which installation method(s) does this occur on?
Pip
Describe the bug.
I'm writing a softmax kernel using cutile:
import cuda.tile as ct


@ct.kernel
def softmax(
    input: ct.Array,  # [b, s]
    output: ct.Array,  # [b, s]
    b: ct.Constant[int],
    ts: ct.Constant[int],
):
    bid = ct.bid(0)
    blocks = ct.num_blocks(0)
    # Each block strides over the rows, one (1, ts) tile per row.
    for idx in range(bid, b, blocks):
        # Out-of-bounds columns are padded with -inf so they contribute exp(-inf) = 0.
        line = ct.load(
            input,
            index=(idx, 0),
            shape=(1, ts),
            padding_mode=ct.PaddingMode.NEG_INF,
            allow_tma=True,
        ).astype(ct.float32)
        # Subtract the row maximum for numerical stability.
        line = line - ct.max(line, axis=-1, keepdims=True)
        e_line = ct.exp(line)
        o_line = e_line / ct.sum(e_line, axis=-1, keepdims=True)
        o_line = o_line.astype(input.dtype)  # type: ignore
        ct.store(output, index=(idx, 0), tile=o_line)
This works fine when the input dtype is bfloat16. However, when the input dtype is float8_e4m3, cuTile throws an error:
Traceback (most recent call last):
  File "/workspace/.venv/lib/python3.12/site-packages/cuda/tile/_compile.py", line 362, in compile_cubin
    subprocess.run(command + flags, env=env, check=True, capture_output=True,
  File "/usr/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/local/cuda/bin/tileiras', '/tmp/tmpxqnzdmg7/softmaxtcbkty38.bytecode', '-o', '/tmp/tmpxqnzdmg7/softmaxtcbkty38.cubin', '--gpu-name', 'sm_120', '-O3', '--lineinfo']' died with <Signals.SIGILL: 4>.

During handling of the above exception, another exception occurred:

...
  File "/workspace/.venv/lib/python3.12/site-packages/cuda/tile/_compile.py", line 365, in compile_cubin
    raise TileCompilerExecutionError(e.returncode, e.stderr.decode(), ' '.join(flags),
cuda.tile._exception.TileCompilerExecutionError: Return code -4
Unknown location
After some debugging, I found that the issue is tied to the padding_mode passed to ct.load. With float8_e4m3 input, cuTile works only when padding_mode is set to ZERO, NEG_ZERO, or UNDETERMINED; NEG_INF triggers the crash above.
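One detail that may be relevant here, though it is an assumption and not something confirmed in this report: if the float8 dtype follows the OCP FP8 E4M3 encoding (as PyTorch's torch.float8_e4m3fn does), it has no bit pattern for infinity, so a -inf padding value cannot be represented in the input dtype itself. The sketch below is a plain-Python decoder (the name `decode_e4m3fn` is hypothetical and not part of cuTile) that enumerates all 256 bit patterns to illustrate this.

```python
import math


def decode_e4m3fn(bits: int) -> float:
    """Decode one byte as OCP FP8 E4M3: 1 sign, 4 exponent, 3 mantissa bits, bias 7."""
    sign = -1.0 if bits & 0x80 else 1.0
    exponent = (bits >> 3) & 0xF
    mantissa = bits & 0x7
    if exponent == 0xF and mantissa == 0x7:
        # The all-ones pattern is NaN; the format reserves nothing for infinity.
        return math.nan
    if exponent == 0:
        # Subnormal values (including +0.0 and -0.0).
        return sign * (mantissa / 8.0) * 2.0 ** -6
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


values = [decode_e4m3fn(b) for b in range(256)]
finite = [v for v in values if not math.isnan(v)]

has_inf = any(math.isinf(v) for v in values)
largest, smallest = max(finite), min(finite)

print("any infinity:", has_inf)
print("finite range:", smallest, "to", largest)
```

Whether this is what makes tileiras crash is unknown; it only shows why NEG_INF is a special case for this dtype while ZERO and NEG_ZERO are not.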
Minimum reproducible example
import torch


def launch_softmax(
    input: torch.Tensor,
    output: torch.Tensor | None = None,
) -> torch.Tensor:
    if output is None:
        output = torch.empty_like(input)
    b, s = input.shape
    grid = (min(128, b),)  # at most 128 blocks; the kernel strides over the rows
    # next_power_of_2 is a helper not shown in this report; columns >= s are padding.
    tile_size = next_power_of_2(s)
    ct.launch(
        torch.cuda.current_stream(),
        grid,
        softmax,
        (input, output, b, tile_size),
    )
    return output
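The launcher calls next_power_of_2, which the report does not define. Below is a minimal host-side sketch, assuming it returns the smallest power of two greater than or equal to its argument. It also illustrates, in plain Python with no cuTile calls, why the padding value matters once the tile is wider than the row: -inf padding leaves the softmax unchanged, while zero padding adds spurious mass to the denominator. The names `reference_softmax` and `padded_softmax` are hypothetical.

```python
import math


def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n, for n >= 1 (assumed behaviour of the report's helper)."""
    return 1 << (n - 1).bit_length()


def reference_softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over a single row."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def padded_softmax(row: list[float], pad_value: float) -> list[float]:
    """Pad the row to a power-of-two tile, apply softmax, keep the valid columns."""
    tile_size = next_power_of_2(len(row))
    padded = row + [pad_value] * (tile_size - len(row))
    return reference_softmax(padded)[: len(row)]


row = [1.0, 2.0, 3.0]  # s = 3, so the tile has 4 columns and one padded slot
reference = reference_softmax(row)
with_neg_inf = padded_softmax(row, -math.inf)  # padded slot contributes exp(-inf) = 0
with_zero = padded_softmax(row, 0.0)  # padded slot contributes exp(0 - max) > 0

print("reference:    ", reference)
print("-inf padding: ", with_neg_inf)
print("zero padding: ", with_zero)
```

This suggests that switching to ZERO padding is a correct workaround only when s is already a power of two, so that no padded columns exist.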
Relevant log output
Traceback (most recent call last):
  File "/workspace/.venv/lib/python3.12/site-packages/cuda/tile/_compile.py", line 362, in compile_cubin
    subprocess.run(command + flags, env=env, check=True, capture_output=True,
  File "/usr/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/local/cuda/bin/tileiras', '/tmp/tmpxqnzdmg7/softmaxtcbkty38.bytecode', '-o', '/tmp/tmpxqnzdmg7/softmaxtcbkty38.cubin', '--gpu-name', 'sm_120', '-O3', '--lineinfo']' died with <Signals.SIGILL: 4>.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/scripts/test_softmax.py", line 78, in <module>
    bench()
  File "/workspace/scripts/test_softmax.py", line 34, in bench
    bench_fn(
  File "/workspace/kernels/utils/__init__.py", line 15, in bench_fn
    fn(*args, **kwargs)
  File "/workspace/scripts/test_softmax.py", line 22, in do_test
    launcher(input, output)
  File "/workspace/kernels/cutile/softmax.py", line 48, in launch_softmax
    ct.launch(
  File "/workspace/.venv/lib/python3.12/site-packages/cuda/tile/_compile.py", line 223, in __call__
    lib = compile_tile(self.pyfunc, pyfunc_args, self.compiler_options, tile_context)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.12/site-packages/cuda/tile/_compile.py", line 70, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.12/site-packages/cuda/tile/_compile.py", line 211, in compile_tile
    raise e
  File "/workspace/.venv/lib/python3.12/site-packages/cuda/tile/_compile.py", line 204, in compile_tile
    cubin_file = compile_cubin(f.name, compiler_options, sm_arch,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.12/site-packages/cuda/tile/_compile.py", line 365, in compile_cubin
    raise TileCompilerExecutionError(e.returncode, e.stderr.decode(), ' '.join(flags),
cuda.tile._exception.TileCompilerExecutionError: Return code -4
Unknown location
Full env printout
<details><summary>Click here to see environment details</summary><pre>
***git***
Not inside a git repository
***OS Information***
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.3 LTS"
PRETTY_NAME="Ubuntu 24.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.3 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
Linux mayu 6.17.11+deb14-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.17.11-1 (2025-12-07) x86_64 x86_64 x86_64 GNU/Linux
***GPU Information***
Fri Dec 12 05:09:48 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.44.01              Driver Version: 590.44.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5060 Ti     On  |   00000000:01:00.0 Off |                  N/A |
|  0%   44C    P8              8W /  180W |       2MiB /  16311MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
***CPU***
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 7940HX with Radeon Graphics
CPU family: 25
Model: 97
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 2
Frequency boost: enabled
CPU(s) scaling MHz: 43%
CPU max MHz: 5314.2139
CPU min MHz: 416.1740
BogoMIPS: 4790.82
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpuid_fault cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d amd_lbr_pmc_freeze
Virtualization: AMD-V
L1d cache: 512 KiB (16 instances)
L1i cache: 512 KiB (16 instances)
L2 cache: 16 MiB (16 instances)
L3 cache: 64 MiB (2 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerability Gather data sampling: Not affected
Vulnerability Ghostwrite: Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Old microcode: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsa: Mitigation; Clear CPU buffers
Vulnerability Tsx async abort: Not affected
Vulnerability Vmscape: Mitigation; IBPB before exit to userspace
***CMake***
/usr/bin/cmake
cmake version 3.28.3
CMake suite maintained and supported by Kitware (kitware.com/cmake).
***g++***
/usr/bin/g++
g++ (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
***nvcc***
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Nov__7_07:23:37_PM_PST_2025
Cuda compilation tools, release 13.1, V13.1.80
Build cuda_13.1.r13.1/compiler.36836380_0
***Python***
/workspace/.venv/bin/python
Python 3.12.3
***Environment Variables***
PATH : /workspace/.venv/bin:/vscode/vscode-server/bin/linux-x64/618725e67565b290ba4da6fe2d29f8fa1d4e3622/bin/remote-cli:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/ubuntu/.vscode-server/extensions/ms-python.debugpy-2025.16.0/bundled/scripts/noConfigScripts
LD_LIBRARY_PATH : /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
NUMBAPRO_NVVM :
NUMBAPRO_LIBDEVICE :
CONDA_PREFIX :
PYTHON_PATH :
conda not found
***pip packages***
/usr/bin/pip
Package Version
----------- -------
dbus-python 1.3.2
pip 24.0
Pygments 2.17.2
PyGObject 3.48.2
PyYAML 6.0.1
setuptools 68.1.2
uv 0.9.16
wheel 0.42.0
</pre></details>
Other/Misc.
No response
Contributing Guidelines
- [x] I agree to follow cuTile Python's contributing guidelines
- [x] I have searched the open bugs and have found no duplicates for this bug report
@iori2333 Thanks for reporting. I have reproduced this bug, and it may need a fix in the CUDA Tile IR compiler. We will work on a fix.