np.float32 smallest subnormal set to zero after `ti.init`
Describe the bug
The smallest subnormal of the numpy.float32 dtype is set to zero after `ti.init`, causing my computations to fail completely. Oddly, the warning disappears and the correct subnormal of 1.4012985e-45 gets printed if you call `print(np.finfo(np.float32))` before initializing Taichi, but the computation still fails regardless.

EDIT: I see that one issue was how I was initializing the response container, so at least I get results now! I'm just wondering whether the remaining discrepancy comes from my logic or from this subnormal issue...
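As a quick sanity check, independent of Taichi, you can probe whether float32 subnormals are being flushed to zero in the current process. This is a minimal sketch using only numpy (the function name is mine):

```python
import numpy as np

def subnormals_flushed() -> bool:
    """Return True if float32 subnormals are flushed to zero (FTZ/DAZ).

    Halving the smallest normal float32 (~1.1754944e-38) should produce
    a subnormal (~5.88e-39); under flush-to-zero it comes back as 0.0.
    """
    smallest_normal = np.float32(np.finfo(np.float32).tiny)
    return float(smallest_normal * np.float32(0.5)) == 0.0

print(subnormals_flushed())  # False in a fresh process with IEEE defaults
```

Running this before and after `ti.init` should show whether initialization flips the flag for the whole process.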
To Reproduce
```python
import taichi as ti
import numpy as np
from numpy.typing import NDArray


def ti_volume_correlation(kernel_slab: NDArray, stimulus: NDArray):
    """Use Taichi to do a volume convolution.

    This should wrap the process of creating ti.field objects and whatnot.
    """
    # Response container.
    # response_container = ti.field(dtype=ti.f32, shape=stimulus.shape[2] - kernel_size)
    kernel_size = kernel_slab.shape[2]
    batch_count = stimulus.shape[2] - kernel_size
    # Critically, batch_count is the length of the response container.
    response_container = np.ascontiguousarray(np.zeros(batch_count), dtype=np.float32)
    # Should modify in place.
    _gpu_convolve(
        response_container,
        np.ascontiguousarray(kernel_slab, dtype=np.float32),
        np.ascontiguousarray(stimulus, dtype=np.float32),
        batch_count,
    )
    return response_container


@ti.kernel
def _gpu_convolve(
    response_container: ti.types.ndarray(dtype=ti.f32, ndim=1),
    kernel_slab: ti.types.ndarray(dtype=ti.f32, ndim=3),
    stimulus: ti.types.ndarray(dtype=ti.f32, ndim=3),
    batch_count: int,
):
    for x_idx, y_idx, t_idx in kernel_slab:
        for t_offset in range(batch_count):
            product = (
                kernel_slab[x_idx, y_idx, t_idx]
                * stimulus[x_idx, y_idx, t_idx + t_offset]
            )
            response_container[t_offset] += product


def slow_python_volume_correlation(kernel_slab: NDArray, stimulus: NDArray):
    kernel_size = kernel_slab.shape[2]
    response = np.zeros(stimulus.shape[2] - kernel_size)
    # Manual convolution because I'm in a hurry and don't have time to figure out numpy's bullshit
    for idx, start in enumerate(range(stimulus.shape[2] - kernel_size)):
        if idx % 100 == 0:
            print(f"Slab #{idx + 1} started.")
        response[idx] = np.sum(
            kernel_slab * stimulus[:, :, start : start + kernel_size]
        )
    return response


if __name__ == "__main__":
    # print("before TI init")
    # print(np.finfo(np.float32))
    ti.init(arch=ti.metal)
    print("after TI init")
    print(np.finfo(np.float32))

    rng = np.random.default_rng(100)
    kernel_slab = rng.normal(size=(250, 250, 400))
    signal = rng.normal(size=(250, 250, 500))

    response = ti_volume_correlation(kernel_slab, signal)
    print(response.shape)
    print(response.max())
    print(response.min())

    response_2 = slow_python_volume_correlation(kernel_slab, signal)
    print(response_2.shape)
    print(response_2.max())
    print(response_2.min())
```
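As an aside, the manual loop in `slow_python_volume_correlation` can be vectorized with numpy's `sliding_window_view`. A sketch, assuming the same array shapes as above (the function name is mine, not from the original script):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def numpy_volume_correlation(kernel_slab: np.ndarray, stimulus: np.ndarray) -> np.ndarray:
    """Vectorized equivalent of the manual slab loop.

    sliding_window_view exposes every length-kernel_size window along the
    time axis at once; einsum then does the multiply-and-sum per window.
    """
    kernel_size = kernel_slab.shape[2]
    # Shape: (x, y, n_windows, kernel_size), one window per start offset.
    windows = sliding_window_view(stimulus, kernel_size, axis=2)
    # The original loop only visits the first (T - kernel_size) windows.
    n = stimulus.shape[2] - kernel_size
    return np.einsum("xyt,xybt->b", kernel_slab, windows[:, :, :n, :])
```

This keeps the accumulation in float64, which is also a useful reference when debugging the float32 GPU path.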
Log/Screenshots
```
❯ python prove_taichi_broken.py
[Taichi] version 1.6.0, llvm 15.0.7, commit f1c6fbbd, osx, python 3.11.5
[Taichi] Starting on arch=metal
after TI init
Machine parameters for float32
---------------------------------------------------------------
precision =   6   resolution = 1.0000000e-06
machep =    -23   eps =        1.1920929e-07
negep =     -24   epsneg =     5.9604645e-08
minexp =   -126   tiny =       1.1754944e-38
maxexp =    128   max =        3.4028235e+38
nexp =        8   min =        -max
smallest_normal = 1.1754944e-38   smallest_subnormal = 0.0000000e+00
---------------------------------------------------------------

/Users/ricorodriguez/GradSchool/VHLab/CatDiffusion/env/lib/python3.11/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/Users/ricorodriguez/GradSchool/VHLab/CatDiffusion/env/lib/python3.11/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
(200,)
435.3192
-568.0371
Slab #1 started.
Slab #101 started.
(200,)
12459.768944424823
-15875.401244324563
```
Additional comments
Output from ti diagnose:
```
[Taichi] version 1.6.0, llvm 15.0.7, commit f1c6fbbd, osx, python 3.11.5

*******************************************
**      Taichi Programming Language      **
*******************************************

Docs:   https://docs.taichi-lang.org/
GitHub: https://github.com/taichi-dev/taichi/
Forum:  https://forum.taichi.graphics/

Taichi system diagnose:

python: 3.11.5 (main, Aug 31 2023, 23:43:57) [Clang 14.0.3 (clang-1403.0.22.14.1)]
system: darwin
executable: /Users/ricorodriguez/GradSchool/VHLab/CatDiffusion/env/bin/python3.11
platform: macOS-14.0-arm64-arm-64bit
architecture: 64bit
uname: uname_result(system='Darwin', node='datamancer.lan', release='23.0.0', version='Darwin Kernel Version 23.0.0: Fri Sep 15 14:41:43 PDT 2023; root:xnu-10002.1.13~1/RELEASE_ARM64_T6000', machine='arm64')
/Users/ricorodriguez/GradSchool/VHLab/CatDiffusion/env/lib/python3.11/site-packages/taichi/tools/diagnose.py:20: DeprecationWarning: Use setlocale(), getencoding() and getlocale() instead
  print(f'locale: {".".join(locale.getdefaultlocale())}')
locale: en_US.UTF-8
PATH: /Users/ricorodriguez/GradSchool/VHLab/CatDiffusion/env/bin:/Users/ricorodriguez/.asdf/shims:/opt/homebrew/opt/asdf/libexec/bin:/usr/local/share/google-cloud-sdk/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/Users/ricorodriguez/bin:/Users/ricorodriguez/.local/bin:/usr/local/sbin:/usr/local/opt/[email protected]/bin:/opt/homebrew/opt/sqlite/bin:/Users/ricorodriguez/.poetry/bin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/opt/X11/bin:/Library/Apple/usr/bin:/usr/local/MacGPG2/bin:/Library/TeX/texbin:/Applications/Little Snitch.app/Contents/Components:/usr/local/go/bin:/Users/ricorodriguez/.fig/bin:/Users/ricorodriguez/.local/bin:/opt/homebrew/opt/fzf/bin
PYTHONPATH: ['/Users/ricorodriguez/GradSchool/VHLab/CatDiffusion/env/bin', '/Users/ricorodriguez/.asdf/installs/python/3.11.5/Library/Frameworks/Python.framework/Versions/3.11/lib/python311.zip', '/Users/ricorodriguez/.asdf/installs/python/3.11.5/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11', '/Users/ricorodriguez/.asdf/installs/python/3.11.5/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/lib-dynload', '/Users/ricorodriguez/GradSchool/VHLab/CatDiffusion/env/lib/python3.11/site-packages']

`lsb_release` not available: [Errno 2] No such file or directory: 'lsb_release'

import: <module 'taichi' from '/Users/ricorodriguez/GradSchool/VHLab/CatDiffusion/env/lib/python3.11/site-packages/taichi/__init__.py'>

cc: False
cpu: True
metal: True
opengl: False
cuda: False
vulkan: True

`glewinfo` not available: [Errno 2] No such file or directory: 'glewinfo'

`nvidia-smi` not available: [Errno 2] No such file or directory: 'nvidia-smi'

[Taichi] version 1.6.0, llvm 15.0.7, commit f1c6fbbd, osx, python 3.11.5
[Taichi] version 1.6.0, llvm 15.0.7, commit f1c6fbbd, osx, python 3.11.5
[Taichi] Starting on arch=arm64
[W 10/04/23 14:14:37.549 541207] [misc.py:adaptive_arch_select@753] Arch=[<Arch.opengl: 6>] is not supported, falling back to CPU
[Taichi] version 1.6.0, llvm 15.0.7, commit f1c6fbbd, osx, python 3.11.5
[Taichi] Starting on arch=arm64
[W 10/04/23 14:14:37.807 541268] [misc.py:adaptive_arch_select@753] Arch=[<Arch.cuda: 4>] is not supported, falling back to CPU
[Taichi] version 1.6.0, llvm 15.0.7, commit f1c6fbbd, osx, python 3.11.5
[Taichi] Starting on arch=arm64
[Taichi] version 1.6.0, llvm 15.0.7, commit f1c6fbbd, osx, python 3.11.5

*******************************************
**      Taichi Programming Language      **
*******************************************

Docs:   https://docs.taichi-lang.org/
GitHub: https://github.com/taichi-dev/taichi/
Forum:  https://forum.taichi.graphics/

TAICHI EXAMPLES
────────────────────────────────────────────────────────────────────────────────────
 0: ad_gravity               25: keyboard                 50: pbf2d
 1: circle_packing_image     26: laplace                  51: physarum
 2: comet                    27: laplace_equation         52: poisson_disk_sampling
 3: cornell_box              28: mandelbrot_zoom          53: print_offset
 4: diff_sph                 29: marching_squares         54: rasterizer
 5: euler                    30: mass_spring_3d_ggui      55: regression
 6: eulerfluid2d             31: mass_spring_game         56: sdf_renderer
 7: explicit_activation      32: mass_spring_game_ggui    57: simple_derivative
 8: export_mesh              33: mciso_advanced           58: simple_texture
 9: export_ply               34: mgpcg                    59: simple_uv
10: export_videos            35: mgpcg_advanced           60: snow_phaseField
11: fem128                   36: minimal                  61: stable_fluid
12: fem128_ggui              37: minimization             62: stable_fluid_ggui
13: fem99                    38: mpm128                   63: stable_fluid_graph
14: fractal                  39: mpm128_ggui              64: taichi_bitmasked
15: fractal3d_ggui           40: mpm3d                    65: taichi_dynamic
16: fullscreen               41: mpm3d_ggui               66: taichi_logo
17: game_of_life             42: mpm88                    67: taichi_ngp
18: gui_image_io             43: mpm88_graph              68: taichi_sparse
19: gui_widgets              44: mpm99                    69: texture_graph
20: implicit_fem             45: mpm_lagrangian_forces    70: tutorial
21: implicit_mass_spring     46: nbody                    71: two_stream_instability
22: initial_value_problem    47: odop_solar               72: vortex_rings
23: jacobian                 48: oit_renderer             73: waterwave
24: karman_vortex_street     49: patterns
────────────────────────────────────────────────────────────────────────────────────
42
Running example minimal ...
[Taichi] Starting on arch=arm64
42.0
>>> Running time: 0.11s

Consider attaching this log when maintainers ask about system information.
>>> Running time: 2.91s
```
Ah, never mind - it appears that I have a race condition, as indicated by running with the CPU backend.

I'm completely new to GPU programming - what is the recommendation here? Should I use intermediate containers to store the results of a given block? Or is there some other pattern I should be following?

Also, I'm still curious about the original warning - where could that be coming from?

Here's what I'm doing now - is this a good approach? It seems to work, although obviously there's some loss of precision with float32.
```python
def ti_volume_convolution(kernel_slab: NDArray, stimulus: NDArray):
    """Use Taichi to do a volume convolution.

    This should wrap the process of creating ti.field objects and whatnot.
    """
    # Response container.
    # response_container = ti.field(dtype=ti.f32, shape=stimulus.shape[2] - kernel_size)
    width, height, kernel_size = kernel_slab.shape
    stimulus_length = stimulus.shape[2]
    batch_count = stimulus_length - kernel_size
    # Critically, batch_count is the length of the response container.
    compute_cells = np.ascontiguousarray(
        np.zeros(shape=(kernel_size, batch_count), dtype=np.float32)
    )
    response_container = np.ascontiguousarray(np.zeros(batch_count), dtype=np.float32)
    # Should modify in place.
    _gpu_convolve(
        response_container,
        compute_cells,
        np.ascontiguousarray(kernel_slab, dtype=np.float32),
        np.ascontiguousarray(stimulus, dtype=np.float32),
        width,
        height,
        kernel_size,
    )
    return response_container


@ti.kernel
def _gpu_convolve(
    response_container: ti.types.ndarray(dtype=ti.f32, ndim=1),
    compute_cells: ti.types.ndarray(dtype=ti.f32, ndim=2),
    kernel_slab: ti.types.ndarray(dtype=ti.f32, ndim=3),
    stimulus: ti.types.ndarray(dtype=ti.f32, ndim=3),
    width: ti.int32,
    height: ti.int32,
    kernel_size: ti.int32,
):
    """Brute force GPU convolution.

    Note that in taichi kernels, only the outermost loops are parallelized.

    Want to parallelize:
    - computation of piecewise products
    - summation of convolved slabs
    """
    for t_idx, batch_idx in compute_cells:
        for x_idx in range(width):
            for y_idx in range(height):
                compute_cells[t_idx, batch_idx] += (
                    kernel_slab[x_idx, y_idx, t_idx]
                    * stimulus[x_idx, y_idx, t_idx + batch_idx]
                )
    for batch_idx in response_container:
        batch_total = 0.0  # 0.0 (not 0) so the accumulator is inferred as f32
        for t_idx in range(kernel_size):
            batch_total += compute_cells[t_idx, batch_idx]
        response_container[batch_idx] = batch_total
```
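On the float32 precision point: each output element above is a reduction over roughly 25 million products, so float32 round-off accumulates far beyond machine epsilon, and exact equality against the float64 reference is the wrong test. A hypothetical comparison helper (the name, signature, and tolerance are mine, chosen for this problem size):

```python
import numpy as np

def responses_match(gpu_f32: np.ndarray, cpu_f64: np.ndarray, rtol: float = 1e-3) -> bool:
    """Compare a float32 GPU reduction against a float64 CPU reference.

    With ~25e6 terms per output element, float32 accumulation error can
    reach a relative magnitude well above eps (~1.19e-7), so a loose
    relative tolerance is a sane starting point.
    """
    return bool(np.allclose(gpu_f32.astype(np.float64), cpu_f64, rtol=rtol))
```

numpy's default `rtol=1e-5` would likely flag spurious mismatches at this scale; if even `1e-3` fails, the discrepancy is probably a logic bug rather than round-off.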
Hi Datamance, awesome work! Can you be more specific about the fp32 precision issue? Is it related to overflow or underflow?

> Should I use intermediate containers to store the results of a given block? Or is there some other pattern that I should be following?

I think you're doing a good job so far - the goal of Taichi is to make parallel code work efficiently without you having to worry about this stuff. There are some more advanced patterns like ti.simt.SharedArray, but they're not trivial to use, and I would only recommend them if you're chasing extreme performance.
@jim19930609 So the exact warning was:

```
UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
```

It sounds like it might be the same issue that this person was facing in another context, but I can't be sure. Here's the key part of what they wrote:

> Some Googling led me to this issue, which pointed toward some shared library compiled with the gcc/clang option -ffast-math that was being loaded as the culprit. It turns out (somewhat insanely) that when -ffast-math is enabled, the compiler will link in a constructor that sets the FTZ/DAZ flags whenever the library is loaded - even on shared libraries, which means that any application that loads that library will have its floating point behavior changed for the whole process. And -Ofast, which sounds appealingly like a "make my program go fast" flag, automatically enables -ffast-math, so some projects may unwittingly turn it on without realizing the implications.
Does taichi depend on any shared libraries that might do this, or something like it?
Interesting... that's correct: Taichi does use fast math by default, and that is likely the reason here. But honestly, that sounds more like an issue with fast math itself.
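To make the stakes of flushed subnormals concrete: with FTZ/DAZ set (as fast math does), two distinct small numbers can have a difference of exactly zero, silently breaking downstream comparisons and divisions. A minimal numpy sketch, assuming a process where fast math is not active:

```python
import numpy as np

# Two distinct small float32 normals whose difference is subnormal.
x = np.float32(2.0e-38)
y = np.float32(1.5e-38)
d = np.float32(x - y)  # ~5e-39, representable only as a subnormal

# Under IEEE gradual underflow, x != y guarantees x - y != 0.
# With FTZ enabled, d would come back as exactly 0.0, so code like
# `if x != y: z /= (x - y)` can suddenly divide by zero.
print(d != 0)
```

Run in a process that has loaded a fast-math-compiled library, the printed result flips, which is exactly the kind of spooky action-at-a-distance the quoted post describes.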