[perf] Enable load-store forwarding across loop hierarchy for Vulkan performance.
The `store_to_load_forwarding` transformation in the `cfg_optimization` pass ignores `ExternalPtrStmt`. This causes a performance issue in the SPH case. The CUDA backend is insensitive to this optimization, but the Vulkan backend cannot handle this well, even on the same hardware.
I think we can get a ~2.2x speedup for the SPH case when this optimization is enabled.
The related source code section is here.
After some investigation, this does not seem related to Ndarray handling in the `store_to_load_forwarding` pass. The code block (as mentioned) does consider Ndarray, which is an `ExternalPtrStmt`:
https://github.com/taichi-dev/taichi/blob/233f017f00753986b8aea194b2e30ca7a727159e/taichi/ir/control_flow_graph.cpp#L264-L294
Be it Ndarray or Field, the pattern starts with a `GlobalLoadStmt`, hence the else-if clause is executed and `store_forwarding` remains true.
The problem here is that the `get_store_forwarding_data` function cannot find the forwarded data in certain scenarios:
https://github.com/taichi-dev/taichi/blob/233f017f00753986b8aea194b2e30ca7a727159e/taichi/ir/control_flow_graph.cpp#L291
For example, a dummy kernel like this can be optimized by `store_to_load_forwarding`:
```python
@ti.kernel
def foo_ndarray(acc: ti.any_arr(field_dim=1)):
    for i in range(16):
        acc[i] = 1.0
        acc[i] -= i
        acc[i] %= i
```
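After forwarding, the two intermediate loads of `acc[i]` are replaced by the values just stored, leaving a single store per iteration. A hand-derived sketch of the resulting kernel is below (the pass works on the IR, so the actual result may differ; `foo_ndarray_forwarded` is just an illustrative name):

```python
@ti.kernel
def foo_ndarray_forwarded(acc: ti.any_arr(field_dim=1)):
    for i in range(16):
        # both loads of acc[i] are forwarded from the preceding stores,
        # so only the final store remains
        acc[i] = (1.0 - i) % i
```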
Nevertheless, if the accesses happen inside nested loops, for example, the optimization no longer works:
```python
@ti.kernel
def foo_field(acc: ti.template()):
    for i in range(16):
        acc[i] = 1.0
        for j in range(8):
            acc[i] -= j
        for k in range(4):
            acc[i] %= k
```
This happens regardless of Field/Ndarray. The question then becomes how to adapt the `get_store_forwarding_data` function so that the second case is considered forwardable.
OK, so this is not the cause of the ndarray vs. field performance gap.
@jim19930609 @strongoier Any suggestions on this? Should we update the `get_store_forwarding_data` function to be able to optimize this kind of pattern as well?
```python
@ti.kernel
def foo_field(acc: ti.template()):
    for i in range(16):
        acc[i] = 1.0
        for j in range(8):
            acc[i] -= j
        for k in range(4):
            acc[i] %= k
```
Similar to:
```python
@ti.kernel
def foo_field(acc: ti.template()):
    for i in range(16):
        var = acc[i]
        var = 1.0
        for j in range(8):
            var -= j
        for k in range(4):
            var %= k
        acc[i] = var
```
I think this is definitely some optimization opportunity, which may or may not be taken care of by the backend compilers. However, is this the solution to the current problem you guys are facing?
> I think this is definitely some optimization opportunity, which may or may not be taken care of by the backend compilers. However, is this the solution to the current problem you guys are facing?
Partially yes; if this optimization is possible, the Vulkan backend could benefit. (The CUDA backend does not need it, as its runtime optimization is too powerful.)
> Partially yes; if this optimization is possible, the Vulkan backend could benefit. (The CUDA backend does not need it, as its runtime optimization is too powerful.)
Then feel free to hack it!
Update: a simplified version that reproduces the problem.
The original baseline version, which is natural to write but slow:
```python
@ti.kernel
def func(v: ti.types.ndarray(field_dim=1), d: ti.types.ndarray(field_dim=1), a: ti.types.ndarray(field_dim=1)):
    for i in range(N):
        acc = 0.0
        for j in range(N):
            acc += v[i] * d[j]
        a[i] = acc
```
The fast version, which performs the load-store forwarding (LSF) manually:
```python
@ti.kernel
def func(v: ti.types.ndarray(field_dim=1), d: ti.types.ndarray(field_dim=1), a: ti.types.ndarray(field_dim=1)):
    for i in range(N):
        acc = 0.0
        v_val = v[i]
        for j in range(N):
            acc += v_val * d[j]
        a[i] = acc
```
We save `v[i]` outside the inner loop, as it is actually constant there.
The optimization pass needs two steps (a rough sketch follows the list):
- Identify the loads that do not depend on the loop variable.
- Move those loads outside the loop.
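A minimal sketch of those two steps, using a made-up list-of-statements IR (`Load`, `Loop`, and `hoist_invariant_loads` are invented for illustration; this is NOT Taichi's actual CHI IR):

```python
from dataclasses import dataclass

# Toy IR invented for this sketch -- not Taichi's actual IR classes.
@dataclass
class Load:
    array: str  # which external array is read, e.g. "v"
    index: str  # the loop variable (or constant) used as the index

@dataclass
class Loop:
    var: str    # loop variable, e.g. "j"
    body: list  # statements inside the loop

def hoist_invariant_loads(loop: Loop) -> list:
    invariant, remaining = [], []
    for stmt in loop.body:
        # Step 1: a load is loop-invariant if its index does not use the
        # loop variable (a real pass would check full data dependencies).
        if isinstance(stmt, Load) and stmt.index != loop.var:
            invariant.append(stmt)
        else:
            remaining.append(stmt)
    # Step 2: place the hoisted loads in front of the loop.
    loop.body = remaining
    return invariant + [loop]

# v[i] does not use j, so it is hoisted; d[j] stays inside the loop.
inner = Loop(var="j", body=[Load("v", "i"), Load("d", "j")])
print(hoist_invariant_loads(inner))
# [Load(array='v', index='i'), Loop(var='j', body=[Load(array='d', index='j')])]
```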
Another possible solution: this case SHOULD be handled by the L1 cache. Is the problem that some annotations in our SPIR-V codegen are somewhat improper? For example, should `v[i]` be annotated as being in the `Private` scope?
Desired performance of this code snippet on an RTX 3080 (the Vulkan shared-memory version is not yet merged, though):
Full runnable code:
```python
import time

import taichi as ti
import numpy as np

ti.init(arch=ti.vulkan)

block_dim = 512
nBlocks = 1024
N = nBlocks * block_dim

np.random.seed(42)
v_arr = np.random.randn(N).astype(np.float32)
d_arr = np.random.randn(N).astype(np.float32)
a_arr = np.zeros(N).astype(np.float32)
b_arr = np.zeros(N).astype(np.float32)


@ti.kernel
def init(v: ti.types.ndarray(field_dim=1), d: ti.types.ndarray(field_dim=1), a: ti.types.ndarray(field_dim=1)):
    for i in range(N):
        v[i] = ti.random(ti.f32)
        d[i] = ti.random(ti.f32)
        a[i] = 0.0


@ti.kernel
def func(v: ti.types.ndarray(field_dim=1), d: ti.types.ndarray(field_dim=1), a: ti.types.ndarray(field_dim=1)):
    for i in range(N):
        acc = 0.0
        for j in range(N):
            acc += v[i] * d[j]
        a[i] = acc


@ti.kernel
def func_lsf(v: ti.types.ndarray(field_dim=1), d: ti.types.ndarray(field_dim=1), a: ti.types.ndarray(field_dim=1)):
    for i in range(N):
        acc = 0.0
        v_val = v[i]  # hoist the loop-invariant load manually
        for j in range(N):
            acc += v_val * d[j]
        a[i] = acc


init(v_arr, d_arr, a_arr)
func(v_arr, d_arr, a_arr)  # warm up / trigger compilation

nIter = 10

st = time.time()
for i in range(nIter):
    func(v_arr, d_arr, a_arr)
ti.sync()
et = time.time()
print(f"Baseline time {(et - st) / nIter}")

st = time.time()
for i in range(nIter):
    func_lsf(v_arr, d_arr, a_arr)
ti.sync()
et = time.time()
print(f"LSF time {(et - st) / nIter}")
```
I don't know if I'm testing it the wrong way, but on my machine (master branch, RTX 3080) the baseline time is shorter than the LSF time...
```
[Taichi] version 1.1.3, llvm 10.0.0, commit 9d27b94e, linux, python 3.9.12
[Taichi] Starting on arch=vulkan
[I 09/19/22 15:39:00.841 2370085] [vulkan_device_creator.cpp:pick_physical_device@368] Found Vulkan Device 0 (llvmpipe (LLVM 12.0.0, 256 bits))
[I 09/19/22 15:39:00.841 2370085] [vulkan_device_creator.cpp:pick_physical_device@368] Found Vulkan Device 1 (NVIDIA GeForce RTX 3080)
[I 09/19/22 15:39:00.841 2370085] [vulkan_device_creator.cpp:find_queue_families@139] Async compute queue 2, graphics queue 0
[I 09/19/22 15:39:00.841 2370085] [vulkan_device_creator.cpp:find_queue_families@139] Async compute queue 2, graphics queue 0
[I 09/19/22 15:39:00.841 2370085] [vulkan_device_creator.cpp:create_logical_device@436] Vulkan Device "NVIDIA GeForce RTX 3080" supports Vulkan 0 version 1.2.175
Baseline time 0.1443796157836914
LSF time 0.5295180082321167
```