[perf] Enable load-store forwarding across loop hierarchy for Vulkan performance.
The `store_to_load_forwarding` transformation in the `cfg_optimization` pass ignores `ExternalPtrStmt`. This causes a performance issue in the SPH case. The CUDA backend is insensitive to this optimization, but the Vulkan backend cannot handle this well, even on the same hardware.
I think we can get a ~2.2x speedup for the SPH case when this optimization is enabled.
The related source code section is here.
After some investigation, this does not seem related to Ndarray handling in the `store_to_load_forwarding` pass. The code block (as mentioned) does consider Ndarray, which is an `ExternalPtrStmt`:
https://github.com/taichi-dev/taichi/blob/233f017f00753986b8aea194b2e30ca7a727159e/taichi/ir/control_flow_graph.cpp#L264-L294
Be it Ndarray or Field, the pattern starts with a `GlobalLoadStmt`, hence the else-if clause is executed and `store_forwarding` remains true.
The problem here is that the `get_store_forwarding_data` function cannot find the forwarded data in certain scenarios:
https://github.com/taichi-dev/taichi/blob/233f017f00753986b8aea194b2e30ca7a727159e/taichi/ir/control_flow_graph.cpp#L291
For example, a dummy kernel like this can be optimized by `store_to_load_forwarding`:
```python
@ti.kernel
def foo_ndarray(acc: ti.any_arr(field_dim=1)):
    for i in range(16):
        acc[i] = 1.0
        acc[i] -= i
        acc[i] %= i
```
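After forwarding, the two intermediate loads of `acc[i]` are replaced by the values just stored, leaving a single store per iteration. A hand-derived sketch of the resulting kernel is below (the pass works on the IR, so the actual result may differ; `foo_ndarray_forwarded` is just an illustrative name):

```python
@ti.kernel
def foo_ndarray_forwarded(acc: ti.any_arr(field_dim=1)):
    for i in range(16):
        # both loads of acc[i] are forwarded from the preceding stores,
        # so only the final store remains
        acc[i] = (1.0 - i) % i
```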
Nevertheless, if the accesses happen inside nested loops, for example, the optimization no longer works:
```python
@ti.kernel
def foo_field(acc: ti.template()):
    for i in range(16):
        acc[i] = 1.0
        for j in range(8):
            acc[i] -= j
        for k in range(4):
            acc[i] %= k
```
This happens regardless of Field/Ndarray. The question then becomes how to adapt the `get_store_forwarding_data` function so that the second case is considered forwardable.
OK, so this is not the cause of the ndarray vs. field performance gap.
@jim19930609 @strongoier Any suggestions on this? Should we update the `get_store_forwarding_data` function to be able to optimize this kind of pattern as well?
```python
@ti.kernel
def foo_field(acc: ti.template()):
    for i in range(16):
        acc[i] = 1.0
        for j in range(8):
            acc[i] -= j
        for k in range(4):
            acc[i] %= k
```
Similar to:
```python
@ti.kernel
def foo_field(acc: ti.template()):
    for i in range(16):
        var = acc[i]
        var = 1.0
        for j in range(8):
            var -= j
        for k in range(4):
            var %= k
        acc[i] = var
```
I think this is definitely some optimization opportunity, which may or may not be taken care of by the backend compilers. However, is this the solution to the current problem you guys are facing?
> I think this is definitely some optimization opportunity, which may or may not be taken care of by the backend compilers. However, is this the solution to the current problem you guys are facing?
Partially yes; if this optimization is possible, the Vulkan backend could benefit. (The CUDA backend does not need it, as its runtime optimization is too powerful.)
> Partially yes; if this optimization is possible, the Vulkan backend could benefit. (The CUDA backend does not need it, as its runtime optimization is too powerful.)
Then feel free to hack it!
Update: a simplified version that reproduces the problem.
The original baseline version, which is natural to write but slow:
```python
@ti.kernel
def func(v: ti.types.ndarray(field_dim=1), d: ti.types.ndarray(field_dim=1), a: ti.types.ndarray(field_dim=1)):
    for i in range(N):
        acc = 0.0
        for j in range(N):
            acc += v[i] * d[j]
        a[i] = acc
```
The fast version, which performs the load-store forwarding (LSF) manually:
```python
@ti.kernel
def func(v: ti.types.ndarray(field_dim=1), d: ti.types.ndarray(field_dim=1), a: ti.types.ndarray(field_dim=1)):
    for i in range(N):
        acc = 0.0
        v_val = v[i]
        for j in range(N):
            acc += v_val * d[j]
        a[i] = acc
```
We save `v[i]` outside the inner loop, as it is actually constant there.
The optimization pass needs two steps (a rough sketch follows the list):
- Identify the loads that do not depend on the loop variable.
- Move those loads outside the loop.
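A minimal sketch of those two steps, using a made-up list-of-statements IR (`Load`, `Loop`, and `hoist_invariant_loads` are invented for illustration; this is NOT Taichi's actual CHI IR):

```python
from dataclasses import dataclass

# Toy IR invented for this sketch -- not Taichi's actual IR classes.
@dataclass
class Load:
    array: str  # which external array is read, e.g. "v"
    index: str  # the loop variable (or constant) used as the index

@dataclass
class Loop:
    var: str    # loop variable, e.g. "j"
    body: list  # statements inside the loop

def hoist_invariant_loads(loop: Loop) -> list:
    invariant, remaining = [], []
    for stmt in loop.body:
        # Step 1: a load is loop-invariant if its index does not use the
        # loop variable (a real pass would check full data dependencies).
        if isinstance(stmt, Load) and stmt.index != loop.var:
            invariant.append(stmt)
        else:
            remaining.append(stmt)
    # Step 2: place the hoisted loads in front of the loop.
    loop.body = remaining
    return invariant + [loop]

# v[i] does not use j, so it is hoisted; d[j] stays inside the loop.
inner = Loop(var="j", body=[Load("v", "i"), Load("d", "j")])
print(hoist_invariant_loads(inner))
# [Load(array='v', index='i'), Loop(var='j', body=[Load(array='d', index='j')])]
```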
Another possible solution: this case SHOULD be handled by the L1 cache. Is the problem that some annotations in our SPIR-V codegen are somewhat improper? For example, should `v[i]` be annotated as being in the `Private` scope?
Desired performance of this code snippet on an RTX 3080 (the Vulkan shared-memory version is not yet merged, though):
Full runnable code:
```python
import time

import taichi as ti
import numpy as np

ti.init(arch=ti.vulkan)

block_dim = 512
nBlocks = 1024
N = nBlocks * block_dim

np.random.seed(42)
v_arr = np.random.randn(N).astype(np.float32)
d_arr = np.random.randn(N).astype(np.float32)
a_arr = np.zeros(N).astype(np.float32)
b_arr = np.zeros(N).astype(np.float32)


@ti.kernel
def init(v: ti.types.ndarray(field_dim=1), d: ti.types.ndarray(field_dim=1), a: ti.types.ndarray(field_dim=1)):
    for i in range(N):
        v[i] = ti.random(ti.f32)
        d[i] = ti.random(ti.f32)
        a[i] = 0.0


@ti.kernel
def func(v: ti.types.ndarray(field_dim=1), d: ti.types.ndarray(field_dim=1), a: ti.types.ndarray(field_dim=1)):
    for i in range(N):
        acc = 0.0
        for j in range(N):
            acc += v[i] * d[j]
        a[i] = acc


@ti.kernel
def func_lsf(v: ti.types.ndarray(field_dim=1), d: ti.types.ndarray(field_dim=1), a: ti.types.ndarray(field_dim=1)):
    for i in range(N):
        acc = 0.0
        v_val = v[i]  # hoist the loop-invariant load manually
        for j in range(N):
            acc += v_val * d[j]
        a[i] = acc


init(v_arr, d_arr, a_arr)
func(v_arr, d_arr, a_arr)  # warm up / trigger compilation

nIter = 10

st = time.time()
for i in range(nIter):
    func(v_arr, d_arr, a_arr)
ti.sync()
et = time.time()
print(f"Baseline time {(et - st) / nIter}")

st = time.time()
for i in range(nIter):
    func_lsf(v_arr, d_arr, a_arr)
ti.sync()
et = time.time()
print(f"LSF time {(et - st) / nIter}")
```
I don't know if I'm testing it the wrong way, but on my machine (master branch, RTX 3080) the baseline time is shorter than the LSF time...
```
[Taichi] version 1.1.3, llvm 10.0.0, commit 9d27b94e, linux, python 3.9.12
[Taichi] Starting on arch=vulkan
[I 09/19/22 15:39:00.841 2370085] [vulkan_device_creator.cpp:pick_physical_device@368] Found Vulkan Device 0 (llvmpipe (LLVM 12.0.0, 256 bits))
[I 09/19/22 15:39:00.841 2370085] [vulkan_device_creator.cpp:pick_physical_device@368] Found Vulkan Device 1 (NVIDIA GeForce RTX 3080)
[I 09/19/22 15:39:00.841 2370085] [vulkan_device_creator.cpp:find_queue_families@139] Async compute queue 2, graphics queue 0
[I 09/19/22 15:39:00.841 2370085] [vulkan_device_creator.cpp:find_queue_families@139] Async compute queue 2, graphics queue 0
[I 09/19/22 15:39:00.841 2370085] [vulkan_device_creator.cpp:create_logical_device@436] Vulkan Device "NVIDIA GeForce RTX 3080" supports Vulkan 0 version 1.2.175
Baseline time 0.1443796157836914
LSF time 0.5295180082321167
```