[BUG]: Accessing a compile-time vector is slower than accessing a runtime vector

Open andresnowak opened this issue 1 year ago • 0 comments

Bug description

This problem doesn't occur if you simply create a compile-time vector and access its values. It shows up in more complex cases, such as performing operations on a tensor while using the compile-time vector as the strides for indexing into it. It can make the program run 10 or even 100 times slower.

Steps to reproduce

from benchmark import benchmark
from tensor import Tensor, TensorShape
 
alias dtype = DType.float32
 
 
@always_inline
fn calculate_strides(shape: TensorShape) -> DynamicVector[Int]:
    var strides = DynamicVector[Int]()
    strides.resize(shape.rank(), 1)
 
    for i in range(shape.rank() - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
 
    return strides ^
 
 
fn main():
    alias shape = TensorShape(4, 16, 28, 28)
    var res_tensor = Tensor[dtype](shape)
    var a = Tensor[dtype](shape)
    var b = Tensor[dtype](shape)
 
    alias strides_res_comp = calculate_strides(shape)
    alias strides_a_comp = calculate_strides(shape)
    alias strides_b_comp = calculate_strides(shape)
 
    var strides_res_runt = calculate_strides(shape)
    var strides_a_runt = calculate_strides(shape)
    var strides_b_runt = calculate_strides(shape)
 
    @parameter
    fn wrapper_comp():
 
        for i in range(shape.num_elements() // (shape[3] * shape[2])):
            for j in range(shape[2]):
                for k in range(shape[3]):
                    var pos_res = (
                        i * strides_res_comp[0] + j * strides_res_comp[1] + k * strides_res_comp[2]
                    )
                    var pos_a = (
                        i * strides_a_comp[0] + j * strides_a_comp[1] + k * strides_a_comp[2]
                    )
                    var pos_b = (
                        i * strides_b_comp[0] + j * strides_b_comp[1] + k * strides_b_comp[2]
                    )
 
                    res_tensor[pos_res] = a[pos_a] + b[pos_b]
 
    @parameter
    fn wrapper_runtime():
        for i in range(shape.num_elements() // (shape[3] * shape[2])):
            for j in range(shape[2]):
                for k in range(shape[3]):
                    var pos_res = (
                        i * strides_res_runt[0] + j * strides_res_runt[1] + k * strides_res_runt[2]
                    )
                    var pos_a = (
                        i * strides_a_runt[0] + j * strides_a_runt[1] + k * strides_a_runt[2]
                    )
                    var pos_b = (
                        i * strides_b_runt[0] + j * strides_b_runt[1] + k * strides_b_runt[2]
                    )
 
                    res_tensor[pos_res] = a[pos_a] + b[pos_b]
 
    var runtime_bench = benchmark.run[wrapper_runtime]()
    runtime_bench.print()
 
    var comp_bench = benchmark.run[wrapper_comp]()
    comp_bench.print()
 
    _ = (res_tensor, a, b, strides_a_runt, strides_b_runt, strides_res_runt)

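Results (the first report is the runtime-strides benchmark, the second is the compile-time-strides benchmark):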

---------------------
Benchmark Report (s)
---------------------
Mean: 0.00010807986445046594
Total: 4.9060692870000002
Iters: 45393
Warmup Mean: 0.00035835499999999998
Warmup Total: 0.00071670999999999996
Warmup Iters: 2
Fastest Mean: 0.0
Slowest Mean: 0.00016379333886318572
 
---------------------
Benchmark Report (s)
---------------------
Mean: 0.034970499422310758
Total: 8.7775953550000008
Iters: 251
Warmup Mean: 0.022904773
Warmup Total: 0.045809546
Warmup Iters: 2
Fastest Mean: 0.023195234536
Slowest Mean: inf
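
To put a number on the gap, the two mean times can be divided directly. The snippet below is a small Python sketch (not Mojo) that only restates the figures printed in the reports; the variable names are illustrative, and it assumes the first report belongs to `wrapper_runtime` because `runtime_bench.print()` is called first in the repro.

```python
# Mean times in seconds, copied from the two benchmark reports.
runtime_mean = 0.00010807986445046594  # first report: runtime strides
comp_mean = 0.034970499422310758       # second report: compile-time strides

# Ratio of the compile-time mean to the runtime mean.
slowdown = comp_mean / runtime_mean
print(f"compile-time strides are about {slowdown:.1f}x slower")
```

On these numbers the compile-time version is a bit over 300 times slower per iteration.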

P.S.: This example also crashes, but the benchmark still shows the difference in speed.
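
The index arithmetic in the repro can be checked outside Mojo. The following is a Python sketch (not Mojo) that mirrors `calculate_strides` and the loop bounds from the repro and computes the largest flat index the loops produce. It only suggests a possible explanation for the crash: the loops multiply `i`, `j`, `k` by strides 0, 1, 2 while `i` runs over 64 values, so the computed position goes well past the number of elements. Whether this out-of-range access is what actually triggers the crash is an assumption, not something confirmed here.

```python
def calculate_strides(shape):
    """Row-major strides, mirroring calculate_strides in the repro."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


shape = (4, 16, 28, 28)
strides = calculate_strides(shape)
num_elements = shape[0] * shape[1] * shape[2] * shape[3]

# Largest values of i, j, k reached by the three nested loops in the repro.
i_max = num_elements // (shape[3] * shape[2]) - 1
j_max = shape[2] - 1
k_max = shape[3] - 1

# Same position formula as pos_res / pos_a / pos_b in the repro.
max_pos = i_max * strides[0] + j_max * strides[1] + k_max * strides[2]

print("strides:", strides)
print("num_elements:", num_elements)
print("largest computed position:", max_pos)
print("in bounds:", max_pos < num_elements)
```

If the intent was to walk the last two dimensions with `j` and `k`, the position would presumably use the last strides instead, but that is a guess about the intended indexing and does not affect the reported slowdown.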

System information

- What OS did you install Mojo on? Pop!_OS
- Provide version information for Mojo by pasting the output of `mojo -v`: mojo 24.1.0 (55ec12d6)
- Provide Modular CLI version by pasting the output of `modular -v`: modular 0.5.1 (1b608e3d)

Mar 13 '24 05:03 andresnowak