
Regarding the performance of tensorwise

Open · justanhduc opened this issue on Dec 14, 2019 · 1 comment

I found out that tensorwise actually just runs a for loop over the nested tensors. I benchmarked tensorwise against map, a list comprehension, and an explicit for loop. (Un)surprisingly, tensorwise performs much slower than the others. Here is the benchmark:

import torch as T
import nestedtensor as nt

# Per-pair MSE loss shared by all of the variants below
crit = lambda x, y: T.mean((x - y) ** 2)


# tensorwise applies crit to each pair of constituent tensors
# (currently via a Python-level loop, as noted above)
@nt.tensorwise()
def loss_nt(a, b):
    return crit(a, b)


# Baselines: apply crit pairwise with map, a list comprehension, and an explicit loop
def loss_map(a, b):
    return sum(map(crit, a, b)) / len(a)


def loss_for(a, b):
    return sum([crit(a_, b_) for a_, b_ in zip(a, b)]) / len(a)


def loss_expfor(a, b):
    loss = []
    for a_, b_ in zip(a, b):
        loss.append(crit(a_, b_))
    return sum(loss) / len(loss)


# Two identical (64, 5000, 3) float32 tensors on the GPU
p1 = T.arange(64 * 5000 * 3).cuda().view(64, 5000, 3).float()
p2 = T.arange(64 * 5000 * 3).cuda().view(64, 5000, 3).float()

# Split each batch into 64 per-sample tensors of shape (1, 5000, 3)
p1_list = list(p1[:, None])
p2_list = list(p2[:, None])

# Wrap the same per-sample tensors in NestedTensors
p1_nt = nt.as_nested_tensor(p1_list).cuda()
p2_nt = nt.as_nested_tensor(p2_list).cuda()

start = T.cuda.Event(enable_timing=True)
end = T.cuda.Event(enable_timing=True)

# Time each implementation with CUDA events; synchronize before reading each elapsed time
for i in range(100):
    start.record()
    loss_nt(p1_nt, p2_nt)
    end.record()
    T.cuda.synchronize()
    total_nt = start.elapsed_time(end)

    start.record()
    loss_map(p1_list, p2_list)
    end.record()
    T.cuda.synchronize()
    total_map = start.elapsed_time(end)

    start.record()
    loss_for(p1_list, p2_list)
    end.record()
    T.cuda.synchronize()
    total_for = start.elapsed_time(end)

    start.record()
    crit(p1, p2)
    end.record()
    T.cuda.synchronize()
    total = start.elapsed_time(end)

    start.record()
    loss_expfor(p1_list, p2_list)
    end.record()
    T.cuda.synchronize()
    total_expfor = start.elapsed_time(end)

    print(i, total_nt, total_map, total_for, total_expfor, total)

Is this because tensorwise is not implemented in C++ yet? If the current implementation of tensorwise is final, then I wonder whether it is just for convenience, not for performance?

justanhduc · Dec 14, 2019

Thanks for posting this issue!

Indeed, tensorwise is slow at the moment because it is not written in C++. I'm working on a C++ version that accepts a JIT-ed function and then executes it efficiently, which should remove a lot of the overhead and help with performance.
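
To give a rough idea, here is a minimal sketch of the kind of JIT-ed (TorchScript) function such a path could accept. torch.jit.script is standard PyTorch; the crit_scripted and loss_nt_scripted names are made up for illustration, and whether tensorwise will consume a scripted function in exactly this way is not final.

import torch as T
import nestedtensor as nt

# TorchScript-compile the per-pair loss so a C++ dispatcher could invoke it
# without bouncing back into the Python interpreter for every constituent.
@T.jit.script
def crit_scripted(x: T.Tensor, y: T.Tensor) -> T.Tensor:
    return T.mean((x - y) ** 2)

# Hypothetical usage: same decorator-style call as the current Python API.
# The efficient C++ execution path described above does not exist yet.
loss_nt_scripted = nt.tensorwise()(crit_scripted)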

EDIT: I'll get to this after moving the NestedTensor class into C++

cpuhrsch · Dec 16, 2019