Regarding the performance of tensorwise
I found out that tensorwise actually just runs a for loop over the nested tensors. I benchmarked tensorwise against map, a list comprehension, an explicit for loop, and a plain batched call on the dense tensors. (Un)surprisingly, tensorwise is much slower than the others. Here is the benchmark:
import torch as T
import nestedtensor as nt

# Per-example loss: mean squared error between two tensors.
crit = lambda x, y: T.mean((x - y) ** 2)

# tensorwise version operating on NestedTensors
@nt.tensorwise()
def loss_nt(a, b):
    return crit(a, b)

# map over lists of per-example tensors
def loss_map(a, b):
    return sum(map(crit, a, b)) / len(a)

# list comprehension over lists of per-example tensors
def loss_for(a, b):
    return sum([crit(a_, b_) for a_, b_ in zip(a, b)]) / len(a)

# explicit for loop over lists of per-example tensors
def loss_expfor(a, b):
    loss = []
    for a_, b_ in zip(a, b):
        loss.append(crit(a_, b_))
    return sum(loss) / len(loss)

# 64 examples of shape (1, 5000, 3) as a dense tensor, as a list of
# per-example tensors, and as a NestedTensor.
p1 = T.arange(64 * 5000 * 3).cuda().view(64, 5000, 3).float()
p2 = T.arange(64 * 5000 * 3).cuda().view(64, 5000, 3).float()
p1_list = list(p1[:, None])
p2_list = list(p2[:, None])
p1_nt = nt.as_nested_tensor(p1_list).cuda()
p2_nt = nt.as_nested_tensor(p2_list).cuda()

start = T.cuda.Event(enable_timing=True)
end = T.cuda.Event(enable_timing=True)

for i in range(100):
    # tensorwise / NestedTensor
    start.record()
    loss_nt(p1_nt, p2_nt)
    end.record()
    T.cuda.synchronize()
    total_nt = start.elapsed_time(end)

    # map
    start.record()
    loss_map(p1_list, p2_list)
    end.record()
    T.cuda.synchronize()
    total_map = start.elapsed_time(end)

    # list comprehension
    start.record()
    loss_for(p1_list, p2_list)
    end.record()
    T.cuda.synchronize()
    total_for = start.elapsed_time(end)

    # plain batched call on the dense tensors
    start.record()
    crit(p1, p2)
    end.record()
    T.cuda.synchronize()
    total = start.elapsed_time(end)

    # explicit for loop
    start.record()
    loss_expfor(p1_list, p2_list)
    end.record()
    T.cuda.synchronize()
    total_expfor = start.elapsed_time(end)

    print(i, total_nt, total_map, total_for, total_expfor, total)
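For reference, the behaviour I'm seeing is consistent with a decorator along these lines. This is only a simplified sketch of the idea, not the actual nestedtensor implementation; it assumes the inputs can be split into constituents via unbind():

import torch

def tensorwise_sketch(fn):
    # Simplified sketch of a tensorwise-style decorator (hypothetical):
    # apply fn to each pair of constituent tensors in a Python-level loop,
    # so every constituent pays its own round of Python dispatch and a
    # separate kernel launch.
    def wrapper(a, b):
        return [fn(a_i, b_i) for a_i, b_i in zip(a.unbind(), b.unbind())]
    return wrapper

@tensorwise_sketch
def loss_sketch(x, y):
    return torch.mean((x - y) ** 2)

# Works on anything that supports unbind(), e.g. plain batched tensors here.
a = torch.randn(64, 5000, 3)
b = torch.randn(64, 5000, 3)
per_example = loss_sketch(a, b)  # 64 separate per-example evaluations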
Is it because tensorwise is not implemented in C++ yet?
If the current implementation of tensorwise is final, then I wonder whether tensorwise is meant for convenience rather than performance?
Thanks for posting this issue!
Indeed, tensorwise is slow at the moment because it is not written in C++. I'm working on a C++ version that accepts a JIT-ed function and executes it efficiently, which will remove a lot of the overhead and help with performance.
EDIT: I'll get to this after moving the NestedTensor class into C++.
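For concreteness, the per-tensor function would be something that can be compiled with TorchScript, roughly like below. Only the scripting step is shown; the C++ entry point that will consume the scripted function isn't settled yet, so treat this as an illustration rather than the final API:

import torch

@torch.jit.script
def crit_scripted(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Same per-tensor loss as in the benchmark above, but compiled by
    # TorchScript so a C++ tensorwise dispatcher could invoke it without
    # going back through the Python interpreter for every constituent.
    return torch.mean((x - y) ** 2)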