Horace He
@wizyoung how are you setting the chunk size? I wasn't able to get the liger kernel to perform much better even when changing the chunk size.
@wizyoung I agree there's some additional memory overhead (in particular, I think we don't inplace the addmm), but the additional memory is generally pretty negligible here, no? For example, if...
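(The tweet above is cut off. For context, here is a minimal sketch of the allocation difference being discussed — out-of-place `torch.addmm` versus the in-place `Tensor.addmm_`. The shapes are made up for illustration; this is not the kernel in question.)

```
import torch

x = torch.randn(1024, 1024, device='cuda')
w = torch.randn(1024, 1024, device='cuda')
bias = torch.randn(1024, 1024, device='cuda')

# Out-of-place: addmm allocates a fresh 1024x1024 output buffer
# to hold bias + x @ w.
out = torch.addmm(bias, x, w)

# In-place: accumulates x @ w directly into `bias`,
# skipping that extra allocation.
bias.addmm_(x, w)
```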
@ekojsalim In my brief testing, it seems like it's both faster and more memory-efficient.
@wizyoung Can you post your benchmark script?
@merrymercy We run on nerfed H100s internally at Meta with only 2.4 TB/s of bandwidth, so these numbers aren't 1:1 comparable. But it's a good comparison :)
```
src_tensor = torch.tensor([0, 0, 0], device='cuda')
index_tensor = torch.tensor([0, 1, 2, 2, 1, 0], device='cuda')
to_add_tensor = torch.tensor([1, 1, 1, 1, 1, 1], device='cuda')

@torch.compile
def f(a, b, c):...
```
```
src_tensor = torch.tensor([0, 0, 0], device='cuda', dtype=torch.float32)
index_tensor = torch.tensor([0, 1, 2, 2, 1, 0], device='cuda')
to_add_tensor = torch.tensor([1, 1, 1, 1, 1, 1], device='cuda')

@torch.compile
def f(a, b,...
```
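(Both snippets above are truncated by the feed, so the body of `f` is lost. As a self-contained guess at what such a snippet exercises — assuming `f` scatter-accumulates `c` into `a` at the deliberately duplicated indices `b` via `index_put_` with `accumulate=True` — here is a runnable completion. The function body is mine, not from the original tweet.)

```
import torch

src_tensor = torch.tensor([0, 0, 0], device='cuda', dtype=torch.float32)
index_tensor = torch.tensor([0, 1, 2, 2, 1, 0], device='cuda')
to_add_tensor = torch.tensor([1., 1., 1., 1., 1., 1.], device='cuda')

@torch.compile
def f(a, b, c):
    # Hypothetical body: accumulate c into a at positions b.
    # Every index appears twice, so each slot of `a` receives two adds.
    return a.index_put_((b,), c, accumulate=True)

print(f(src_tensor, index_tensor, to_add_tensor))  # tensor([2., 2., 2.], device='cuda:0')
```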
I polished up my fix over here: https://github.com/VSCodeVim/Vim/pull/1552 It does some hackish stuff, but it seems to work great for me. Here's a pretty bad demo: 
We just pushed out a new version, so enable `vim.foldfix` and tell us what you think! It's a hack and not a proper solution to this problem, so we'll probably...
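(If you want to try it: the setting presumably goes in your VS Code settings.json like any other extension option — the key name is taken from the comment above, everything else is standard VS Code configuration.)

```
{
  // Enables the VSCodeVim fold workaround mentioned above.
  "vim.foldfix": true
}
```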
Has anybody tried out the new fix yet? Do you guys think it's workable as a proper fix (i.e., this issue can be closed)?