Horace He
@wizyoung how are you setting the chunk size? I wasn't able to get the liger kernel to perform much better even when changing the chunk size.
@wizyoung I agree there's some additional memory overhead (in particular, I think we don't inplace the addmm), but the additional memory is generally pretty negligible here, no? For example, if...
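(The tweet above is cut off. For context, here is a minimal sketch of the allocation difference being discussed — out-of-place `torch.addmm` versus the in-place `Tensor.addmm_`. The shapes are made up for illustration; this is not the kernel in question.)

```
import torch

x = torch.randn(1024, 1024, device='cuda')
w = torch.randn(1024, 1024, device='cuda')
bias = torch.randn(1024, 1024, device='cuda')

# Out-of-place: addmm allocates a fresh 1024x1024 output buffer
# to hold bias + x @ w.
out = torch.addmm(bias, x, w)

# In-place: accumulates x @ w directly into `bias`,
# skipping that extra allocation.
bias.addmm_(x, w)
```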
@ekojsalim In my brief testing, it seems like it's both faster and more memory-efficient.
@wizyoung Can you post your benchmark script?
@merrymercy We run on nerfed H100s internally at Meta with only 2.4 TB/s of bandwidth, so these numbers aren't 1:1 comparable. But it's a good comparison :)
```
src_tensor = torch.tensor([0, 0, 0], device='cuda')
index_tensor = torch.tensor([0, 1, 2, 2, 1, 0], device='cuda')
to_add_tensor = torch.tensor([1, 1, 1, 1, 1, 1], device='cuda')

@torch.compile
def f(a, b, c):...
```
```
src_tensor = torch.tensor([0, 0, 0], device='cuda', dtype=torch.float32)
index_tensor = torch.tensor([0, 1, 2, 2, 1, 0], device='cuda')
to_add_tensor = torch.tensor([1, 1, 1, 1, 1, 1], device='cuda')

@torch.compile
def f(a, b,...
```
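(Both snippets above are truncated by the feed, so the body of `f` is lost. As a self-contained guess at what such a snippet exercises — assuming `f` scatter-accumulates `c` into `a` at the deliberately duplicated indices `b` via `index_put_` with `accumulate=True` — here is a runnable completion. The function body is mine, not from the original tweet.)

```
import torch

src_tensor = torch.tensor([0, 0, 0], device='cuda', dtype=torch.float32)
index_tensor = torch.tensor([0, 1, 2, 2, 1, 0], device='cuda')
to_add_tensor = torch.tensor([1., 1., 1., 1., 1., 1.], device='cuda')

@torch.compile
def f(a, b, c):
    # Hypothetical body: accumulate c into a at positions b.
    # Every index appears twice, so each slot of `a` receives two adds.
    return a.index_put_((b,), c, accumulate=True)

print(f(src_tensor, index_tensor, to_add_tensor))  # tensor([2., 2., 2.], device='cuda:0')
```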
I polished up my fix over here: https://github.com/VSCodeVim/Vim/pull/1552 It does some hackish stuff, but it seems to work great for me. Here's a pretty bad demo: 
We just pushed out a new version, so enable `vim.foldfix` and tell us what you think! It's a hack and not a proper solution to this problem, so we'll probably...
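(If you want to try it: the setting presumably goes in your VS Code settings.json like any other extension option — the key name is taken from the comment above, everything else is standard VS Code configuration.)

```
{
  // Enables the VSCodeVim fold workaround mentioned above.
  "vim.foldfix": true
}
```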
Has anybody tried out the new fix yet? Do you guys think it's workable as a proper fix (i.e., this issue can be closed)?