Trevor Gale

40 comments by Trevor Gale

I believe xformers had some modifications to Sputnik kernels so it'd probably make sense to just package those up in the xformers conda package? I'd be open to adding a...

Thanks for the PR! Do you have a unit test we could add to verify the PR fixes the checkpointing issue and make sure we do not break it again...

Hi, thanks! Is the matrix that you have encoded in CSR block-sparse? It will need to be to exploit BCSR and block-sparse computation. If it is, the conversion should be...

Oh interesting! Is there a reason you can't use the built-in types/ops for unstructured sparsity in PyTorch?

Hi Eric, there are a few places where we still use CUDA kernels. They're not difficult to replace with Torch ops/Triton kernels, but I'm a bit buried in various other...

Eric, do you know that Scatter MoE is beneficial for your use case or are you interested based on the results from the paper? If the former, it would be...

Thanks, Eric. Can you share more about your use case so that we can include it in our analysis? Scripts to reproduce would be excellent, if possible :)

Hi! Could you share a longer version of the error? Fwiw, the MoE scripts are a bit out of date. I recommend using dMoE, which should work and will be...

We support data, expert and pipeline parallelism in our Megatron integration. We have users training some pretty large models :)

Can you elaborate on what you're trying to do? That script is from Megatron-LM, presumably?