Tri Dao
The code should already be using nn.Conv1d
As mentioned in the README, causal-conv1d is optional
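To illustrate the fallback, here is a minimal sketch of a depthwise causal conv1d that uses the fused causal_conv1d kernel when it is installed and plain nn.Conv1d otherwise. The dimensions, shapes, and the SiLU activation are assumptions chosen for the example, not the repo's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

try:
    from causal_conv1d import causal_conv1d_fn  # optional fused kernel
except ImportError:
    causal_conv1d_fn = None

d_model, d_conv, seqlen = 64, 4, 128
device = "cuda" if torch.cuda.is_available() else "cpu"
# Depthwise conv (groups == channels); left padding of d_conv - 1 keeps it causal.
conv1d = nn.Conv1d(d_model, d_model, kernel_size=d_conv, groups=d_model,
                   padding=d_conv - 1, bias=True).to(device)

x = torch.randn(2, d_model, seqlen, device=device)  # (batch, dim, seqlen)
if causal_conv1d_fn is not None and x.is_cuda:
    # Fused path: weight reshaped from (dim, 1, width) to (dim, width).
    y = causal_conv1d_fn(x, conv1d.weight.squeeze(1), conv1d.bias, activation="silu")
else:
    # Pure-PyTorch path: run the padded conv, then trim the extra positions on the right.
    y = F.silu(conv1d(x)[..., :seqlen])
print(y.shape)  # (2, d_model, seqlen) either way
```

This only shows the pattern; the Mamba block in the repo does the equivalent internally.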
You can look around for similar issues reported on GitHub.
The conv1d helps a bit with perplexity: e.g. at 360M params on 7B tokens of the Pile, using the GPT2 tokenizer, with conv1d we get around 8.6 perplexity and without conv1d it is around...
If you use a large model, the Triton overhead will be negligible.
Can you give a short script to reproduce the issue? E.g. for these specific tensors, the gradients are wrong / very large.
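For reference, a hypothetical repro might look like the sketch below (model size and shapes are placeholders); printing the gradient norms makes it easy to point at which ones look wrong or blown up:

```python
import torch
from mamba_ssm import Mamba  # assumes the package from this repo is installed

torch.manual_seed(0)
model = Mamba(d_model=64, d_state=16, d_conv=4, expand=2).cuda()
x = torch.randn(2, 128, 64, device="cuda", requires_grad=True)  # (batch, seqlen, dim)
model(x).sum().backward()
print("input grad norm:", x.grad.norm().item())
for name, p in model.named_parameters():
    if p.grad is not None:
        print(name, p.grad.norm().item())
```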
You can try not using HF's transformers but instead use the code from this repo, as shown in the README.
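For example, something along these lines loads a checkpoint through the repo's own MambaLMHeadModel rather than HF's transformers implementation (the checkpoint name, device, dtype, and dummy inputs here are just assumptions to make the sketch concrete):

```python
import torch
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-130m",
                                         device="cuda", dtype=torch.float16)
input_ids = torch.randint(0, 50277, (1, 64), device="cuda")  # dummy tokens for the sketch
logits = model(input_ids).logits  # forward pass through the repo's implementation
print(logits.shape)  # (1, 64, vocab_size)
```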
No, the context length is whatever sequence length you use as input. We typically use a kernel size of 2, 3, or 4 for the conv1d.
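As a quick illustration (the dims here are made up), d_conv sets the conv kernel width when you construct the block, and the same block then runs on whatever sequence length you feed it:

```python
import torch
from mamba_ssm import Mamba  # assumes the package from this repo is installed

block = Mamba(d_model=64, d_state=16, d_conv=4, expand=2).cuda()  # d_conv = conv kernel size
for seqlen in (256, 2048, 8192):
    x = torch.randn(1, seqlen, 64, device="cuda")
    print(seqlen, block(x).shape)  # no fixed context length
```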