Tri Dao

Results: 440 comments of Tri Dao

The code should already be using nn.Conv1d
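For reference, a minimal sketch of the kind of depthwise causal convolution that nn.Conv1d can express (the sizes here are made up for illustration; the trick is to pad by kernel_size - 1 and trim the right so each position only sees the past):

```python
import torch
import torch.nn as nn

batch, d_model, d_conv, seqlen = 2, 16, 4, 128  # illustrative sizes

# Depthwise conv made causal: nn.Conv1d pads both sides by d_conv - 1,
# so the extra right-side outputs are trimmed below.
conv1d = nn.Conv1d(
    in_channels=d_model,
    out_channels=d_model,
    kernel_size=d_conv,
    groups=d_model,       # depthwise: one filter per channel
    padding=d_conv - 1,
)

x = torch.randn(batch, d_model, seqlen)   # (batch, channels, seqlen)
y = conv1d(x)[..., :seqlen]               # keep only the causal outputs
print(y.shape)                            # torch.Size([2, 16, 128])
```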

As mentioned in the README, causal-conv1d is optional
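A sketch of the optional-dependency pattern, assuming the `causal_conv1d_fn` interface from the causal-conv1d package; when the package isn't installed, the fallback is plain PyTorch:

```python
import torch
import torch.nn.functional as F

# Fast path if the optional package is installed, plain PyTorch otherwise.
try:
    from causal_conv1d import causal_conv1d_fn
except ImportError:
    causal_conv1d_fn = None

def causal_conv(x, weight, bias=None):
    # x: (batch, dim, seqlen); weight: (dim, d_conv)
    if causal_conv1d_fn is not None:
        return causal_conv1d_fn(x, weight, bias, activation="silu")
    dim, d_conv = weight.shape
    y = F.conv1d(x, weight.unsqueeze(1), bias, padding=d_conv - 1, groups=dim)
    return F.silu(y[..., : x.shape[-1]])
```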

You can look around for similar issues reported on GitHub.

conv1d helps a bit with perplexity: e.g. with a 360M model trained on 7B tokens of the Pile, using the GPT-2 tokenizer, with conv1d we get around 8.6 perplexity and without conv1d it's around...

If you use a large model, the Triton overhead will be negligible.
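If you want to check this for your own setup, a simple CUDA-event timing sketch (the model and input here are throwaway placeholders; kernel-launch overhead is roughly fixed per call, so its share of the measured time shrinks as the model grows):

```python
import torch
import torch.nn as nn

def time_forward(model, x, iters=100):
    """Average forward-pass time in milliseconds using CUDA events."""
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    with torch.no_grad():
        for _ in range(iters):
            model(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Throwaway example; swap in the model you actually care about.
model = nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")
print(time_forward(model, x), "ms per forward")
```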

Can you give a short script to reproduce the issue? E.g. for these specific tensors, the gradients are wrong / very large.
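Something along these lines is enough; the two functions below are trivial stand-ins, so swap in the op you suspect and a pure-PyTorch reference, then compare outputs and gradients on a small fixed input:

```python
import torch

torch.manual_seed(0)

# Stand-ins: replace with the kernel under suspicion and a reference impl.
def op_under_test(x):
    return torch.nn.functional.silu(x)

def reference_op(x):
    return x * torch.sigmoid(x)

x = torch.randn(2, 64, 128, dtype=torch.double, requires_grad=True)
y1, y2 = op_under_test(x), reference_op(x)

grad_out = torch.randn_like(y1)
(g1,) = torch.autograd.grad(y1, x, grad_out, retain_graph=True)
(g2,) = torch.autograd.grad(y2, x, grad_out)

print("max output diff:", (y1 - y2).abs().max().item())
print("max grad diff:  ", (g1 - g2).abs().max().item())
```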

You can try not using HF's transformers and instead use the code from this repo, as shown in the README.
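For example, roughly following the README's generation example (the model name and sampling settings here are illustrative, not a recommendation):

```python
import torch
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
from transformers import AutoTokenizer

# Load the reference implementation from this repo rather than the HF port.
model = MambaLMHeadModel.from_pretrained(
    "state-spaces/mamba-130m", device="cuda", dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids.to("cuda")
out = model.generate(input_ids, max_length=32, top_k=10, top_p=0.9, temperature=0.7)
print(tokenizer.decode(out[0]))
```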

No, the context length is whatever sequence length you use as input. We typically use kernel sizes of 2, 3, or 4 for the conv1d.
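A quick sketch of the point: the kernel is tiny and there is no fixed context window, so the same conv handles any input length:

```python
import torch
import torch.nn as nn

d_model, d_conv = 16, 4   # kernel size 4; 2 or 3 work the same way
conv = nn.Conv1d(d_model, d_model, d_conv, groups=d_model, padding=d_conv - 1)

# Same conv, different sequence lengths; nothing is tied to a fixed context.
for seqlen in (128, 4096):
    x = torch.randn(1, d_model, seqlen)
    print(seqlen, conv(x)[..., :seqlen].shape)
```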