Tri Dao
The code should already be using nn.Conv1d
As mentioned in the README, causal-conv1d is optional
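To illustrate the fallback, here is a minimal sketch of a depthwise causal conv1d that uses the fused causal_conv1d kernel when it is installed and plain nn.Conv1d otherwise. The dimensions, shapes, and the SiLU activation are assumptions chosen for the example, not the repo's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

try:
    from causal_conv1d import causal_conv1d_fn  # optional fused kernel
except ImportError:
    causal_conv1d_fn = None

d_model, d_conv, seqlen = 64, 4, 128
device = "cuda" if torch.cuda.is_available() else "cpu"
# Depthwise conv (groups == channels); left padding of d_conv - 1 keeps it causal.
conv1d = nn.Conv1d(d_model, d_model, kernel_size=d_conv, groups=d_model,
                   padding=d_conv - 1, bias=True).to(device)

x = torch.randn(2, d_model, seqlen, device=device)  # (batch, dim, seqlen)
if causal_conv1d_fn is not None and x.is_cuda:
    # Fused path: weight reshaped from (dim, 1, width) to (dim, width).
    y = causal_conv1d_fn(x, conv1d.weight.squeeze(1), conv1d.bias, activation="silu")
else:
    # Pure-PyTorch path: run the padded conv, then trim the extra positions on the right.
    y = F.silu(conv1d(x)[..., :seqlen])
print(y.shape)  # (2, d_model, seqlen) either way
```

This only shows the pattern; the Mamba block in the repo does the equivalent internally.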
You can look around for similar issues reported on GitHub.
The conv1d helps a bit with perplexity: e.g. at 360M params on 7B tokens of the Pile, using the GPT2 tokenizer, with conv1d we get around 8.6 perplexity and without conv1d it is around...
If you use a large model, the Triton overhead will be negligible.
Can you give a short script to reproduce the issue? E.g. for these specific tensors, the gradients are wrong / very large.
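For reference, a hypothetical repro might look like the sketch below (model size and shapes are placeholders); printing the gradient norms makes it easy to point at which ones look wrong or blown up:

```python
import torch
from mamba_ssm import Mamba  # assumes the package from this repo is installed

torch.manual_seed(0)
model = Mamba(d_model=64, d_state=16, d_conv=4, expand=2).cuda()
x = torch.randn(2, 128, 64, device="cuda", requires_grad=True)  # (batch, seqlen, dim)
model(x).sum().backward()
print("input grad norm:", x.grad.norm().item())
for name, p in model.named_parameters():
    if p.grad is not None:
        print(name, p.grad.norm().item())
```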
You can try not using HF's transformers but instead use the code from this repo, as shown in the README.
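For example, something along these lines loads a checkpoint through the repo's own MambaLMHeadModel rather than HF's transformers implementation (the checkpoint name, device, dtype, and dummy inputs here are just assumptions to make the sketch concrete):

```python
import torch
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-130m",
                                         device="cuda", dtype=torch.float16)
input_ids = torch.randint(0, 50277, (1, 64), device="cuda")  # dummy tokens for the sketch
logits = model(input_ids).logits  # forward pass through the repo's implementation
print(logits.shape)  # (1, 64, vocab_size)
```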
No, the context length is whatever sequence length you use as input. We typically use a kernel size of 2, 3, or 4 for the conv1d.
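As a quick illustration (the dims here are made up), d_conv sets the conv kernel width when you construct the block, and the same block then runs on whatever sequence length you feed it:

```python
import torch
from mamba_ssm import Mamba  # assumes the package from this repo is installed

block = Mamba(d_model=64, d_state=16, d_conv=4, expand=2).cuda()  # d_conv = conv kernel size
for seqlen in (256, 2048, 8192):
    x = torch.randn(1, seqlen, 64, device="cuda")
    print(seqlen, block(x).shape)  # no fixed context length
```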