Tri Dao

Results: 432 comments by Tri Dao

@OliverHxh do you have an idea on how to do discretization in pytorch while remaining efficient?
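One way to keep discretization efficient in plain tensor ops is to vectorize the zero-order-hold (ZOH) update across the whole sequence instead of looping. A minimal NumPy sketch, assuming hypothetical shapes `delta: (L, d)`, `A: (d, n)`, `B: (L, n)` (and the simplified Euler-style step for B used in the Mamba paper):

```python
import numpy as np

def discretize(delta, A, B):
    """Vectorized discretization, no Python loop over the sequence.

    delta: (L, d) step sizes, A: (d, n) state matrix (diagonal per channel),
    B: (L, n) input matrix. Shapes are illustrative assumptions.
    """
    dA = np.exp(delta[..., None] * A)            # (L, d, n): A_bar = exp(delta * A)
    dB = delta[..., None] * B[:, None, :]        # (L, d, n): B_bar ~ delta * B (Euler)
    return dA, dB
```

Since every element of `dA`/`dB` is independent, this maps well onto batched GPU kernels; the sequential dependence only enters in the scan itself.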

> "We use the Pile dataset (L. Gao, Biderman, et al. 2020), and follow the training recipe described in Brown et al. (2020)." Is this for the transformer models, or...

In general, yes. Which flavor of sequence parallelism are you referring to? The one in Megatron-LM?

Nothing is built-in, but it'll be implemented in the future.

Do you want to send a PR for the conv1d? The selective_scan operation is also implemented in CUDA, but there's a reference implementation in PyTorch (probably quite slow).
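For intuition, the reference implementation amounts to a plain sequential loop over the sequence. A NumPy sketch (not the repo's actual code; shapes and names are illustrative) of why it is slow, with `u: (L, d)` inputs, `delta: (L, d)`, `A: (d, n)`, `B, C: (L, n)`:

```python
import numpy as np

def selective_scan_ref(u, delta, A, B, C):
    """Naive selective scan: h_t = A_bar_t * h_{t-1} + B_bar_t * u_t, y_t = C_t . h_t."""
    L, d = u.shape
    n = A.shape[1]
    dA = np.exp(delta[..., None] * A)                      # (L, d, n)
    dBu = delta[..., None] * B[:, None, :] * u[..., None]  # (L, d, n)
    h = np.zeros((d, n))
    ys = []
    for t in range(L):  # sequential over seq_len -- this loop is the bottleneck
        h = dA[t] * h + dBu[t]
        ys.append((h * C[t][None, :]).sum(-1))             # y_t = C_t . h_t per channel
    return np.stack(ys)                                    # (L, d)
```

The CUDA kernel fuses the discretization and the recurrence and avoids materializing the `(L, d, n)` intermediates, which is where most of the speedup comes from.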

Yes, there should be ways to deal with variable lengths. It's not implemented yet, however.

It's theoretically possible to process variable lengths / packed sequences, but the implementation will be a bit tricky. Parallelizing over the seq_len dimension reduces to how one would parallelize an associative scan...
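Concretely, the recurrence h_t = a_t * h_{t-1} + b_t is associative under the composition (a1, b1) then (a2, b2) -> (a2*a1, a2*b1 + b2), so it can be computed in O(log L) parallel steps instead of L sequential ones. A minimal NumPy sketch of a Hillis-Steele-style doubling scan (illustrative, not the CUDA kernel):

```python
import numpy as np

def linear_recurrence_scan(a, b):
    """Solve h_t = a_t * h_{t-1} + b_t (with h_{-1} = 0) via an associative scan.

    Each doubling step composes each element with the one `step` positions
    to its left; after ceil(log2(L)) steps, b[t] holds h_t.
    """
    a = a.astype(float).copy()
    b = b.astype(float).copy()
    L = len(a)
    step = 1
    while step < L:
        a_prev = np.concatenate([np.ones(step), a[:-step]])    # identity pad on the left
        b_prev = np.concatenate([np.zeros(step), b[:-step]])
        b = a * b_prev + b      # (a_prev, b_prev) composed with (a, b)
        a = a * a_prev
        step *= 2
    return b
```

For packed variable-length sequences, the extra trickiness is resetting the state at sequence boundaries, e.g. by forcing a_t = 0 at the first token of each sequence so no state leaks across documents.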

The model seems very small, but the GPU also only has 4GB of memory? Maybe try different layers (e.g. MLP) of similar sizes to see if those also OOM. If...

We don't have experience with TensorFlow, but we welcome contributions.