Alexander Wettig
## 🐛 Bug

The `target_tokens` variable in the forward call of the Data2VecTextEncoder model contains only the tokens at masked positions and padding tokens otherwise. In the method described in...
Hey! I'm a big fan of the flash attention varlen kernels, and they are fantastic for saving the memory & compute of pad tokens. When training with fixed batches of...
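The saving comes from packing: pad tokens are dropped, the remaining tokens from all sequences are concatenated into one flat buffer, and the varlen kernel is told where each sequence starts via cumulative sequence lengths. A minimal sketch of that bookkeeping, assuming a pad token id of `0` (illustrative; the helper names are mine, not from flash-attn, though the `cu_seqlens` layout matches what its varlen kernels expect):

```python
PAD_ID = 0  # assumed pad token id for this sketch

def unpad_batch(padded_batch):
    """Drop pad tokens from a padded batch; return the packed token
    stream plus per-sequence lengths."""
    packed, seq_lens = [], []
    for row in padded_batch:
        toks = [t for t in row if t != PAD_ID]
        packed.extend(toks)
        seq_lens.append(len(toks))
    return packed, seq_lens

def build_cu_seqlens(seq_lens):
    """Cumulative sequence lengths: entry i / i+1 bracket sequence i
    inside the packed buffer (the layout varlen kernels consume)."""
    cu = [0]
    for n in seq_lens:
        cu.append(cu[-1] + n)
    return cu

batch = [[5, 6, 7, 0, 0], [8, 9, 0, 0, 0], [1, 2, 3, 4, 0]]
packed, seq_lens = unpad_batch(batch)
print(packed)                       # [5, 6, 7, 8, 9, 1, 2, 3, 4]
print(build_cu_seqlens(seq_lens))   # [0, 3, 5, 9]
```

Attention over `packed` then touches 9 tokens instead of the 15 slots in the padded batch, which is where the memory and compute savings come from.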
## Environment

- mosaicml-streaming==0.7.5

## To reproduce

Steps to reproduce the behavior:

1. Use `StreamingDataset` in distributed training with the same seed and set `replication` either to None or an...