My guess is there is no difference, based on how the masks are used in the [Attention class](https://github.com/lucidrains/x-transformers/blob/c1283da7f4d87ecfe583f305f7c988097987766c/x_transformers/x_transformers.py#L384)
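For reference, a minimal sketch of how a boolean padding mask is typically passed through `x-transformers` before it reaches that Attention class (the model sizes and kwargs here are illustrative, not taken from this thread):

```python
import torch
from x_transformers import TransformerWrapper, Encoder

model = TransformerWrapper(
    num_tokens=256,
    max_seq_len=128,
    attn_layers=Encoder(dim=64, depth=2, heads=4),
)

tokens = torch.randint(0, 256, (2, 128))

# Boolean key-padding mask: True = real token, False = padding.
mask = torch.ones(2, 128, dtype=torch.bool)
mask[1, 100:] = False  # second sequence has 28 padded positions

out = model(tokens, mask=mask)  # padded positions are excluded inside attention
```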
Thanks! Do you know whether other implementations tend to do this as well? In `pytorch_geometric`, graphs can be batched so that memory usage scales with the total number of nodes/edges rather than with padded sequence length...
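For comparison, this is roughly what that `pytorch_geometric` batching looks like (a small sketch using `Batch.from_data_list`; the feature sizes are made up):

```python
import torch
from torch_geometric.data import Data, Batch

# Two graphs with different numbers of nodes (no padding needed).
g1 = Data(x=torch.randn(3, 16), edge_index=torch.tensor([[0, 1, 2], [1, 2, 0]]))
g2 = Data(x=torch.randn(5, 16), edge_index=torch.tensor([[0, 1, 3], [1, 2, 4]]))

# Nodes are concatenated and edge indices offset, so memory grows with the
# total node/edge count rather than batch_size * max_num_nodes.
batch = Batch.from_data_list([g1, g2])
print(batch.x.shape)  # torch.Size([8, 16])  -> 3 + 5 nodes
print(batch.batch)    # tensor([0, 0, 0, 1, 1, 1, 1, 1]) maps each node to its graph
```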
Thanks, that's a good solution! Will check it out.