long-context-attention
GPU Memory Usage
Hi, thanks for your awesome work. In my test on 8xA800, why does using USP with ulysses_degree=8 and ring_degree=1 take more GPU memory than naive Ulysses?
All2All needs some temporary buffers for async P2P. Could you post the memory difference? In my experience it is very small.
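For intuition, here is a back-of-envelope sketch of how large such temporary receive buffers could be if the ring path double-buffers K/V tensors for async P2P. Every shape, the buffer count, and the function name are hypothetical assumptions for illustration, not values taken from the library:

```python
# Hypothetical estimate of extra memory for async-P2P K/V receive buffers.
# Shapes, dtype size, and buffer count are illustrative assumptions only.

def ring_p2p_tmp_bytes(batch, seq_per_rank, num_heads, head_dim,
                       dtype_bytes=2, num_buffers=2):
    """Bytes for K and V receive buffers, assuming double-buffered P2P."""
    per_tensor = batch * seq_per_rank * num_heads * head_dim * dtype_bytes
    return num_buffers * 2 * per_tensor  # 2 tensors per slot: K and V

# Example: batch=1, 4096 tokens per rank, 32 heads of dim 128, fp16
extra = ring_p2p_tmp_bytes(1, 4096, 32, 128, dtype_bytes=2)
print(f"{extra / 2**20:.0f} MiB")  # → 128 MiB
```

Under these assumed shapes the overhead is on the order of a hundred MiB, which is small next to activations and KV cache at long context, consistent with the "very small" observation above.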