Xilun Wu
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #92069 * __->__ #91802 * #91801 * #91756
**What the problem is:** Both the single-node and sharded `TensorParallelMultiheadAttention` (#477) modules diverge (the forward output becomes `-inf` after fewer than 10 iterations). They also produce different forward outputs, of which the...
**What the problem is:**
- The sharded `TensorParallelMultiheadAttention` (#477) module fails to update the `proj.bias` parameter even though the back-propagated **gradient is correct**.
- This error does not occur on rank 0.

**How to...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #364
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #592
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #1160
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #1901 * #1897 * #1884 * #1883 * #1882
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #1901 * __->__ #1897 * #1884 * #1883 * #1882
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #1901 * #1897 * #1884 * __->__ #1883 * #1882 This PR uses the latest CP APIs to enable FlexAttention + CP for...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #1901 * #1897 * #1884 * #1883 * __->__ #1882 `freqs_cis` is sensitive to the sequence order. CP load balancing will shuffle the...
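The description above is truncated, so the PR's actual change is not shown here. As a minimal sketch of why a position-dependent rotation table is sensitive to sequence order, the toy example below builds a small table of complex rotations and applies it to tokens that have been reordered. The helper names (`toy_freqs_cis`, `apply_rotary`), the permutation `perm`, and the toy shapes are all assumptions made for illustration; they are not taken from the PR or from any library.

```python
import cmath


def toy_freqs_cis(head_dim, seq_len, theta=10000.0):
    """Toy table: one row of unit complex rotations per sequence position."""
    inv_freq = [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
    return [[cmath.exp(1j * pos * f) for f in inv_freq] for pos in range(seq_len)]


def apply_rotary(tokens, freqs_cis):
    """Multiply each token's complex features by the row at the same index."""
    return [[x * f for x, f in zip(tok, row)] for tok, row in zip(tokens, freqs_cis)]


def close(a, b, tol=1e-12):
    """Element-wise comparison of two lists of complex numbers."""
    return all(abs(u - v) <= tol for u, v in zip(a, b))


seq_len, head_dim = 8, 4
freqs_cis = toy_freqs_cis(head_dim, seq_len)

# Toy tokens: head_dim // 2 complex features per position.
tokens = [[complex(pos + 1, 0.5), complex(1.0, pos)] for pos in range(seq_len)]

# Hypothetical reordering standing in for a load-balancing shuffle (assumption).
perm = [0, 7, 1, 6, 2, 5, 3, 4]

# Reference result: tokens in their original order.
reference = apply_rotary(tokens, freqs_cis)

shuffled_tokens = [tokens[p] for p in perm]

# Inconsistent: shuffled tokens paired with the unshuffled table.
wrong = apply_rotary(shuffled_tokens, freqs_cis)

# Consistent: the table is reordered with the same permutation as the tokens.
shuffled_freqs = [freqs_cis[p] for p in perm]
right = apply_rotary(shuffled_tokens, shuffled_freqs)

matches_reference = all(close(right[i], reference[p]) for i, p in enumerate(perm))
has_mismatch = any(not close(wrong[i], reference[p]) for i, p in enumerate(perm))
```

In this sketch, `matches_reference` holds only when the rotation table follows the same permutation as the tokens. Pairing shuffled tokens with the original table rotates some tokens by the wrong position, which is the kind of mismatch the sentence above warns about.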