Xilun Wu

Results 11 issues of Xilun Wu

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #92069
* __->__ #91802
* #91801
* #91756

topic: not user facing

**What the problem is:** Both single-node and sharded `TensorParallelMultiheadAttention` (#477) modules diverge (the forward output becomes `-inf` after fewer than 10 iterations). They also produce different forward outputs, of which the...

**What the problem is:**
- The sharded `TensorParallelMultiheadAttention` (#477) module fails to update the `proj.bias` parameter even though the back-propagated **gradient is correct**.
- Also, this error doesn't occur on rank 0.

**How to...

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #364

CLA Signed

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #592

CLA Signed

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #1160

CLA Signed
module: context parallel

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #1901
* #1897
* #1884
* #1883
* #1882

CLA Signed

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #1901
* __->__ #1897
* #1884
* #1883
* #1882

CLA Signed

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #1901
* #1897
* #1884
* __->__ #1883
* #1882

This PR uses the latest CP APIs to enable FlexAttention + CP for...

CLA Signed

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #1901
* #1897
* #1884
* #1883
* __->__ #1882

`freqs_cis` is sensitive to the sequence order. CP load balancing will shuffle the...

CLA Signed