Scott Wolchok
@pytorchbot merge -f "told OK to bypass by @atalman "
> @swolchok has imported this pull request. If you are a Meta employee, you can view this diff [on Phabricator](https://www.internalfb.com/diff/D56473915).

The import failed due to conflicts.
FP16 is disproportionately slow on x86 as well; a similar approach should improve performance there.
> FP16 on x86

Concretely, for stories110M (llama3.2-1b took longer than I was willing to wait for fp16):

```
fp32:
Average tokens/sec (total): 65.81
Average tokens/sec (first token): 27.89
Average...
```
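For context, a minimal scalar sketch of the idea behind the fast path (not the actual kernel; assumes a compiler with `_Float16` support, e.g. GCC 12+/Clang 15+ on x86-64): widen each fp16 operand to fp32 once and do all the arithmetic in fp32, instead of paying for fp16 arithmetic emulation on every multiply-add.

```cpp
#include <cstddef>

// Sketch only: y[i] = dot(A[i, :], x) for an m-by-n row-major fp16
// matrix A and fp16 vector x, accumulating in fp32. The real fast
// path is vectorized; this shows just the widen-then-accumulate idea.
void gemv_fp16_fp32acc(const _Float16* A, const _Float16* x,
                       float* y, std::size_t m, std::size_t n) {
  for (std::size_t i = 0; i < m; ++i) {
    float acc = 0.0f;  // fp32 accumulator: faster and more accurate
    const _Float16* row = A + i * n;
    for (std::size_t j = 0; j < n; ++j) {
      acc += static_cast<float>(row[j]) * static_cast<float>(x[j]);
    }
    y[i] = acc;
  }
}
```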
I've started work on generalizing the ARM fp16/bf16 gemv fast path code to use at::vec::Vectorized, which will lead to generalizing it to x86 and using it over MKL when cpuinfo...
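Roughly, the goal is an inner loop written against `at::vec::Vectorized` so one source file compiles to NEON on ARM and AVX2/AVX-512 on x86. A hedged sketch of what that looks like (the inputs here are already widened to fp32 to keep it short; the real kernels also handle the fp16/bf16-to-fp32 conversion and use per-architecture reductions):

```cpp
#include <ATen/cpu/vec/vec.h>
#include <cstdint>

// Sketch only: architecture-neutral fp32 dot product on top of
// at::vec::Vectorized. Not the actual PyTorch kernel.
float vec_dot_fp32(const float* a, const float* b, int64_t n) {
  using Vec = at::vec::Vectorized<float>;
  Vec acc(0.0f);
  int64_t i = 0;
  for (; i + Vec::size() <= n; i += Vec::size()) {
    acc = at::vec::fmadd(Vec::loadu(a + i), Vec::loadu(b + i), acc);
  }
  // Horizontal sum of the accumulator lanes.
  float lanes[Vec::size()];
  acc.store(lanes);
  float sum = 0.0f;
  for (int64_t k = 0; k < Vec::size(); ++k) {
    sum += lanes[k];
  }
  // Scalar tail for the remainder.
  for (; i < n; ++i) {
    sum += a[i] * b[i];
  }
  return sum;
}
```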
There are inductor issues lower in the stack right now, but https://github.com/pytorch/pytorch/pull/138005 should solve the FP16 portion of this when it's ready, and BF16 support will be a follow-up.
https://github.com/pytorch/pytorch/pull/139220 was merged last week, so the only thing left should be to update the PyTorch pin. I didn't realize this got closed because a commit mentioned it; reopening until that's verified.
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #10491 * #10490 * __->__ #10489
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #10491
The internal diff number for the size check on this stack is D73691545.