Scott Wolchok

Results: 66 comments of Scott Wolchok

@pytorchbot merge -f "told OK to bypass by @atalman"

> @swolchok has imported this pull request. If you are a Meta employee, you can view this diff [on Phabricator](https://www.internalfb.com/diff/D56473915).

The import failed due to conflicts.

FP16 is disproportionately slow on x86 as well; a similar approach should improve performance there.

> FP16 on x86

Concretely, for stories110M (llama3.2-1b took longer than I was willing to wait for fp16):

```
fp32:
Average tokens/sec (total): 65.81
Average tokens/sec (first token): 27.89
Average...
```

I've started work on generalizing the ARM fp16/bf16 gemv fast path code to use at::vec::Vectorized, which will lead to generalizing it to x86 and using it over MKL when cpuinfo...

There are inductor issues lower in the stack right now, but https://github.com/pytorch/pytorch/pull/138005 should solve the FP16 portion of this when it's ready, and BF16 is a matter of follow-up.

https://github.com/pytorch/pytorch/pull/139220 was merged last week, so the only thing left should be to update the pytorch pin. I didn't realize this got closed because a commit mentioned it; reopening until verified.

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #10491 * #10490 * __->__ #10489

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #10491