Scott Wolchok
@pytorchbot merge -f "told OK to bypass by @atalman "
> @swolchok has imported this pull request. If you are a Meta employee, you can view this diff [on Phabricator](https://www.internalfb.com/diff/D56473915).

The import failed due to conflicts.
FP16 is disproportionately slow on x86 as well; a similar approach should improve performance there.
> FP16 on x86

Concretely, for stories110M (llama3.2-1b took longer than I was willing to wait for fp16):

```
fp32:
Average tokens/sec (total): 65.81
Average tokens/sec (first token): 27.89
Average...
```
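For context, a minimal scalar sketch of the idea behind the fast path (not the actual kernel; assumes a compiler with `_Float16` support, e.g. GCC 12+/Clang 15+ on x86-64): widen each fp16 operand to fp32 once and do all the arithmetic in fp32, instead of paying for fp16 arithmetic emulation on every multiply-add.

```cpp
#include <cstddef>

// Sketch only: y[i] = dot(A[i, :], x) for an m-by-n row-major fp16
// matrix A and fp16 vector x, accumulating in fp32. The real fast
// path is vectorized; this shows just the widen-then-accumulate idea.
void gemv_fp16_fp32acc(const _Float16* A, const _Float16* x,
                       float* y, std::size_t m, std::size_t n) {
  for (std::size_t i = 0; i < m; ++i) {
    float acc = 0.0f;  // fp32 accumulator: faster and more accurate
    const _Float16* row = A + i * n;
    for (std::size_t j = 0; j < n; ++j) {
      acc += static_cast<float>(row[j]) * static_cast<float>(x[j]);
    }
    y[i] = acc;
  }
}
```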
I've started work on generalizing the ARM fp16/bf16 gemv fast path code to use at::vec::Vectorized, which will lead to generalizing it to x86 and using it over MKL when cpuinfo...
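Roughly, the goal is an inner loop written against `at::vec::Vectorized` so one source file compiles to NEON on ARM and AVX2/AVX-512 on x86. A hedged sketch of what that looks like (the inputs here are already widened to fp32 to keep it short; the real kernels also handle the fp16/bf16-to-fp32 conversion and use per-architecture reductions):

```cpp
#include <ATen/cpu/vec/vec.h>
#include <cstdint>

// Sketch only: architecture-neutral fp32 dot product on top of
// at::vec::Vectorized. Not the actual PyTorch kernel.
float vec_dot_fp32(const float* a, const float* b, int64_t n) {
  using Vec = at::vec::Vectorized<float>;
  Vec acc(0.0f);
  int64_t i = 0;
  for (; i + Vec::size() <= n; i += Vec::size()) {
    acc = at::vec::fmadd(Vec::loadu(a + i), Vec::loadu(b + i), acc);
  }
  // Horizontal sum of the accumulator lanes.
  float lanes[Vec::size()];
  acc.store(lanes);
  float sum = 0.0f;
  for (int64_t k = 0; k < Vec::size(); ++k) {
    sum += lanes[k];
  }
  // Scalar tail for the remainder.
  for (; i < n; ++i) {
    sum += a[i] * b[i];
  }
  return sum;
}
```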
There are inductor issues lower in the stack right now, but https://github.com/pytorch/pytorch/pull/138005 should solve the FP16 portion of this when it's ready, and BF16 support will be a follow-up.
https://github.com/pytorch/pytorch/pull/139220 was merged last week, so the only thing left should be to update the PyTorch pin. I didn't realize this got closed because a commit mentioned it; reopening until that's verified.
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #10491 * #10490 * __->__ #10489
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #10491
The internal diff number for the size check on this stack is D73691545.