Benoit Jacob comments

Results: 119 comments by Benoit Jacob

Move DecomposeSoftmax to GlobalOptimization.

Right. Besides the alternative you lay out here between two possibilities: 1. Rematerialization, as currently done on GPU, which is unpalatable on CPU due to the 2x exp cost. 2....
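To make the 2x exp cost concrete, here is a minimal Python sketch of the two ways a decomposed softmax can handle the exp results. It is not IREE code, and the function names are hypothetical. The first version keeps the exp results in a temporary buffer, which stands in for the extra global allocation. The second recomputes them in the normalization loop. A counting wrapper around `math.exp` shows the rematerialized version evaluating exp twice per element.

```python
import math

class CountingExp:
    """Wraps math.exp and counts how many times it is evaluated."""
    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return math.exp(x)

def softmax_materialize(row, exp):
    m = max(row)                         # max, for numerical stability
    e = [exp(x - m) for x in row]        # exp once, stored in a temp buffer
    s = sum(e)
    return [v / s for v in e]            # normalize by reading the buffer

def softmax_rematerialize(row, exp):
    m = max(row)
    s = sum(exp(x - m) for x in row)     # exp only feeds the sum
    return [exp(x - m) / s for x in row] # exp recomputed instead of stored

row = [1.0, 2.0, 3.0, 4.0]
stored, remat = CountingExp(), CountingExp()
p = softmax_materialize(row, stored)
q = softmax_rematerialize(row, remat)
print(stored.calls, remat.calls)  # 4 8
```

Both produce the same probabilities. The trade is memory (one buffer of `len(row)` values) against compute (one extra exp per element).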

Move DecomposeSoftmax to GlobalOptimization.

Yup, ~30 was the guesstimate I was about to give before you edited :-) So we agree about how much (or how little, depending on how you view it)...

Move DecomposeSoftmax to GlobalOptimization.

With all that said, though, since softmax rarely dominates e2e profiles, if it comes down to just the above https://github.com/iree-org/iree/issues/17469#issuecomment-2125388885 alternative, I think I'd still prefer rematerialization (and pay the...

Move DecomposeSoftmax to GlobalOptimization.

1D softmax is too small (and sequential) to be usefully distributed to multiple threads. N-D softmax has those N-1 parallel dimensions that work well for distribution, and then each thread...
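A minimal sketch of that distribution scheme, assuming a 2-D input with the reduction over the last axis (higher-rank inputs can be flattened to rows the same way). The names are hypothetical, and with CPython's GIL these threads illustrate the work decomposition rather than give a real speedup.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def softmax_1d(row):
    # Inside one row the max -> exp -> sum -> normalize chain is sequential.
    m = max(row)
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [v / s for v in e]

def softmax_rows_parallel(matrix, workers=4):
    # The leading (parallel) dimension: each row's softmax is independent
    # of every other row, so whole rows are handed out to worker threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(softmax_1d, matrix))  # map preserves row order

x = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
out = softmax_rows_parallel(x, workers=2)
```

With a single row (the 1D case) there is only one unit of work, so there is nothing to hand to a second thread.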

Move DecomposeSoftmax to GlobalOptimization.

Back-of-envelope calculation: if the loop body loads 512bits = 64 bytes and performs 30 AVX-512 instructions on it, issuing on average 1 such instruction per cycle in the loop body...
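The arithmetic above can be spelled out as a small helper. The 64-byte load, 30 instructions, and 1 instruction/cycle come from the comment; the 3 GHz clock is an assumption added only to convert cycles into GB/s, and `loop_throughput` is a hypothetical name.

```python
def loop_throughput(bytes_per_iter, instrs_per_iter, instrs_per_cycle, clock_ghz):
    """Return (bytes/cycle, GB/s) for a loop bounded by instruction issue."""
    cycles_per_iter = instrs_per_iter / instrs_per_cycle
    bytes_per_cycle = bytes_per_iter / cycles_per_iter
    # bytes/cycle * (clock_ghz * 1e9 cycles/s) / (1e9 bytes/GB) = GB/s
    return bytes_per_cycle, bytes_per_cycle * clock_ghz

# 512-bit load (64 bytes = 16 float32), 30 AVX-512 instructions, 1 instr/cycle.
# The 3 GHz clock is an assumed value, not one from the discussion.
bpc, gbps = loop_throughput(64, 30, 1.0, 3.0)
print(f"{bpc:.2f} bytes/cycle, {gbps:.1f} GB/s")  # 2.13 bytes/cycle, 6.4 GB/s
```

In element terms that is 16 float32 values per 30 cycles, about 0.53 elements per cycle per core.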

Move DecomposeSoftmax to GlobalOptimization.

I'm no hardware expert, but looking at exp's implementation as a sequence of instructions, it seems inherently costly, so if a circuit is able to do it all under a...
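To illustrate why exp decomposes into a fairly long instruction sequence, here is a sketch of the common range-reduction-plus-polynomial scheme, written in Python for readability. It uses a plain Taylor polynomial and `math.ldexp`; production vectorized float32 implementations typically use fitted polynomials of lower degree and exponent bit manipulation instead, and this sketch does not handle overflow, underflow, or NaN.

```python
import math

LN2 = math.log(2.0)
COEFFS = [1.0 / math.factorial(k) for k in range(12)]  # 1/k! for k = 0..11

def exp_sketch(x):
    # 1. Range reduction: x = n*ln2 + r with integer n and |r| <= ln2/2.
    n = round(x / LN2)
    r = x - n * LN2
    # 2. Polynomial for e^r via Horner's scheme: one multiply-add per term.
    #    This dependent chain is where most of the instructions go.
    p = COEFFS[-1]
    for c in reversed(COEFFS[:-1]):
        p = p * r + c
    # 3. Reconstruction: e^x = 2^n * e^r, applied by adjusting the exponent.
    return math.ldexp(p, n)

for x in (-5.0, 0.5, 10.0):
    print(x, exp_sketch(x), math.exp(x))
```

Counting the steps gives roughly a dozen multiply-adds plus the range reduction and reconstruction, each depending on the previous one.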

Move DecomposeSoftmax to GlobalOptimization.

Note: x87 used to have single-instruction FSIN, FCOS and FSINCOS. But I checked, and somehow it didn't have FEXP. Crazy! I guess that drawing perfect ellipses in early 2D graphics was...

Move DecomposeSoftmax to GlobalOptimization.

> So I'm curious if we know for certain that rematerializing 20-30 ALU ops is always going to be a significant loss over the two dispatches and a global allocation....
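The question can be framed as a toy cost model comparing what each strategy adds on top of the shared work. This is a sketch only: every parameter is an assumption supplied by the caller, none are measured IREE numbers, and the function name is hypothetical.

```python
def extra_cycles(n_elems, exp_instrs, vector_width, instrs_per_cycle,
                 elem_bytes, mem_bytes_per_cycle, dispatch_overhead_cycles):
    """Toy model: extra cycles added by each softmax strategy.

    Returns (rematerialize, materialize). All inputs are caller assumptions.
    """
    # Rematerialization: one extra vectorized exp per vector of elements.
    remat = (n_elems / vector_width) * exp_instrs / instrs_per_cycle
    # Materialization: the first dispatch writes the intermediate to a global
    # buffer and the second reads it back (2x the bytes), plus fixed overhead.
    materialize = (2 * n_elems * elem_bytes / mem_bytes_per_cycle
                   + dispatch_overhead_cycles)
    return remat, materialize

# Hypothetical inputs: 4096 float32 elements, a 30-instruction exp on
# 16-wide vectors at 1 instruction/cycle, 16 bytes/cycle of bandwidth,
# and zero fixed dispatch overhead.
remat, mat = extra_cycles(4096, 30, 16, 1.0, 4, 16.0, 0.0)
```

Which term dominates depends entirely on the inputs; for example, whether the intermediate buffer stays in cache changes `mem_bytes_per_cycle` by a large factor, and the fixed per-dispatch overhead matters most for small sizes.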

Move DecomposeSoftmax to GlobalOptimization.

@benvanik, I mulled over a way to summarize some of the above discussion as a table. My high-level point here is that the 2 in the first row is much smaller...

Data tiling: transpose narrow-N into narrow-M

@lialan, this diff fixes the issue we were seeing on riscv CI, which was really an issue about properly handling the case where encoding materialization fails. The issue was that we...
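The title "transpose narrow-N into narrow-M" rests on the identity (AB)^T = B^T A^T: a matmul whose N dimension is narrow (e.g. a matrix-vector product) becomes one whose M dimension is narrow once both operands are transposed and swapped. A minimal Python sketch of that identity follows; how IREE's data-tiling encodings apply it is not shown here, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Naive matmul on lists of rows: (M x K) @ (K x N) -> (M x N)."""
    k_dim, n_dim = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(k_dim)) for j in range(n_dim)]
            for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

# Narrow-N case: A is 3 x 4 and B is 4 x 1 (N = 1, a matrix-vector product).
A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
B = [[1], [0], [-1], [2]]
C = matmul(A, B)                         # 3 x 1: narrow N
Ct = matmul(transpose(B), transpose(A))  # 1 x 3: narrow M
assert transpose(C) == Ct
print(Ct)  # [[6, 14, 22]]
```

Transposing lets a backend that only has good kernels or tile layouts for the narrow-M shape reuse them for the narrow-N case.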