Benoit Jacob
Right. Besides the alternative you lay out here between two possibilities, 1. Rematerialization, as currently done on GPU, which is unpalatable on CPU due to the 2x exp cost. 2....
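To make the "2x exp cost" of rematerialization concrete, here is a minimal sketch in plain Python. It is an illustration only, not IREE's codegen: the function name `softmax_rematerialized` and the call counter are hypothetical. It shows that recomputing `exp` in the second pass, instead of reading it back from a temporary buffer, doubles the number of `exp` evaluations.

```python
import math


def softmax_rematerialized(xs):
    """Softmax over a 1-D list, recomputing exp in the second pass.

    Returns (result, number_of_exp_calls). Hypothetical helper for
    illustration; not taken from any real library.
    """
    exp_calls = 0
    m = max(xs)  # subtract the max for numerical stability

    # Pass 1: accumulate the sum of exponentials without storing them.
    total = 0.0
    for x in xs:
        total += math.exp(x - m)
        exp_calls += 1

    # Pass 2: recompute each exponential instead of loading it from a buffer.
    out = []
    for x in xs:
        out.append(math.exp(x - m) / total)
        exp_calls += 1

    return out, exp_calls
```

For an input of n elements the counter ends at 2n, which is the extra cost being weighed in this thread.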
Yup, ~ 30 was the guesstimate I was about to say before you edited :-) So we agree about how much (or how little, depending on how you view it)...
With all that said, though, since softmax rarely dominates e2e profiles, if it comes down to just the above https://github.com/iree-org/iree/issues/17469#issuecomment-2125388885 alternative, I think I'd still prefer rematerialization (and pay the...
1D softmax is too small (and sequential) to be usefully distributed to multiple threads. N-D softmax has those N-1 parallel dimensions that work well for distribution, and then each thread...
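A rough sketch of that distribution scheme, assuming the N-1 parallel dimensions have been flattened into an outer list of rows and the softmax runs along the last axis. The names `softmax_row` and `softmax_last_axis` are hypothetical. The point is only the work partitioning: each worker handles whole rows sequentially. CPython threads will not give a real speedup here because of the GIL.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def softmax_row(row):
    """Sequential 1-D softmax over one row (the reduction dimension)."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_last_axis(rows, max_workers=4):
    """Distribute independent rows across threads.

    `rows` is a list of lists: the parallel dimensions flattened into
    the outer list, the softmax dimension as the inner list.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the input order of the rows.
        return list(pool.map(softmax_row, rows))
```

Each row is an independent unit of work, so no synchronization is needed between workers.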
Back-of-envelope calculation: if the loop body loads 512 bits = 64 bytes and performs 30 AVX-512 instructions on it, issuing on average 1 such instruction per cycle in the loop body...
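The arithmetic behind that estimate can be sketched as follows. The 64 bytes, 30 instructions, and 1 instruction per cycle come from the comment above. The clock frequency is a free parameter and any value passed in is an assumption, not a figure from the thread. The comment's own conclusion is truncated, so this only computes the per-core load rate and does not attempt to reproduce it.

```python
def bytes_per_cycle(bytes_per_iter=64, instrs_per_iter=30, instrs_per_cycle=1.0):
    """Bytes loaded per cycle by one core running the loop body."""
    cycles_per_iter = instrs_per_iter / instrs_per_cycle
    return bytes_per_iter / cycles_per_iter


def bandwidth_gb_per_s(clock_ghz, **kwargs):
    """Per-core load bandwidth in GB/s at an assumed clock frequency.

    bytes/cycle times 1e9 cycles/s per GHz gives 1e9 bytes/s, i.e. GB/s.
    """
    return bytes_per_cycle(**kwargs) * clock_ghz
```

With the defaults, `bytes_per_cycle()` is 64/30, a little over 2 bytes per cycle per core. At a hypothetical 3 GHz clock that is about 6.4 GB/s per core.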
I'm no hardware expert, but looking at exp's implementation as a sequence of instructions, it seems inherently costly, so if a circuit is able to do it all under a...
Note: x87 used to have single-instruction FSIN, FCOS and FSINCOS. But, I checked, somehow it didn't have FEXP. Crazy! I guess that drawing perfect ellipses in early 2D graphics was...
> So I'm curious if we know for certain that rematerializing 20-30 ALU ops is always going to be a significant loss over the two dispatches and a global allocation....
@benvanik, I mulled a way to summarize some of the above discussion as a table. My high-level point here is that the 2 in the first row is much smaller...
@lialan, this diff fixes the issue we were seeing on riscv CI, really an issue about properly handling the case where encoding materialization fails. The issue was that we...