Mathieu Poumeyrol
Mmm... you're right, it's a bit more complicated than BLAS xDOT, because there is an indirection. The idea here is that `inner_loop` is called for each value to compute in...
> Will try the first option.
>
> To clarify though, offset_pairs would be a 2D size_t array right? And you'd need to pass input_center_offset, right?

offset_pairs should match what...
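To make the indirection concrete, here is a minimal sketch of a dot product that goes through an offset table instead of walking two contiguous arrays. The signature, the tuple layout of `offset_pairs`, and the signed input offsets are assumptions made for illustration; this is not tract's actual `inner_loop`.

```rust
/// Hypothetical inner loop: one output value computed as a dot product
/// where both operands are reached through a table of offsets.
/// Each pair is (offset into `kernel`, signed offset into `input`
/// relative to `input_center_offset`). This layout is an assumption.
fn inner_loop(
    kernel: &[f32],
    input: &[f32],
    input_center_offset: usize,
    offset_pairs: &[(usize, isize)],
) -> f32 {
    let mut sum = 0.0f32;
    for &(kernel_offset, input_offset) in offset_pairs {
        // The indirection: the input index is not sequential, it is
        // looked up relative to the current center. An out-of-range
        // index panics on the slice access below.
        let idx = (input_center_offset as isize + input_offset) as usize;
        sum += kernel[kernel_offset] * input[idx];
    }
    sum
}
```

With a plain xDOT both operands advance by a fixed stride; here each multiply needs a table lookup first, which is what makes the kernel harder to vectorize.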
> Ah, one other quick thing: for fp16, it looks like the dummy fmla file for pragma testing should specify fullfp16 for fp16 arithmetic support (fp16 on its own apparently...
Well done. Benched it already? Indeed, nightly is a no-go. Unless I'm mistaken, our example is pretty simple here, the only selection parameter being the type. So instead of...
Mmm... I don't think the dispatch_* macros will help you there: they are useful when you find yourself in some code which is abstract over Tensor and DatumType, not type-parametric,...
I'm surprised by the need to multiply by sizeof here: https://github.com/sonos/tract/compare/main...VariantXYZ:move_dot_prod#diff-9fd28fecd20773f8aeba8df47573913c130ce5dce2179af59463a1f57558ca4eR14 Maybe the source of confusion here is about add and offset semantics around Rust pointers: they are doing the...
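For reference, `add` and `offset` on Rust raw pointers count in elements of `T`, not in bytes, so no manual multiplication by the element size is needed. A minimal sketch (the function names `nth_f32` and `byte_distance` are made up for illustration):

```rust
/// Reads the element `n` positions after the start of `data`.
/// `ptr.add(n)` advances by `n * size_of::<f32>()` bytes, so `n` is an
/// element count, not a byte count.
fn nth_f32(data: &[f32], n: usize) -> f32 {
    assert!(n < data.len());
    // SAFETY: `n` was checked to be in bounds just above.
    unsafe { *data.as_ptr().add(n) }
}

/// Returns how many bytes `ptr.add(n)` actually moved the pointer.
fn byte_distance(data: &[f32], n: usize) -> usize {
    assert!(n < data.len());
    let base = data.as_ptr();
    // SAFETY: `n` is in bounds, so the result stays inside the slice.
    let moved = unsafe { base.add(n) };
    moved as usize - base as usize
}
```

Multiplying the index by the element size before calling `add` would therefore step four times too far for `f32`.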
About inline(never), I use it sometimes because in some situations the compiler makes the wrong choices about which variables to keep in registers and which to put on the stack. Separating the function...
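As an illustration of the pattern, the hot loop can be isolated in its own function marked `#[inline(never)]`, so that register allocation for the loop is decided separately from the surrounding code. This is only a sketch with hypothetical names (`dot_kernel`, `dot_checked`); whether it helps depends on the target and compiler version, so it should be benchmarked.

```rust
/// Hot loop kept out of line: the compiler allocates registers for
/// this body on its own, without the caller's live variables.
#[inline(never)]
fn dot_kernel(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Caller doing the bookkeeping (here just a length check) before
/// handing over to the isolated kernel.
fn dot_checked(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    Some(dot_kernel(a, b))
}
```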
Yeah, it feels a bit like a dark art. The Cortex-A57 is an out-of-order chip, so the "_gen" variants will probably operate pretty well. There is a simpler test you...
> 32x9x96 (3% of total time) uses 8x8, but 16x4 is ~3% better (not an issue, it’s very minor)

That one honestly baffles me. I've seen it before, so you...