carbotaniuman
I have a commit for this along with the other half-related changes in #1710.
I expect #1903 will fix this.
I think it would be slightly better for memory access patterns if we did a shuffle instead of changing the index, but that is probably worse if we need to do...
I've finished testing this on CUDA + host on both SSCP and SMCP, using the in-place changes. I've also chosen a new implementation option that hopefully is more maintainable and...