Charlie Durham
@Junkai-Wu Thanks! I think you answered my question. Just wanted to hear that the permutation was enough to explain how the cutlass kernel maintained a consistent fp error. (my version...
@llukas They look better! However, it would be really great if the fused matrix multiply examples were set up for multiple blocks. I don't really understand why any of the...
I see. Now I'm reading the PTX doc diagrams and it makes sense. I didn't realize the output was N-major for these instructions also. So expecting M and N major...
@hwu36 Thanks that makes sense to me now. I changed the swizzle to and I'm down to ~6 conflicts per block now. What looks wrong about the layout I have?...
Ok, I roughly followed an f16 example from you guys that had a nested shape to compose the swizzle with, and I did ``` auto swizzle_atom = composition( Swizzle{},...
@cceka thanks for the kind words! I see, that's a shame. A full worked cute example is so nice and hackable off the shelf. Maybe the RTX 5 series going...
I'm not one of the cutlass/cute devs but if you take something like the sgemm_sm80.cu you can look at the sizes of the modes of the tCrA, tCrB, tCrC fragments...
Ohhh, I'm sorry. I guess I projected my use cases onto yours. I had a similar problem to yours and I got it going by recruiting that cute example. I'm...