Amin Sedaghat

Results 3 issues of Amin Sedaghat

Add NCHW BatchNorm forward (two-pass; fp32 accum). API: `triton_kernels.batchnorm_forward(...)` → `(y, mean, var)`.
- Tests: 21 pass vs PyTorch across 2D/4D, fp32/fp16/bf16 (tols fp32 1e-5/1e-6; half 3e-2/3e-3).
- Perf (RTX...
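As a rough illustration of the two-pass scheme named above, here is a minimal pure-Python reference, not the Triton kernel from the PR. The function name `batchnorm_forward_ref`, the `eps` default, the use of the biased variance, and the omission of any scale/shift parameters are all assumptions made for this sketch. Python floats are 64-bit, so the fp32 accumulation itself is not modelled.

```python
import math

def batchnorm_forward_ref(x, eps=1e-5):
    """Hypothetical reference: x is a nested list shaped [N][C][H][W].

    Returns (y, mean, var) with per-channel statistics taken over N, H, W.
    Assumes biased variance (divide by the element count) and no affine step.
    """
    n, c = len(x), len(x[0])
    h, w = len(x[0][0]), len(x[0][0][0])
    count = n * h * w
    mean = [0.0] * c
    var = [0.0] * c

    for ch in range(c):
        # Pass 1: accumulate the sum to get the channel mean.
        total = 0.0
        for i in range(n):
            for row in x[i][ch]:
                for v in row:
                    total += v
        mean[ch] = total / count

        # Pass 2: accumulate squared deviations from that mean.
        sq = 0.0
        for i in range(n):
            for row in x[i][ch]:
                for v in row:
                    d = v - mean[ch]
                    sq += d * d
        var[ch] = sq / count

    # Normalize every element with its channel's statistics.
    y = [
        [
            [
                [(v - mean[ch]) / math.sqrt(var[ch] + eps) for v in row]
                for row in x[i][ch]
            ]
            for ch in range(c)
        ]
        for i in range(n)
    ]
    return y, mean, var
```

Computing the variance in a second pass over deviations from the mean avoids the cancellation that a single-pass `E[x^2] - E[x]^2` formula can suffer, which is presumably why a two-pass layout pairs well with a wider accumulator for half-precision inputs.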

## Summary
- keep `ThreadReduce` accumulator types pinned to the block value type across `BlockScan` and `BlockReduce`
- apply the same accumulator fix to the raking specialization so all paths...

Implement a type-safe `cuda::std::ffs` function as a replacement for the `__ffs` intrinsic.
* Returns the 1-based index of the first set bit (0 if no bits are set)
* Works on all platforms and integer types...
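The return convention described above (1-based position of the least significant set bit, 0 when no bit is set) can be sketched in a few lines of Python. This is only a model of the semantics, not the `cuda::std::ffs` implementation. The `bits` parameter is an assumption introduced here to mimic a fixed-width integer type.

```python
def ffs(x: int, bits: int = 32) -> int:
    """Model of find-first-set: 1-based index of the lowest set bit, 0 if none.

    `bits` is a hypothetical parameter standing in for the integer width.
    """
    # Reinterpret x as an unsigned value of the given width, so negative
    # inputs behave like their two's-complement bit patterns.
    x &= (1 << bits) - 1
    if x == 0:
        return 0
    # x & -x isolates the lowest set bit; its bit_length is the 1-based index.
    return (x & -x).bit_length()
```

For example, `ffs(0)` returns 0, `ffs(1)` returns 1, and `ffs(8)` returns 4.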