Liger-Kernel
add batch_norm op with test and benchmark
Summary
Implemented a 2D batch normalization Triton operator, successfully ran the corresponding tests and benchmarks, and visualized the performance tests for speed and memory.
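For reviewers who want a ground truth to compare against, here is a minimal pure-Python sketch of what a 2D batch normalization computes in training mode: per-channel mean and biased variance over the batch and spatial axes, followed by a scale and shift. This is not the Triton kernel from this PR; the function name `batch_norm_2d_reference` and the nested-list layout are assumptions made for illustration only.

```python
import math


def batch_norm_2d_reference(x, gamma, beta, eps=1e-5):
    """Hypothetical reference: training-mode batch norm over nested lists.

    x has shape (N, C, H, W); gamma and beta have length C.
    Statistics are computed per channel over the N, H, W axes,
    using the biased variance (divide by the element count).
    """
    n, c = len(x), len(x[0])
    h, w = len(x[0][0]), len(x[0][0][0])
    count = n * h * w
    out = [[[[0.0] * w for _ in range(h)] for _ in range(c)] for _ in range(n)]

    for ch in range(c):
        # Gather every element that belongs to this channel.
        values = [
            x[i][ch][r][col]
            for i in range(n)
            for r in range(h)
            for col in range(w)
        ]
        mean = sum(values) / count
        var = sum((v - mean) ** 2 for v in values) / count
        inv_std = 1.0 / math.sqrt(var + eps)

        # Normalize, then apply the per-channel affine transform.
        for i in range(n):
            for r in range(h):
                for col in range(w):
                    normalized = (x[i][ch][r][col] - mean) * inv_std
                    out[i][ch][r][col] = normalized * gamma[ch] + beta[ch]
    return out
```

A loop-based reference like this is far too slow for benchmarking, but it can serve as a small-input sanity check alongside the framework implementation.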
Testing Done
- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
the visualization of performance:
From the benchmark results, it looks like the Triton impl is slower than the original HF one? 👀
It seems so. The memory usage is about the same, but the speed is a bit slower. Do you have any suggestions for optimizing it?