tract
Depthwise conv inner loop f16 support
https://github.com/Rikorose/DeepFilterNet/pull/211#issuecomment-1353637586
I dug into why I was seeing so many f32/f16 conversions even though the A55 supports FP16 storage and arithmetic, and it seems this is simply a limitation of Rust's f16 support.
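As far as I can tell, the conversions come from how the half crate implements arithmetic on its f16 type: each operator widens both operands to f32, computes, and narrows the result back to f16. The following is a toy model of that pattern, not half itself. `SoftF16` is a hypothetical stand-in that keeps an f32 inside and only counts conversions (it does not reproduce f16 rounding), and `depthwise_tap_sum` is a hypothetical depthwise-conv inner loop. With `acc = acc + x * w`, every kernel tap costs six conversions.

```rust
use std::cell::Cell;
use std::ops::{Add, Mul};

thread_local! {
    // Counts every f16 <-> f32 conversion performed by the toy type below.
    static CONVERSIONS: Cell<usize> = Cell::new(0);
}

/// Hypothetical stand-in for a software f16 type. It stores an f32 and does
/// NOT reproduce f16 rounding; it only records where conversions would happen.
#[derive(Clone, Copy, Debug, PartialEq)]
struct SoftF16(f32);

impl SoftF16 {
    fn to_f32(self) -> f32 {
        CONVERSIONS.with(|c| c.set(c.get() + 1));
        self.0
    }
    fn from_f32(x: f32) -> Self {
        CONVERSIONS.with(|c| c.set(c.get() + 1));
        SoftF16(x)
    }
}

// Widen-compute-narrow: 2 widenings + 1 narrowing per operator call.
impl Add for SoftF16 {
    type Output = SoftF16;
    fn add(self, rhs: SoftF16) -> SoftF16 {
        SoftF16::from_f32(self.to_f32() + rhs.to_f32())
    }
}

impl Mul for SoftF16 {
    type Output = SoftF16;
    fn mul(self, rhs: SoftF16) -> SoftF16 {
        SoftF16::from_f32(self.to_f32() * rhs.to_f32())
    }
}

/// Hypothetical depthwise (per-channel) 1-D conv inner loop: one output
/// sample is the dot product of `kernel` with the window at `start`.
fn depthwise_tap_sum(input: &[SoftF16], kernel: &[SoftF16], start: usize) -> SoftF16 {
    let mut acc = SoftF16(0.0); // built directly, so not counted
    for (k, &w) in kernel.iter().enumerate() {
        acc = acc + input[start + k] * w; // Mul (3) + Add (3) = 6 conversions
    }
    acc
}

fn main() {
    let input: Vec<SoftF16> = [1.0, 2.0, 3.0, 4.0].iter().map(|&x| SoftF16(x)).collect();
    let kernel: Vec<SoftF16> = [0.5, 0.25, 0.125].iter().map(|&x| SoftF16(x)).collect();
    CONVERSIONS.with(|c| c.set(0));
    let y = depthwise_tap_sum(&input, &kernel, 0);
    let n = CONVERSIONS.with(|c| c.get());
    println!("y = {}, conversions = {}", y.0, n); // y = 1.375, conversions = 18
}
```

For a 3-tap kernel that is 18 conversions for a single output sample, and the count grows linearly with kernel length and output size.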
To take full advantage of FP16, I think these conversions have to be avoided, though I'm not sure what the best solution is.
One option might be to rewrite the inner loop in f16 assembly and use it only when the CPU reports FP16 support.
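A rough sketch of the runtime-dispatch part of that idea, assuming the f16 kernel itself is written separately. `std::arch::is_aarch64_feature_detected!("fp16")` checks at runtime whether the CPU reports FP16 arithmetic, and `select_kernel` picks a function pointer once. `tap_sum_f16_asm` is a hypothetical placeholder that just calls the scalar fallback; to keep the sketch short, both variants take f32 slices, whereas a real f16 kernel would read f16 storage directly.

```rust
/// Signature shared by every kernel variant: one output sample of a
/// depthwise (per-channel) 1-D convolution starting at `start`.
type TapSumFn = fn(&[f32], &[f32], usize) -> f32;

/// Portable fallback that runs on any CPU.
fn tap_sum_scalar(input: &[f32], kernel: &[f32], start: usize) -> f32 {
    kernel
        .iter()
        .zip(&input[start..])
        .map(|(w, x)| w * x)
        .sum()
}

/// Hypothetical placeholder for a hand-written FP16 kernel (e.g. inline asm).
/// It only exists on aarch64 and currently just calls the fallback.
#[cfg(target_arch = "aarch64")]
fn tap_sum_f16_asm(input: &[f32], kernel: &[f32], start: usize) -> f32 {
    tap_sum_scalar(input, kernel, start)
}

/// Choose a kernel once, based on what the running CPU reports.
#[cfg(target_arch = "aarch64")]
fn select_kernel() -> TapSumFn {
    if std::arch::is_aarch64_feature_detected!("fp16") {
        tap_sum_f16_asm
    } else {
        tap_sum_scalar
    }
}

/// Other architectures have no FP16 path here, so always use the fallback.
#[cfg(not(target_arch = "aarch64"))]
fn select_kernel() -> TapSumFn {
    tap_sum_scalar
}

fn main() {
    let input = [1.0f32, 2.0, 3.0, 4.0];
    let kernel = [0.5f32, 0.25, 0.125];
    let tap_sum = select_kernel(); // resolved once, outside the hot loop
    for start in 0..=input.len() - kernel.len() {
        println!("y[{}] = {}", start, tap_sum(&input, &kernel, start));
    }
}
```

Selecting the kernel once, rather than checking the feature inside the loop, keeps detection out of the hot path.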
Overriding the operators in the half crate might also work.
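On the operator idea: Rust's coherence rules don't let a downstream crate replace the `Add`/`Mul` impls that half already provides for its f16 type, so "overriding" would mean forking half or wrapping the type. A minimal sketch of the wrapper route, assuming a hypothetical `F16` newtype over raw binary16 bits whose `Mul` returns a hypothetical f32 accumulator `Acc`, so `acc + x * w` never narrows back to f16. This cuts the per-tap conversions to two widenings but still does the arithmetic in f32; using the A55's native FP16 arithmetic would still need something like the assembly route.

```rust
use std::ops::{Add, Mul};

/// Hypothetical newtype over raw IEEE 754 binary16 bits.
#[derive(Clone, Copy, Debug)]
struct F16(u16);

/// Hypothetical f32 accumulator produced by `F16 * F16`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Acc(f32);

impl F16 {
    /// Exact binary16 -> binary32 widening (every f16 value fits in f32).
    fn to_f32(self) -> f32 {
        let h = self.0 as u32;
        let sign = (h & 0x8000) << 16;
        let exp = (h >> 10) & 0x1f;
        let man = h & 0x3ff;
        match exp {
            // Zero or subnormal: value = man * 2^-24.
            0 => {
                let v = man as f32 / 16_777_216.0;
                if sign != 0 { -v } else { v }
            }
            // Infinity or NaN: keep the payload bits.
            0x1f => f32::from_bits(sign | 0x7f80_0000 | (man << 13)),
            // Normal: rebias the exponent from 15 to 127 (a shift of 112).
            _ => f32::from_bits(sign | ((exp + 112) << 23) | (man << 13)),
        }
    }
}

// F16 * F16 widens each operand once and returns an f32 accumulator,
// so the product is never narrowed back to f16.
impl Mul for F16 {
    type Output = Acc;
    fn mul(self, rhs: F16) -> Acc {
        Acc(self.to_f32() * rhs.to_f32())
    }
}

// Acc + Acc stays in f32.
impl Add for Acc {
    type Output = Acc;
    fn add(self, rhs: Acc) -> Acc {
        Acc(self.0 + rhs.0)
    }
}

/// Hypothetical depthwise inner loop: 2 widenings and 0 narrowings per tap.
/// Narrowing the final result back to f16 is left to the caller.
fn depthwise_tap_sum(input: &[F16], kernel: &[F16], start: usize) -> Acc {
    let mut acc = Acc(0.0);
    for (k, &w) in kernel.iter().enumerate() {
        acc = acc + input[start + k] * w;
    }
    acc
}

fn main() {
    // 1.0, 2.0, 3.0, 4.0 and 0.5, 0.25, 0.125 as binary16 bit patterns.
    let input = [F16(0x3C00), F16(0x4000), F16(0x4200), F16(0x4400)];
    let kernel = [F16(0x3800), F16(0x3400), F16(0x3000)];
    println!("{:?}", depthwise_tap_sum(&input, &kernel, 0)); // Acc(1.375)
}
```

Because `Mul` no longer returns `F16` and there is no `Add` for `F16`, the loop only type-checks with an `Acc` accumulator, which turns accidental narrowing inside the loop into a compile error rather than a hidden cost.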