Yukun He

Results 4 issues of Yukun He

Unify the two versions of the AllReduce op at the Module and custom op levels.

* Add a specific environment variable to control the logger level of AutoTuner.
* Add statistics to track the total profiling time for each op. This will help determine the...
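The two changes in this entry can be illustrated with a minimal Python sketch. It is not the project's actual implementation: the variable name `DEMO_AUTOTUNER_LOG_LEVEL`, the logger name, and the `ProfilingStats` helper are all hypothetical, since the excerpt names neither the environment variable nor the statistics API.

```python
import logging
import os
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical variable name, chosen for this sketch only.
LOG_LEVEL_ENV = "DEMO_AUTOTUNER_LOG_LEVEL"


def make_tuner_logger() -> logging.Logger:
    """Build a logger whose level is read from the environment (default: WARNING)."""
    name = os.environ.get(LOG_LEVEL_ENV, "WARNING").upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        # Unknown level names fall back to WARNING instead of raising.
        level = logging.WARNING
    logger = logging.getLogger("demo_autotuner")
    logger.setLevel(level)
    return logger


class ProfilingStats:
    """Accumulate total profiling time and run count per op (hypothetical helper)."""

    def __init__(self):
        self.total_seconds = defaultdict(float)
        self.calls = defaultdict(int)

    @contextmanager
    def track(self, op_name):
        # Time the enclosed block and add it to the running total for this op.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.total_seconds[op_name] += time.perf_counter() - start
            self.calls[op_name] += 1

    def report(self, logger):
        # One line per op, emitted at INFO level.
        for op, total in sorted(self.total_seconds.items()):
            logger.info("%s: %d profiling runs, %.6f s total",
                        op, self.calls[op], total)
```

In this sketch, a tuning loop would wrap each profiling run in `with stats.track("some_op"):` and call `stats.report(logger)` once at the end, so the per-op totals appear only when the environment variable selects `INFO` or a more verbose level.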

We found that release/1.1 also has this issue, which may cause a perf drop.

## Summary by CodeRabbit

* **Refactor**
  * Optimized internal tensor allocation for NVFP4 uint8 operations...

@coderabbitai summary

## Description

## Test Coverage

## PR Checklist

Please review the following before submitting your PR:

- PR description clearly explains what and why. If using CodeRabbit's summary,...