
FigConvNet performance improvements


PhysicsNeMo Pull Request

Description

While investigating how to make FigConvNet and DoMINO domain parallel, I ran a profile of FigConvNet and found some low-hanging fruit for performance improvements, which this PR addresses. I'll summarize the changes below, but first some selected results. At batch size 1, we see over a 2x improvement on A100.

And that's a 3x improvement on Hopper.

Batch size 8 cannot fit the larger image size, but for smaller images we see a 2.5x improvement on A100.

And a 2x improvement on Hopper.

The changes:

  • A number of functions, most notably grid_init, were performing data transfers during the forward pass of the network; those transfers have been moved out of the forward pass (a minimal sketch of the pattern follows this list).
  • The layer normalization layers are significantly more efficient when leveraging Transformer Engine; it's now a configurable option in the network (see the second sketch below).
  • Emptying the CUDA cache was a bottleneck in the training script, so it has been removed. This makes memory constraints slightly tighter, but they remain manageable.
  • Finally, the Warp-based radius search introduces an unavoidable sync point. The original implementation synced once per batch item (B times in total); I've refactored it to sync once per batch, which gives a few percent boost at larger batch sizes.
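
As a rough illustration of the first point, here is a minimal sketch of keeping grid data resident on the device rather than rebuilding and copying it inside forward. The module name, resolution, and forward body are placeholders, not the actual FigConvNet code:

```python
import torch
import torch.nn as nn

class GridProjector(nn.Module):
    """Hypothetical module illustrating the pattern, not FigConvNet's grid_init."""

    def __init__(self, resolution=(8, 8, 8)):
        super().__init__()
        axes = [torch.linspace(0.0, 1.0, r) for r in resolution]
        grid = torch.stack(torch.meshgrid(*axes, indexing="ij"), dim=-1)
        # Registered as a buffer: it moves to the GPU once with .to(device),
        # instead of being rebuilt on the host and transferred every forward call.
        self.register_buffer("grid", grid.reshape(-1, 3))

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # Example use: distance of every input point to every grid vertex,
        # computed entirely on-device with no host transfers.
        return torch.cdist(points, self.grid)
```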

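And a minimal sketch of making the Transformer Engine layer norm configurable; the factory function and flag name here are hypothetical, not the actual PhysicsNeMo API:

```python
import torch.nn as nn

def build_layer_norm(hidden_size: int, use_te: bool = True) -> nn.Module:
    """Return a Transformer Engine LayerNorm when requested, else the torch one."""
    if use_te:
        # Requires NVIDIA Transformer Engine to be installed.
        import transformer_engine.pytorch as te
        return te.LayerNorm(hidden_size)
    return nn.LayerNorm(hidden_size)
```

In practice such a flag would be exposed through the model's constructor or config so that existing configs keep the default PyTorch behavior.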
What's left on the table?

  • The model has significant work that is independent (point -> grid, and grid -> point) and could be parallelized across CUDA streams (see the sketch after this list).
    • Because of the blocking CPU transfers in the Warp kernel, we can't get a boost from this yet; it would also require threading.
  • The radius search could be accelerated with stream concurrency too. However, it's part PyTorch, part Warp, and I was seeing race conditions and illegal memory accesses in certain cases. It's not included here, but it's possible in the future.
  • The central part of the network, the grid down/up blocks, suffers from poor GPU occupancy. It could be accelerated with CUDA graphs and kernel fusion, but the organization of the model makes that a challenge: those techniques require pure tensors as inputs and outputs, while these blocks pass Python classes containing tensors.
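
For reference, a hedged sketch of what the stream-level overlap could look like once the blocking transfers are gone; `branch_a` and `branch_b` are stand-ins for the independent point -> grid and grid -> point paths, not the real modules:

```python
import torch

def run_concurrently(branch_a, branch_b, x):
    # Launch two independent branches on separate CUDA streams so their
    # kernels can overlap, then rejoin on the default stream.
    s_a, s_b = torch.cuda.Stream(), torch.cuda.Stream()
    s_a.wait_stream(torch.cuda.current_stream())
    s_b.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s_a):
        out_a = branch_a(x)
    with torch.cuda.stream(s_b):
        out_b = branch_b(x)
    torch.cuda.current_stream().wait_stream(s_a)
    torch.cuda.current_stream().wait_stream(s_b)
    return out_a, out_b
```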

So I stopped there; it's still much better than it was!

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [ ] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.
  • [ ] The CHANGELOG.md is up to date with these changes.
  • [x] An issue is linked to this pull request.

Dependencies
