Unable to get COLMAP MVS above ~25% GPU power usage
Issue
When running patch_match_stereo on an A100 GPU, the GPU "util" fills up, but the power usage is extremely low, see graph below:
This seems to indicate that there are some CUDA kernels in PatchMatch that take a lot of wall-clock time but are not using the GPU cores effectively.
Some fixes I've attempted
- Tried running multiple MVS processes in parallel (the graph above is from such a run). However, I think the 100% util is basically hard-limiting the performance.
- Tried moving the data to fast RAM (hoping it was an I/O issue); basically no difference.
I'm wondering if this is reproducible across systems. Is there any fix?
System
Docker container with: docker://colmap/colmap:20231001.8
It's fine for me that it's a bit inefficient, but the cluster I'm running on automatically kills jobs that drop below 25% power usage, which is quite frustrating, so I'd like to fix this.
I got things to run faster by upping THREADS_PER_BLOCK from 32 to 96 (64 also works fine) on my fork. This gave me about a 2x speedup. However, it's still only using about 30% power, so it feels like there are major bottlenecks left. What is the likely culprit here, @ahojnnes? I can make changes in my fork and report back if you have any hunches.
The algorithm imposes that each row or column of an image uses one CUDA thread. Depending on the size of your images, there will be many more cores than can be occupied by the image. You may be able to get more out of your GPU by simply listing the same GPU index of your A100 multiple times (though I have not tried this myself yet).
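To make the occupancy point above concrete, here is a minimal, self-contained sketch of a launch where each image column maps to one CUDA thread. The kernel name, image size, and per-pixel work are made up for illustration; this is not COLMAP's actual kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sweep kernel: one thread per image column, each thread walking
// its column from top to bottom. This mirrors the mapping described above but
// is NOT COLMAP's actual implementation.
__global__ void SweepColumnsKernel(float* cost, int width, int height) {
  const int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (col >= width) return;
  for (int row = 0; row < height; ++row) {
    cost[row * width + col] += 1.0f;  // placeholder for the real per-pixel work
  }
}

int main() {
  const int width = 2000, height = 1500;  // example image size
  float* cost = nullptr;
  cudaMalloc((void**)&cost, sizeof(float) * width * height);
  cudaMemset(cost, 0, sizeof(float) * width * height);

  const int threads_per_block = 32;
  const int num_blocks = (width + threads_per_block - 1) / threads_per_block;
  // Only `width` (~2000) threads exist in total, far fewer than the number of
  // resident threads an A100 can sustain, so occupancy stays low no matter how
  // long the kernel runs.
  SweepColumnsKernel<<<num_blocks, threads_per_block>>>(cost, width, height);
  cudaDeviceSynchronize();
  cudaFree(cost);
  return 0;
}
```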
@ahojnnes would this be similar to what I said regarding multiple processes in parallel on the same GPU? Because I had almost no success speeding things up with that approach. I will try my luck profiling the code to see what seems to be taking the most time.
If you tried that, it should be mostly the same as my suggestion above. I am not too familiar with the latest GPU architectures. The A100 does have specialized tensor cores for matrix multiplication that cannot be leveraged for patch match stereo, so this may explain the behavior you see.
Since SweepFromTopToBottom takes up basically the entire computation time, it's difficult to tell, haha. Perhaps I can split up the kernel for the purpose of debugging?
Actually, I used the wrong profiler; apparently the one to use is Nsight Compute (not Nsight Systems). I'll try running the compute version and see if I can get a more detailed report. This seems to be the way to do it.
EDIT2: I think this thread might reveal how stupid I am, I really don't know how to code.
> The algorithm imposes that each row or column of an image uses one CUDA thread. Depending on the size of your images, there will be many more cores than can be occupied by the image. You may be able to get more out of your GPU by simply listing the same GPU index of your A100 multiple times (though I have not tried this myself yet).

Basically this seems like the culprit. From talking to people who know more about CUDA than me: even if we don't use all the threads of the GPU, it is unavailable to additional processes (unless using CUDA streams, which apparently are complex).
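For reference, this is roughly what issuing independent work on multiple CUDA streams from a single process looks like; the kernel here is a placeholder and none of this is COLMAP code. Kernels launched on different non-default streams have no ordering constraint between them and may overlap on the same GPU if resources allow.

```cuda
#include <cuda_runtime.h>

// Stand-in kernel representing independent per-image work.
__global__ void DummyWork(float* data, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
  const int kNumStreams = 4;
  const int n = 1 << 20;
  cudaStream_t streams[kNumStreams];
  float* buffers[kNumStreams];

  for (int s = 0; s < kNumStreams; ++s) {
    cudaStreamCreate(&streams[s]);
    cudaMalloc((void**)&buffers[s], sizeof(float) * n);
  }

  // Each launch goes to its own stream, so the hardware is free to run these
  // kernels concurrently instead of strictly one after another.
  for (int s = 0; s < kNumStreams; ++s) {
    DummyWork<<<(n + 255) / 256, 256, 0, streams[s]>>>(buffers[s], n);
  }

  for (int s = 0; s < kNumStreams; ++s) {
    cudaStreamSynchronize(streams[s]);
    cudaFree(buffers[s]);
    cudaStreamDestroy(streams[s]);
  }
  return 0;
}
```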
@ahojnnes I'm not sure I understand why THREADS_PER_BLOCK (https://github.com/colmap/colmap/blob/main/src/colmap/mvs/patch_match_cuda.cu#L45) has to be exactly 32. It is used in a lot of different places, but there does not seem to be any single place that would obviously break for other values. So why does it seem to break for other values?
> > The algorithm imposes that each row or column of an image uses one CUDA thread. Depending on the size of your images, there will be many more cores than can be occupied by the image. You may be able to get more out of your GPU by simply listing the same GPU index of your A100 multiple times (though I have not tried this myself yet).
>
> Basically this seems like the culprit. From talking to people who know more about CUDA than me: even if we don't use all the threads of the GPU, it is unavailable to additional processes (unless using CUDA streams, which apparently are complex).
Yes, this is definitely correct for older architectures. As I said, I don't know about the latest GPU architectures and whether something has changed in the meantime.
The choice of threads per block is related to the "warp size", which you can read up on if you are interested. I'd be surprised if changing this value would improve runtime performance.
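As a small, COLMAP-agnostic illustration of the warp-size point: the warp size can be queried from the device properties, and block sizes are normally chosen as a multiple of it so that no lanes in a warp are left idle.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, /*device=*/0);
  // warpSize has been 32 on every NVIDIA architecture to date; block sizes
  // are usually chosen as a multiple of it (32, 64, 96, ...), which is why
  // 64 and 96 are at least valid launch configurations.
  printf("Warp size: %d\n", prop.warpSize);
  printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
  return 0;
}
```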
@ahojnnes thanks. I'll report back if I'm able to figure something out.
I could verify the observation of @Parskatt, where changing THREADS_PER_BLOCK (and kMaxPatchMatchWindowRadius) from 32 to 96 accelerates it by ~2.8x on one A100 GPU (although I don't know the cause or how it will affect the results).
> I could verify the observation of @Parskatt, where changing THREADS_PER_BLOCK (and kMaxPatchMatchWindowRadius) from 32 to 96 accelerates it by ~2.8x on one A100 GPU (although I don't know the cause or how it will affect the results).
Can you verify that you actually get correct results? I found that the ~3x faster results seemingly come from complete failure (noise results). Still trying to understand why.
@Parskatt after a double check, it seems the results are incorrect, because if I run pycolmap.stereo_fusion, the output is:
```
W20240430 14:41:27.907872 1927978 fusion.cc:335] Could not fuse any points. This is likely caused by incorrect settings - filtering must be enabled for the last call to patch match stereo.
I20240430 14:41:27.908705 1927978 fusion.cc:341] Number of fused points: 0
```
Yeah, it breaks it, but I don't get why. I started looking at some other stuff in the meantime :D
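One common way a hard-coded 32 can silently break for other block sizes, offered only as a hypothesis and not as a description of COLMAP's actual kernels, is warp-synchronous code written for exactly one warp per block. A minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical block-wide sum written for a block of exactly 32 threads (one
// warp). With blockDim.x == 32 the shuffle loop reduces over the whole block;
// with blockDim.x == 96 the three warps never exchange values, so the result
// silently covers only the first warp. This only illustrates how a hard-coded
// 32 can break, it is not a claim about COLMAP's kernels.
__global__ void BlockSumAssumes32(const float* in, float* out) {
  float val = in[blockIdx.x * blockDim.x + threadIdx.x];
  for (int offset = 16; offset > 0; offset /= 2) {
    val += __shfl_down_sync(0xffffffffu, val, offset);
  }
  if (threadIdx.x == 0) {
    out[blockIdx.x] = val;  // full block sum only if the block is one warp
  }
}

int main() {
  const int block = 32;  // change to 96: each out[b] stays 32 instead of 96
  const int grid = 4;
  float *in = nullptr, *out = nullptr;
  cudaMallocManaged((void**)&in, sizeof(float) * block * grid);
  cudaMallocManaged((void**)&out, sizeof(float) * grid);
  for (int i = 0; i < block * grid; ++i) in[i] = 1.0f;
  BlockSumAssumes32<<<grid, block>>>(in, out);
  cudaDeviceSynchronize();
  printf("out[0] = %f (block size %d)\n", out[0], block);
  cudaFree(in);
  cudaFree(out);
  return 0;
}
```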
True, I am "encouraging" some people to build something like stereoanything; hope they make it fast enough lol