Unable to get COLMAP MVS above ~25% GPU power usage
Issue
When running patch_match_stereo on an A100 GPU, the GPU "util" fills up, but the power usage is extremely low, see graph below:
This seems to indicate that there are some CUDA kernels in PatchMatch that take a lot of wall-clock time but are not using the GPU cores effectively.
Some fixes I've attempted
- Tried running multiple MVS processes in parallel (the graph above is from such a run). However, I think the 100% util is basically hard-limiting the performance.
- Tried moving the data to fast RAM (hoping it was an I/O issue); basically no difference.
I'm wondering if this is reproducible across systems. Is there any fix?
System
Docker container with: docker://colmap/colmap:20231001.8
It's fine for me that it's a bit inefficient, but the cluster I'm running on automatically kills jobs that drop below 25% power usage, which is quite frustrating, so I'd like to fix this.
I got things to run faster by upping THREADS_PER_BLOCK from 32 to 96 (64 also works fine) on my fork. This gave me about a 2x speedup. However, it's still only using about 30% power, so it feels like there are major bottlenecks left. What is the likely culprit here, @ahojnnes? I can make changes in my fork and report back if you have any hunches.
The algorithm imposes that each row or column of an image uses one CUDA thread. Depending on the size of your images, there will be many more cores than can be occupied by the image. You may be able to get more out of your GPU by simply listing the same GPU index of your A100 multiple times (though I have not tried this myself yet).
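To make the occupancy point above concrete, here is a minimal, self-contained sketch of a launch where each image column maps to one CUDA thread. The kernel name, image size, and per-pixel work are made up for illustration; this is not COLMAP's actual kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sweep kernel: one thread per image column, each thread walking
// its column from top to bottom. This mirrors the mapping described above but
// is NOT COLMAP's actual implementation.
__global__ void SweepColumnsKernel(float* cost, int width, int height) {
  const int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (col >= width) return;
  for (int row = 0; row < height; ++row) {
    cost[row * width + col] += 1.0f;  // placeholder for the real per-pixel work
  }
}

int main() {
  const int width = 2000, height = 1500;  // example image size
  float* cost = nullptr;
  cudaMalloc((void**)&cost, sizeof(float) * width * height);
  cudaMemset(cost, 0, sizeof(float) * width * height);

  const int threads_per_block = 32;
  const int num_blocks = (width + threads_per_block - 1) / threads_per_block;
  // Only `width` (~2000) threads exist in total, far fewer than the number of
  // resident threads an A100 can sustain, so occupancy stays low no matter how
  // long the kernel runs.
  SweepColumnsKernel<<<num_blocks, threads_per_block>>>(cost, width, height);
  cudaDeviceSynchronize();
  cudaFree(cost);
  return 0;
}
```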
@ahojnnes would this be similar to what I said regarding multiple processes in parallel on the same GPU? Because I had almost no success speeding things up with that approach. I will try my luck profiling the code to see what seems to be taking the most time.
If you tried that, it should be mostly the same as my suggestion above. I am not too familiar with the latest GPU architectures. The A100 does have specialized tensor cores for matrix multiplication that cannot be leveraged for patch match stereo, so this may explain the behavior you see.
Since SweepFromTopToBottom takes up basically the entire computation time, it's difficult to tell, haha. Perhaps I can split up the kernel for the purpose of debugging?
Actually, I used the wrong profiler; apparently the one to use is Nsight Compute (not Nsight Systems). I'll try running the compute version and see if I can get a more detailed report. This seems to be the way to do it.
EDIT2: I think this thread might reveal how stupid I am, I really don't know how to code.
> The algorithm imposes that each row or column of an image uses one CUDA thread. Depending on the size of your images, there will be many more cores than can be occupied by the image. You may be able to get more out of your GPU by simply listing the same GPU index of your A100 multiple times (though I have not tried this myself yet).

Basically this seems like the culprit. From talking to people who know more about CUDA than me: even if we don't use all the threads of the GPU, it is unavailable to additional processes (unless using CUDA streams, which apparently are complex).
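For reference, this is roughly what issuing independent work on multiple CUDA streams from a single process looks like; the kernel here is a placeholder and none of this is COLMAP code. Kernels launched on different non-default streams have no ordering constraint between them and may overlap on the same GPU if resources allow.

```cuda
#include <cuda_runtime.h>

// Stand-in kernel representing independent per-image work.
__global__ void DummyWork(float* data, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
  const int kNumStreams = 4;
  const int n = 1 << 20;
  cudaStream_t streams[kNumStreams];
  float* buffers[kNumStreams];

  for (int s = 0; s < kNumStreams; ++s) {
    cudaStreamCreate(&streams[s]);
    cudaMalloc((void**)&buffers[s], sizeof(float) * n);
  }

  // Each launch goes to its own stream, so the hardware is free to run these
  // kernels concurrently instead of strictly one after another.
  for (int s = 0; s < kNumStreams; ++s) {
    DummyWork<<<(n + 255) / 256, 256, 0, streams[s]>>>(buffers[s], n);
  }

  for (int s = 0; s < kNumStreams; ++s) {
    cudaStreamSynchronize(streams[s]);
    cudaFree(buffers[s]);
    cudaStreamDestroy(streams[s]);
  }
  return 0;
}
```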
@ahojnnes I'm not sure I understand why THREADS_PER_BLOCK (https://github.com/colmap/colmap/blob/main/src/colmap/mvs/patch_match_cuda.cu#L45) has to be exactly 32. It is used in a lot of different places, but there does not seem to be any single place that would obviously break for other values. So why does it seem to break for other values?
> > The algorithm imposes that each row or column of an image uses one CUDA thread. Depending on the size of your images, there will be many more cores than can be occupied by the image. You may be able to get more out of your GPU by simply listing the same GPU index of your A100 multiple times (though I have not tried this myself yet).
>
> Basically this seems like the culprit. From talking to people who know more about CUDA than me: even if we don't use all the threads of the GPU, it is unavailable to additional processes (unless using CUDA streams, which apparently are complex).
Yes, this is definitely correct for older architectures. As I said, I don't know about the latest GPU architectures and whether something has changed in the meantime.
The choice of threads per block is related to the "warp size", which you can read up on if you are interested. I'd be surprised if changing this value would improve runtime performance.
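As a small, COLMAP-agnostic illustration of the warp-size point: the warp size can be queried from the device properties, and block sizes are normally chosen as a multiple of it so that no lanes in a warp are left idle.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, /*device=*/0);
  // warpSize has been 32 on every NVIDIA architecture to date; block sizes
  // are usually chosen as a multiple of it (32, 64, 96, ...), which is why
  // 64 and 96 are at least valid launch configurations.
  printf("Warp size: %d\n", prop.warpSize);
  printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
  return 0;
}
```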
@ahojnnes thanks. I'll report back if I'm able to figure something out.
I could verify the observation of @Parskatt, where changing THREADS_PER_BLOCK (and kMaxPatchMatchWindowRadius) from 32 to 96 accelerates it by ~2.8x on one A100 GPU (although I don't know the cause or how it will affect the results).
> I could verify the observation of @Parskatt, where changing THREADS_PER_BLOCK (and kMaxPatchMatchWindowRadius) from 32 to 96 accelerates it by ~2.8x on one A100 GPU (although I don't know the cause or how it will affect the results).
Can you verify that you actually get correct results? I found that the ~3x faster results seemingly come from complete failure (noise results). Still trying to understand why.
@Parskatt after a double check, it seems the results are incorrect, because if I run pycolmap.stereo_fusion, the output is:
```
W20240430 14:41:27.907872 1927978 fusion.cc:335] Could not fuse any points. This is likely caused by incorrect settings - filtering must be enabled for the last call to patch match stereo.
I20240430 14:41:27.908705 1927978 fusion.cc:341] Number of fused points: 0
```
Yeah, it breaks it, but I don't get why. I started looking at some other stuff in the meantime :D
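One common way a hard-coded 32 can silently break for other block sizes, offered only as a hypothesis and not as a description of COLMAP's actual kernels, is warp-synchronous code written for exactly one warp per block. A minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical block-wide sum written for a block of exactly 32 threads (one
// warp). With blockDim.x == 32 the shuffle loop reduces over the whole block;
// with blockDim.x == 96 the three warps never exchange values, so the result
// silently covers only the first warp. This only illustrates how a hard-coded
// 32 can break, it is not a claim about COLMAP's kernels.
__global__ void BlockSumAssumes32(const float* in, float* out) {
  float val = in[blockIdx.x * blockDim.x + threadIdx.x];
  for (int offset = 16; offset > 0; offset /= 2) {
    val += __shfl_down_sync(0xffffffffu, val, offset);
  }
  if (threadIdx.x == 0) {
    out[blockIdx.x] = val;  // full block sum only if the block is one warp
  }
}

int main() {
  const int block = 32;  // change to 96: each out[b] stays 32 instead of 96
  const int grid = 4;
  float *in = nullptr, *out = nullptr;
  cudaMallocManaged((void**)&in, sizeof(float) * block * grid);
  cudaMallocManaged((void**)&out, sizeof(float) * grid);
  for (int i = 0; i < block * grid; ++i) in[i] = 1.0f;
  BlockSumAssumes32<<<grid, block>>>(in, out);
  cudaDeviceSynchronize();
  printf("out[0] = %f (block size %d)\n", out[0], block);
  cudaFree(in);
  cudaFree(out);
  return 0;
}
```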
True, I am "encouraging" some people to build something like stereoanything; hope they make it fast enough lol