Windows GPU performance much worse than linux

Open mmp opened this issue 3 years ago • 16 comments

For example, crown renders in 7.3s on a 3090 and 97.6s on a 2080 (the 3090 is faster, but not that much faster!).

Looking at the --stats output, the issue seems to be in the OptiX launches; hopefully it's an issue of compiler flags being wrong, inlining not happening as expected, etc.

mmp avatar May 04 '21 00:05 mmp

The 7.3s with the 3090 was on Linux I'm guessing?

Just tested with 06000be on a 3080 on Windows for the crown scene, and it took ~1138s. The stats for it can be found in this text file.

pierremoreau avatar May 04 '21 06:05 pierremoreau

I have found that in CPU mode, performance is better when compiling with MinGW instead of MSVC. Unfortunately, CUDA does not support MinGW, so this is not a 'fix' for GPU mode, although it might be possible to use Clang. Cheers..

pbrt4bounty avatar May 04 '21 07:05 pbrt4bounty

@pierremoreau those times were on Linux (3090) and Windows (2080), indeed. I should add that I was rendering with 64spp for those, which probably explains your timings! Nevertheless, your stats seem to exhibit the same issue of the intersection kernels taking way too much time.

@pbrt4bounty in this case, the issue is almost certainly code that's running on the GPU, so it's likely that wouldn't make a difference (though that isn't yet certain!)

mmp avatar May 04 '21 13:05 mmp

> @pbrt4bounty in this case, the issue is almost certainly code that's running on the GPU, so it's likely that wouldn't make a difference (though that isn't yet certain!)

I meant CPU vs CPU: the MinGW build is the fastest (MSVC: 140 sec, MinGW: 86 sec, on the same scene).

pbrt4bounty avatar May 04 '21 14:05 pbrt4bounty

I can confirm that this improves performance significantly, going from 1138s to 259s on my 3080 when running without --spp 64, and down to ~32s with --spp 64 on the crown scene. So Linux still seems to be 3-4x faster than Windows (could be due to the differences in unified memory support), but that's a lot better than what it was before.

Did the OptiX validation have no impact on the Linux numbers? Since the validation seemed to be enabled on both OSes, I am surprised it would be responsible for most of the performance difference between the two.

pierremoreau avatar May 05 '21 06:05 pierremoreau

Hm, interesting... For crown with --spp 64, I am seeing 6.6s on Linux with a 3090 and 12.7s on Windows with a 2080 Ti; that's roughly what I'd expect given those two GPUs. However, the nsys traces from both still show the GPU periodically going idle for ~1ms, which shouldn't be happening.

Here is what I get from --stats:

Linux (3090)

Wavefront Kernel Profile:
  Generate Camera rays                                128 launches     89.81 ms /   1.4% (avg  0.702, min  0.590, max   0.824)
  Reset queues before tracing rays                  12928 launches     52.01 ms /   0.8% (avg  0.004, min  0.003, max   0.005)
  Generate ray samples - HaltonSampler              12928 launches    352.87 ms /   5.3% (avg  0.027, min  0.015, max   0.176)
  Tracing closest hit rays                          12928 launches   1442.93 ms /  21.9% (avg  0.112, min  0.043, max  14.738)
  Sample medium interaction                         12928 launches    277.36 ms /   4.2% (avg  0.021, min  0.015, max   0.032)
  Sample direct/indirect - Henyey Greenstein        12800 launches    197.64 ms /   3.0% (avg  0.015, min  0.015, max   0.017)
  Handle emitters hit by indirect rays              12928 launches    253.84 ms /   3.8% (avg  0.020, min  0.015, max   0.044)
  CoatedDiffuseMaterial + BxDF Eval (Basic tex)     12800 launches   1248.26 ms /  18.9% (avg  0.098, min  0.015, max   1.888)
  ConductorMaterial + BxDF Eval (Basic tex)         12800 launches    567.37 ms /   8.6% (avg  0.044, min  0.017, max   0.283)
  DielectricMaterial + BxDF Eval (Basic tex)        12800 launches    261.78 ms /   4.0% (avg  0.020, min  0.015, max   0.047)
  DiffuseMaterial + BxDF Eval (Basic tex)           12800 launches    835.63 ms /  12.7% (avg  0.065, min  0.015, max   1.532)
  Tracing shadow Tr rays                            12800 launches    891.22 ms /  13.5% (avg  0.070, min  0.023, max   1.545)
  Reset shadowRayQueue                              12800 launches     51.95 ms /   0.8% (avg  0.004, min  0.003, max   0.005)
  Update indirect ray stats                         12800 launches     52.81 ms /   0.8% (avg  0.004, min  0.003, max   0.006)
  Update Film                                         128 launches     25.49 ms /   0.4% (avg  0.199, min  0.197, max   0.204)
  Other                                               256 launches      0.98 ms /   0.0% (avg  0.004)

Total rendering time:   6601.93 ms

Windows (2080Ti)

Wavefront Kernel Profile:
  Generate Camera rays                                128 launches    126.60 ms /   1.0% (avg  0.989, min  0.890, max   1.113)
  Reset queues before tracing rays                  12928 launches    157.46 ms /   1.2% (avg  0.012, min  0.003, max   0.910)
  Generate ray samples - HaltonSampler              12928 launches    424.06 ms /   3.3% (avg  0.033, min  0.015, max   0.439)
  Tracing closest hit rays                          12928 launches   3747.61 ms /  29.4% (avg  0.290, min  0.065, max  31.043)
  Sample medium interaction                         12928 launches    395.34 ms /   3.1% (avg  0.031, min  0.015, max   0.753)
  Sample direct/indirect - Henyey Greenstein        12800 launches    304.85 ms /   2.4% (avg  0.024, min  0.014, max   0.731)
  Handle emitters hit by indirect rays              12928 launches    318.90 ms /   2.5% (avg  0.025, min  0.014, max   0.736)
  CoatedDiffuseMaterial + BxDF Eval (Basic tex)     12800 launches   1944.34 ms /  15.3% (avg  0.152, min  0.015, max   3.812)
  ConductorMaterial + BxDF Eval (Basic tex)         12800 launches    859.05 ms /   6.7% (avg  0.067, min  0.016, max   0.734)
  DielectricMaterial + BxDF Eval (Basic tex)        12800 launches    315.28 ms /   2.5% (avg  0.025, min  0.015, max   0.467)
  DiffuseMaterial + BxDF Eval (Basic tex)           12800 launches   1616.21 ms /  12.7% (avg  0.126, min  0.015, max   3.756)
  Tracing shadow Tr rays                            12800 launches   2256.75 ms /  17.7% (avg  0.176, min  0.025, max   2.935)
  Reset shadowRayQueue                              12800 launches    167.94 ms /   1.3% (avg  0.013, min  0.003, max   0.739)
  Update indirect ray stats                         12800 launches     61.09 ms /   0.5% (avg  0.005, min  0.003, max   0.698)
  Update Film                                         128 launches     43.36 ms /   0.3% (avg  0.339, min  0.335, max   0.344)
  Other                                               256 launches      6.78 ms /   0.1% (avg  0.026)

Total rendering time:  12745.59 ms

Things are mostly proportional, though on Windows the OptiX kernels and the queue resets seem disproportionately slow. If you can capture these on your system, that'd be interesting, since it'd be the same GPU for both, which would make any issues more clear...

mmp avatar May 05 '21 14:05 mmp
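
As a rough cross-check of the "mostly proportional" observation, the per-kernel totals from the two profiles above can be divided row by row. The sketch below is not part of pbrt; `parse_profile` and `slowdown` are hypothetical helper names, and only a handful of rows are pasted in. Note that the two runs use different GPUs, so the ratios mix hardware and OS effects.

```python
import re

# One row of the "Wavefront Kernel Profile" table:
# kernel name, launch count, total milliseconds.
ROW = re.compile(r"^\s*(.+?)\s+(\d+) launches\s+([\d.]+) ms")

def parse_profile(text):
    """Return {kernel name: total ms} for each row of a --stats kernel profile."""
    totals = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            totals[m.group(1)] = float(m.group(3))
    return totals

def slowdown(baseline, other):
    """Per-kernel ratio other/baseline for kernels present in both profiles."""
    return {k: other[k] / baseline[k] for k in baseline if k in other}

# Rows copied verbatim from the Linux (3090) profile above.
linux_3090 = """
  Generate Camera rays                                128 launches     89.81 ms /   1.4% (avg  0.702, min  0.590, max   0.824)
  Reset queues before tracing rays                  12928 launches     52.01 ms /   0.8% (avg  0.004, min  0.003, max   0.005)
  Tracing closest hit rays                          12928 launches   1442.93 ms /  21.9% (avg  0.112, min  0.043, max  14.738)
  CoatedDiffuseMaterial + BxDF Eval (Basic tex)     12800 launches   1248.26 ms /  18.9% (avg  0.098, min  0.015, max   1.888)
  Tracing shadow Tr rays                            12800 launches    891.22 ms /  13.5% (avg  0.070, min  0.023, max   1.545)
"""

# The same rows from the Windows (2080Ti) profile above.
windows_2080ti = """
  Generate Camera rays                                128 launches    126.60 ms /   1.0% (avg  0.989, min  0.890, max   1.113)
  Reset queues before tracing rays                  12928 launches    157.46 ms /   1.2% (avg  0.012, min  0.003, max   0.910)
  Tracing closest hit rays                          12928 launches   3747.61 ms /  29.4% (avg  0.290, min  0.065, max  31.043)
  CoatedDiffuseMaterial + BxDF Eval (Basic tex)     12800 launches   1944.34 ms /  15.3% (avg  0.152, min  0.015, max   3.812)
  Tracing shadow Tr rays                            12800 launches   2256.75 ms /  17.7% (avg  0.176, min  0.025, max   2.935)
"""

ratios = slowdown(parse_profile(linux_3090), parse_profile(windows_2080ti))

# Largest slowdowns first.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{ratio:5.2f}x  {name}")
```

For these rows, the two trace kernels and the queue reset come out at roughly 2.5x to 3x, against roughly 1.4x to 1.6x for camera ray generation and the coated diffuse material, and against about 1.9x for total rendering time (12745.59 ms vs 6601.93 ms).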

Also, OptiX validation only has about a 5% perf. impact on Linux, which presumably explains why I didn't notice any issues when I enabled it (when the Windows GPU path was broken...)

mmp avatar May 05 '21 14:05 mmp

Regarding Windows (will reboot later to try it out on Linux), I am getting some variation but roughly in line with what you have as well; the following are all with --spp 64 --stats on a 3080 on Windows for the crown scene.

From this morning:

Wavefront Kernel Profile:
  Reset ray queue                                     128 launches     32.92 ms /   0.1% (avg  0.257, min  0.003, max   5.230)
  Generate Camera rays                                128 launches    150.84 ms /   0.5% (avg  1.178, min  0.745, max   6.973)
  Reset queues before tracing rays                  12928 launches   1880.37 ms /   6.1% (avg  0.145, min  0.003, max  11.623)
  Generate ray samples - HaltonSampler              12928 launches    585.17 ms /   1.9% (avg  0.045, min  0.016, max   6.991)
  Tracing closest hit rays                          12928 launches   6945.07 ms /  22.5% (avg  0.537, min  0.068, max  49.108)
  Sample medium interaction                         12928 launches   2220.99 ms /   7.2% (avg  0.172, min  0.015, max  12.036)
  Sample direct/indirect - Henyey Greenstein        12800 launches   2055.74 ms /   6.7% (avg  0.161, min  0.015, max   8.569)
  Handle emitters hit by indirect rays              12928 launches    492.10 ms /   1.6% (avg  0.038, min  0.015, max   6.397)
  CoatedDiffuseMaterial + BxDF Eval (Basic tex)     12800 launches   3858.73 ms /  12.5% (avg  0.301, min  0.015, max  13.389)
  ConductorMaterial + BxDF Eval (Basic tex)         12800 launches   2563.04 ms /   8.3% (avg  0.200, min  0.016, max  10.193)
  DielectricMaterial + BxDF Eval (Basic tex)        12800 launches    442.02 ms /   1.4% (avg  0.035, min  0.015, max   6.563)
  DiffuseMaterial + BxDF Eval (Basic tex)           12800 launches   3225.28 ms /  10.5% (avg  0.252, min  0.015, max  10.830)
  Tracing shadow Tr rays                            12800 launches   4148.21 ms /  13.4% (avg  0.324, min  0.027, max   9.917)
  Reset shadowRayQueue                              12800 launches   2029.66 ms /   6.6% (avg  0.159, min  0.003, max   8.155)
  Update indirect ray stats                         12800 launches    183.01 ms /   0.6% (avg  0.014, min  0.003, max   5.804)
  Update Film                                         128 launches     34.20 ms /   0.1% (avg  0.267, min  0.254, max   0.604)
  Other                                               128 launches      1.65 ms /   0.0% (avg  0.013)

Same binary as this morning, but a fresh run:

Wavefront Kernel Profile:
  Generate Camera rays                                128 launches    134.88 ms /   0.4% (avg  1.054, min  0.830, max   4.106)
  Reset queues before tracing rays                  12928 launches   2184.85 ms /   7.2% (avg  0.169, min  0.003, max   6.070)
  Generate ray samples - HaltonSampler              12928 launches    868.41 ms /   2.9% (avg  0.067, min  0.015, max   5.418)
  Tracing closest hit rays                          12928 launches   5010.56 ms /  16.6% (avg  0.388, min  0.063, max  48.365)
  Sample medium interaction                         12928 launches   1918.20 ms /   6.4% (avg  0.148, min  0.015, max   4.923)
  Sample direct/indirect - Henyey Greenstein        12800 launches   2325.03 ms /   7.7% (avg  0.182, min  0.015, max   7.705)
  Handle emitters hit by indirect rays              12928 launches    872.42 ms /   2.9% (avg  0.067, min  0.014, max   4.019)
  CoatedDiffuseMaterial + BxDF Eval (Basic tex)     12800 launches   3412.91 ms /  11.3% (avg  0.267, min  0.015, max   5.111)
  ConductorMaterial + BxDF Eval (Basic tex)         12800 launches   2807.96 ms /   9.3% (avg  0.219, min  0.016, max   6.351)
  DielectricMaterial + BxDF Eval (Basic tex)        12800 launches    830.74 ms /   2.8% (avg  0.065, min  0.015, max   4.686)
  DiffuseMaterial + BxDF Eval (Basic tex)           12800 launches   2883.01 ms /   9.5% (avg  0.225, min  0.015, max   5.145)
  Tracing shadow Tr rays                            12800 launches   4629.77 ms /  15.3% (avg  0.362, min  0.027, max   6.450)
  Reset shadowRayQueue                              12800 launches   1705.00 ms /   5.6% (avg  0.133, min  0.003, max   5.462)
  Update indirect ray stats                         12800 launches    552.64 ms /   1.8% (avg  0.043, min  0.003, max   4.604)
  Update Film                                         128 launches     41.68 ms /   0.1% (avg  0.326, min  0.257, max   2.143)
  Other                                               256 launches     16.05 ms /   0.1% (avg  0.063)

I'll try testing with only the 3080 plugged in and see if it makes any difference, though PBRT has the CUDA_VISIBLE_DEVICES workaround already built in.

pierremoreau avatar May 05 '21 15:05 pierremoreau
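
Besides physically unplugging a card, a second GPU can be hidden from a single run through the `CUDA_VISIBLE_DEVICES` environment variable. The sketch below only sets the variable for a child process; the scene path and the device index are assumptions, and the helper names `pbrt_command` and `env_for_device` are hypothetical. The `--gpu`, `--spp` and `--stats` options are the ones used for the runs in this thread.

```python
import os
import shutil
import subprocess

def pbrt_command(scene, spp=64):
    """Command line for a GPU render with a fixed sample count and kernel stats."""
    return ["pbrt", "--gpu", "--spp", str(spp), "--stats", scene]

def env_for_device(index):
    """Copy of the current environment with a single CUDA device visible."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(index)
    return env

scene = "crown/crown.pbrt"  # hypothetical path to the crown scene
cmd = pbrt_command(scene)
env = env_for_device(0)     # assumption: index 0 is the 3080 on this machine

# Only launch when pbrt is on PATH and the scene file exists; otherwise just report.
if shutil.which("pbrt") and os.path.exists(scene):
    subprocess.run(cmd, env=env, check=True)
else:
    print("would run:", " ".join(cmd),
          "with CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"])
```

Which index maps to which card depends on the CUDA device ordering on the machine, so the index should be checked before trusting the result.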

This is "interesting" in your numbers:

  Reset queues before tracing rays                  12928 launches   1880.37 ms /   6.1% (avg  0.145, min  0.003, max  11.623)

(And 7.2% in your second run.) That's way higher than I'm seeing on either Linux or Windows, and it really should be in the noise as far as runtime.

mmp avatar May 05 '21 16:05 mmp

With the latest version of Nsight Systems (the older version complained that my driver was too recent for it), I am no longer seeing the GPU going idle during rendering on Windows, which is good news. However, I am still troubled by those long "reset queues" times you're seeing, @pierremoreau...

mmp avatar May 05 '21 18:05 mmp

Here are the results from the same system, also running CUDA 11.2 and OptiX 7.2, but on Linux (and it rendered in 7-8s):

Wavefront Kernel Profile:
  Generate Camera rays                                128 launches    116.32 ms /   1.6% (avg  0.909, min  0.726, max   1.233)
  Reset queues before tracing rays                  12928 launches     62.46 ms /   0.8% (avg  0.005, min  0.003, max   0.243)
  Generate ray samples - HaltonSampler              12928 launches    381.84 ms /   5.1% (avg  0.030, min  0.015, max   0.554)
  Tracing closest hit rays                          12928 launches   1665.66 ms /  22.3% (avg  0.129, min  0.044, max  13.206)
  Sample medium interaction                         12928 launches    300.89 ms /   4.0% (avg  0.023, min  0.015, max   0.340)
  Sample direct/indirect - Henyey Greenstein        12800 launches    217.17 ms /   2.9% (avg  0.017, min  0.015, max   0.332)
  Handle emitters hit by indirect rays              12928 launches    271.53 ms /   3.6% (avg  0.021, min  0.015, max   0.370)
  CoatedDiffuseMaterial + BxDF Eval (Basic tex)     12800 launches   1400.24 ms /  18.7% (avg  0.109, min  0.015, max   2.952)
  ConductorMaterial + BxDF Eval (Basic tex)         12800 launches    623.56 ms /   8.3% (avg  0.049, min  0.016, max   0.664)
  DielectricMaterial + BxDF Eval (Basic tex)        12800 launches    280.92 ms /   3.8% (avg  0.022, min  0.015, max   0.368)
  DiffuseMaterial + BxDF Eval (Basic tex)           12800 launches    988.27 ms /  13.2% (avg  0.077, min  0.015, max   2.280)
  Tracing shadow Tr rays                            12800 launches   1010.32 ms /  13.5% (avg  0.079, min  0.026, max   1.896)
  Reset shadowRayQueue                              12800 launches     61.77 ms /   0.8% (avg  0.005, min  0.003, max   0.245)
  Update indirect ray stats                         12800 launches     62.12 ms /   0.8% (avg  0.005, min  0.004, max   0.223)
  Update Film                                         128 launches     32.88 ms /   0.4% (avg  0.257, min  0.241, max   0.556)
  Other                                               256 launches      1.15 ms /   0.0% (avg  0.004)

I ran with the 1080 Ti unplugged for this run. I also did a run on Windows without the 1080 Ti, but the numbers were quite close to the ones I got this morning.

pierremoreau avatar May 05 '21 19:05 pierremoreau
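
With both operating systems measured on the same 3080, the Windows/Linux ratio per kernel no longer mixes in hardware differences. The sketch below is plain arithmetic on totals copied from the profiles above (the Windows values are from the "this morning" run); the variable names are hypothetical and only a few rows are included.

```python
# (Windows ms, Linux ms) totals on the same 3080, copied from the profiles above.
totals_ms = {
    "Generate Camera rays": (150.84, 116.32),
    "Reset queues before tracing rays": (1880.37, 62.46),
    "Tracing closest hit rays": (6945.07, 1665.66),
    "CoatedDiffuseMaterial + BxDF Eval (Basic tex)": (3858.73, 1400.24),
    "Reset shadowRayQueue": (2029.66, 61.77),
}

# Windows time divided by Linux time for each kernel.
ratios_3080 = {name: win / lin for name, (win, lin) in totals_ms.items()}

# Largest slowdowns first.
for name, ratio in sorted(ratios_3080.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{ratio:6.2f}x  {name}")
```

For these rows, the two queue-reset kernels are roughly 30x slower on Windows and the closest-hit trace is roughly 4x slower, while camera ray generation is only about 1.3x slower.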

It looks like your Linux 3080 numbers are generally 10-15% slower than my Linux 3090 numbers, which is roughly the difference I'd expect between those two GPUs. So it seems we are left with just Windows still being off.

mmp avatar May 05 '21 21:05 mmp

Would an Nsight Systems trace on Windows help, or something else?

pierremoreau avatar May 06 '21 06:05 pierremoreau

Sure, that'd be interesting to take a look at.

mmp avatar May 06 '21 16:05 mmp

I'll try to gather one over the weekend.

pierremoreau avatar May 07 '21 19:05 pierremoreau

How do you make an Nsight Systems trace? I tried using the same setup I had in the past, but the profiling stops 196ms after it starts (apparently because the last profiled process exited), so no CUDA events are collected at all (and very few events overall). The only thing I can see in the timeline view is that all CPU threads seem to be waiting on some user request (see screenshot below); from the logs I can see that the rendering did start, so it should not be an issue with my command-line arguments.

[Screenshot of Nsight Systems' timeline view]

pierremoreau avatar May 11 '21 06:05 pierremoreau