pbrt-v4
Windows GPU performance much worse than linux
For example, crown renders in 7.3s on a 3090 and 97.6s on a 2080 (the 3090 is faster, but it's not that much faster!).
Looking at the --stats output, the issue seems to be in the OptiX launches; hopefully it's an issue of compiler flags being wrong, inlining not happening as expected, etc.
The 7.3s with the 3090 was on Linux I'm guessing?
Just tested with 06000be on a 3080 on Windows for the crown scene, and it took ~1138s. The stats for it can be found in this text file.
I have found that in CPU mode, performance is better when compiling with MinGW instead of MSVC. Unfortunately, CUDA does not support MinGW, so this is not a 'fix' for GPU mode, although it might be possible to use Clang. Cheers.
@pierremoreau those times were on Linux (3090) and Windows (2080), indeed. I should add that I was rendering with 64spp for those, which probably explains your timings! Nevertheless, your stats seem to exhibit the same issue of the intersection kernels taking way too much time.
@pbrt4bounty in this case, the issue is almost certainly code that's running on the GPU, so it's likely that wouldn't make a difference (though that isn't yet certain!)
I meant CPU vs. CPU: the MinGW build is the fastest. On the same scene, MSVC takes 140 sec and MinGW takes 86 sec.
I can confirm that this improves performance significantly, going from 1138s to 259s on my 3080 when running without --spp 64, and down to ~32s with --spp 64 on the crown scene. So Linux still seems to be 3-4x faster than Windows (could be due to the differences in unified memory support), but that's a lot better than what it was before.
Did the OptiX validation have no impact on the Linux numbers? Since the validation seemed to be enabled on both OSes, I am surprised it would be responsible for most of the performance difference between the two.
Hm, interesting... For crown with --spp 64, I am seeing 6.6s on Linux with a 3090 and 12.7s in Windows with a 2080 Ti--that's roughly what I'd expect given those two GPUs. However, the nsys traces from both are still showing that the GPU is periodically going idle for ~1ms with both of them, which shouldn't be happening.
Here is what I get from --stats:
Linux (3090)
Wavefront Kernel Profile:
Generate Camera rays 128 launches 89.81 ms / 1.4% (avg 0.702, min 0.590, max 0.824)
Reset queues before tracing rays 12928 launches 52.01 ms / 0.8% (avg 0.004, min 0.003, max 0.005)
Generate ray samples - HaltonSampler 12928 launches 352.87 ms / 5.3% (avg 0.027, min 0.015, max 0.176)
Tracing closest hit rays 12928 launches 1442.93 ms / 21.9% (avg 0.112, min 0.043, max 14.738)
Sample medium interaction 12928 launches 277.36 ms / 4.2% (avg 0.021, min 0.015, max 0.032)
Sample direct/indirect - Henyey Greenstein 12800 launches 197.64 ms / 3.0% (avg 0.015, min 0.015, max 0.017)
Handle emitters hit by indirect rays 12928 launches 253.84 ms / 3.8% (avg 0.020, min 0.015, max 0.044)
CoatedDiffuseMaterial + BxDF Eval (Basic tex) 12800 launches 1248.26 ms / 18.9% (avg 0.098, min 0.015, max 1.888)
ConductorMaterial + BxDF Eval (Basic tex) 12800 launches 567.37 ms / 8.6% (avg 0.044, min 0.017, max 0.283)
DielectricMaterial + BxDF Eval (Basic tex) 12800 launches 261.78 ms / 4.0% (avg 0.020, min 0.015, max 0.047)
DiffuseMaterial + BxDF Eval (Basic tex) 12800 launches 835.63 ms / 12.7% (avg 0.065, min 0.015, max 1.532)
Tracing shadow Tr rays 12800 launches 891.22 ms / 13.5% (avg 0.070, min 0.023, max 1.545)
Reset shadowRayQueue 12800 launches 51.95 ms / 0.8% (avg 0.004, min 0.003, max 0.005)
Update indirect ray stats 12800 launches 52.81 ms / 0.8% (avg 0.004, min 0.003, max 0.006)
Update Film 128 launches 25.49 ms / 0.4% (avg 0.199, min 0.197, max 0.204)
Other 256 launches 0.98 ms / 0.0% (avg 0.004)
Total rendering time: 6601.93 ms
Windows (2080 Ti)
Wavefront Kernel Profile:
Generate Camera rays 128 launches 126.60 ms / 1.0% (avg 0.989, min 0.890, max 1.113)
Reset queues before tracing rays 12928 launches 157.46 ms / 1.2% (avg 0.012, min 0.003, max 0.910)
Generate ray samples - HaltonSampler 12928 launches 424.06 ms / 3.3% (avg 0.033, min 0.015, max 0.439)
Tracing closest hit rays 12928 launches 3747.61 ms / 29.4% (avg 0.290, min 0.065, max 31.043)
Sample medium interaction 12928 launches 395.34 ms / 3.1% (avg 0.031, min 0.015, max 0.753)
Sample direct/indirect - Henyey Greenstein 12800 launches 304.85 ms / 2.4% (avg 0.024, min 0.014, max 0.731)
Handle emitters hit by indirect rays 12928 launches 318.90 ms / 2.5% (avg 0.025, min 0.014, max 0.736)
CoatedDiffuseMaterial + BxDF Eval (Basic tex) 12800 launches 1944.34 ms / 15.3% (avg 0.152, min 0.015, max 3.812)
ConductorMaterial + BxDF Eval (Basic tex) 12800 launches 859.05 ms / 6.7% (avg 0.067, min 0.016, max 0.734)
DielectricMaterial + BxDF Eval (Basic tex) 12800 launches 315.28 ms / 2.5% (avg 0.025, min 0.015, max 0.467)
DiffuseMaterial + BxDF Eval (Basic tex) 12800 launches 1616.21 ms / 12.7% (avg 0.126, min 0.015, max 3.756)
Tracing shadow Tr rays 12800 launches 2256.75 ms / 17.7% (avg 0.176, min 0.025, max 2.935)
Reset shadowRayQueue 12800 launches 167.94 ms / 1.3% (avg 0.013, min 0.003, max 0.739)
Update indirect ray stats 12800 launches 61.09 ms / 0.5% (avg 0.005, min 0.003, max 0.698)
Update Film 128 launches 43.36 ms / 0.3% (avg 0.339, min 0.335, max 0.344)
Other 256 launches 6.78 ms / 0.1% (avg 0.026)
Total rendering time: 12745.59 ms
Things are mostly proportional, though on Windows the OptiX kernels and the queue resets seem disproportionately slow. If you can capture these on your system, that'd be interesting, since it'd be the same GPU for both, which would make any issues more clear...
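To make the "mostly proportional" comparison concrete, here is a rough sketch of a script that parses two `Wavefront Kernel Profile` blocks and prints the per-kernel slowdown. The regular expression is only an assumption based on the lines pasted in this thread, not pbrt's documented output format, and the function names are made up for illustration. The sample data is a handful of lines copied from the two profiles above.

```python
import re

# Assumed line shape, inferred from the --stats output pasted in this thread:
#   <kernel name> <N> launches <total> ms / <pct>% (avg ..., min ..., max ...)
LINE_RE = re.compile(
    r"^\s*(?P<name>.+?)\s+(?P<launches>\d+) launches\s+"
    r"(?P<ms>\d+(?:\.\d+)?) ms\s*/\s*(?P<pct>\d+(?:\.\d+)?)%"
)

def parse_profile(text):
    """Map kernel name -> (launch count, total milliseconds)."""
    kernels = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            kernels[m["name"]] = (int(m["launches"]), float(m["ms"]))
    return kernels

def slowdown(baseline, other):
    """Ratio other/baseline of total time, for kernels present in both."""
    return {
        name: other[name][1] / baseline[name][1]
        for name in baseline
        if name in other and baseline[name][1] > 0
    }

# A few lines copied from the Linux (3090) and Windows (2080 Ti) runs above.
linux_3090 = """\
Generate Camera rays 128 launches 89.81 ms / 1.4% (avg 0.702, min 0.590, max 0.824)
Reset queues before tracing rays 12928 launches 52.01 ms / 0.8% (avg 0.004, min 0.003, max 0.005)
Tracing closest hit rays 12928 launches 1442.93 ms / 21.9% (avg 0.112, min 0.043, max 14.738)
CoatedDiffuseMaterial + BxDF Eval (Basic tex) 12800 launches 1248.26 ms / 18.9% (avg 0.098, min 0.015, max 1.888)
Other 256 launches 0.98 ms / 0.0% (avg 0.004)
"""
windows_2080ti = """\
Generate Camera rays 128 launches 126.60 ms / 1.0% (avg 0.989, min 0.890, max 1.113)
Reset queues before tracing rays 12928 launches 157.46 ms / 1.2% (avg 0.012, min 0.003, max 0.910)
Tracing closest hit rays 12928 launches 3747.61 ms / 29.4% (avg 0.290, min 0.065, max 31.043)
CoatedDiffuseMaterial + BxDF Eval (Basic tex) 12800 launches 1944.34 ms / 15.3% (avg 0.152, min 0.015, max 3.812)
Other 256 launches 6.78 ms / 0.1% (avg 0.026)
"""

ratios = slowdown(parse_profile(linux_3090), parse_profile(windows_2080ti))

# Print the kernels with the largest slowdown first.
for name, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{ratio:5.2f}x  {name}")
```

Kernels whose ratio sits well above the overall ratio of the two total rendering times (12745.59 ms vs. 6601.93 ms) are the ones worth a closer look. Running the same script on profiles from a single GPU under both OSes would take the hardware difference out of the comparison.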
Also, OptiX validation only has about a 5% perf. impact on Linux, which presumably explains why I didn't notice any issues when I enabled it (when the Windows GPU path was broken...)
Regarding Windows (will reboot later to try it out on Linux), I am getting some variation but roughly in line with what you have as well; the following are all with --spp 64 --stats on a 3080 on Windows for the crown scene.
From this morning:
Wavefront Kernel Profile:
Reset ray queue 128 launches 32.92 ms / 0.1% (avg 0.257, min 0.003, max 5.230)
Generate Camera rays 128 launches 150.84 ms / 0.5% (avg 1.178, min 0.745, max 6.973)
Reset queues before tracing rays 12928 launches 1880.37 ms / 6.1% (avg 0.145, min 0.003, max 11.623)
Generate ray samples - HaltonSampler 12928 launches 585.17 ms / 1.9% (avg 0.045, min 0.016, max 6.991)
Tracing closest hit rays 12928 launches 6945.07 ms / 22.5% (avg 0.537, min 0.068, max 49.108)
Sample medium interaction 12928 launches 2220.99 ms / 7.2% (avg 0.172, min 0.015, max 12.036)
Sample direct/indirect - Henyey Greenstein 12800 launches 2055.74 ms / 6.7% (avg 0.161, min 0.015, max 8.569)
Handle emitters hit by indirect rays 12928 launches 492.10 ms / 1.6% (avg 0.038, min 0.015, max 6.397)
CoatedDiffuseMaterial + BxDF Eval (Basic tex) 12800 launches 3858.73 ms / 12.5% (avg 0.301, min 0.015, max 13.389)
ConductorMaterial + BxDF Eval (Basic tex) 12800 launches 2563.04 ms / 8.3% (avg 0.200, min 0.016, max 10.193)
DielectricMaterial + BxDF Eval (Basic tex) 12800 launches 442.02 ms / 1.4% (avg 0.035, min 0.015, max 6.563)
DiffuseMaterial + BxDF Eval (Basic tex) 12800 launches 3225.28 ms / 10.5% (avg 0.252, min 0.015, max 10.830)
Tracing shadow Tr rays 12800 launches 4148.21 ms / 13.4% (avg 0.324, min 0.027, max 9.917)
Reset shadowRayQueue 12800 launches 2029.66 ms / 6.6% (avg 0.159, min 0.003, max 8.155)
Update indirect ray stats 12800 launches 183.01 ms / 0.6% (avg 0.014, min 0.003, max 5.804)
Update Film 128 launches 34.20 ms / 0.1% (avg 0.267, min 0.254, max 0.604)
Other 128 launches 1.65 ms / 0.0% (avg 0.013)
Same binary as this morning's run, but run anew:
Wavefront Kernel Profile:
Generate Camera rays 128 launches 134.88 ms / 0.4% (avg 1.054, min 0.830, max 4.106)
Reset queues before tracing rays 12928 launches 2184.85 ms / 7.2% (avg 0.169, min 0.003, max 6.070)
Generate ray samples - HaltonSampler 12928 launches 868.41 ms / 2.9% (avg 0.067, min 0.015, max 5.418)
Tracing closest hit rays 12928 launches 5010.56 ms / 16.6% (avg 0.388, min 0.063, max 48.365)
Sample medium interaction 12928 launches 1918.20 ms / 6.4% (avg 0.148, min 0.015, max 4.923)
Sample direct/indirect - Henyey Greenstein 12800 launches 2325.03 ms / 7.7% (avg 0.182, min 0.015, max 7.705)
Handle emitters hit by indirect rays 12928 launches 872.42 ms / 2.9% (avg 0.067, min 0.014, max 4.019)
CoatedDiffuseMaterial + BxDF Eval (Basic tex) 12800 launches 3412.91 ms / 11.3% (avg 0.267, min 0.015, max 5.111)
ConductorMaterial + BxDF Eval (Basic tex) 12800 launches 2807.96 ms / 9.3% (avg 0.219, min 0.016, max 6.351)
DielectricMaterial + BxDF Eval (Basic tex) 12800 launches 830.74 ms / 2.8% (avg 0.065, min 0.015, max 4.686)
DiffuseMaterial + BxDF Eval (Basic tex) 12800 launches 2883.01 ms / 9.5% (avg 0.225, min 0.015, max 5.145)
Tracing shadow Tr rays 12800 launches 4629.77 ms / 15.3% (avg 0.362, min 0.027, max 6.450)
Reset shadowRayQueue 12800 launches 1705.00 ms / 5.6% (avg 0.133, min 0.003, max 5.462)
Update indirect ray stats 12800 launches 552.64 ms / 1.8% (avg 0.043, min 0.003, max 4.604)
Update Film 128 launches 41.68 ms / 0.1% (avg 0.326, min 0.257, max 2.143)
Other 256 launches 16.05 ms / 0.1% (avg 0.063)
I'll try testing with only the 3080 plugged in and see if it makes any difference, though PBRT has the CUDA_VISIBLE_DEVICES workaround already built in.
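For anyone who wants to rule out the second GPU without unplugging it, a minimal sketch of restricting the process to one device from the outside via the `CUDA_VISIBLE_DEVICES` environment variable follows. The device index `0`, the `pbrt` executable name, the `crown.pbrt` path, and the helper names are all assumptions for illustration; which index maps to which card depends on the system. `--spp 64` and `--stats` are the options used elsewhere in this thread.

```python
import os
import subprocess

def env_for_single_gpu(device_index, base_env=None):
    """Copy the environment and expose only one CUDA device to the child."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return env

def render(cmd, env):
    """Run the renderer with the restricted environment (not called here)."""
    return subprocess.run(cmd, env=env, check=True)

# Hypothetical command line: executable name and scene path are placeholders.
cmd = ["pbrt", "--gpu", "--spp", "64", "--stats", "crown.pbrt"]

# Assumption: index 0 is the GPU under test on this machine.
env = env_for_single_gpu(0, base_env={"PATH": "/usr/bin"})

# render(cmd, env)  # uncomment on a machine with pbrt and the scene available
```

This only controls which devices the CUDA runtime enumerates for that process; it says nothing about whether the other card still affects the display driver or scheduling on Windows.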
This is "interesting" in your numbers:
Reset queues before tracing rays 12928 launches 1880.37 ms / 6.1% (avg 0.145, min 0.003, max 11.623)
(And 7.2% in your second run.) That's way higher than I'm seeing on either Linux or Windows, and it really should be in the noise as far as runtime.
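As a quick sanity check on how far off that is, here is a small sketch that recomputes the average cost per launch of the "Reset queues before tracing rays" kernel from the totals posted so far. The dictionary keys are just labels chosen for this example; the numbers are copied from the profiles above.

```python
# (launch count, total ms) for "Reset queues before tracing rays",
# copied from the three profiles posted above.
reset_queues = {
    "Linux 3090": (12928, 52.01),
    "Windows 2080 Ti": (12928, 157.46),
    "Windows 3080 (morning run)": (12928, 1880.37),
}

def avg_ms_per_launch(launches, total_ms):
    """Average kernel time per launch, in milliseconds."""
    return total_ms / launches

averages = {
    label: avg_ms_per_launch(launches, total_ms)
    for label, (launches, total_ms) in reset_queues.items()
}

baseline = averages["Linux 3090"]
for label, avg in averages.items():
    print(f"{label:28s} {avg * 1000:8.2f} us/launch  ({avg / baseline:5.1f}x Linux 3090)")
```

The launch counts are identical in all three runs, so the gap comes entirely from the per-launch cost of what should be a trivial kernel, not from extra launches.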
With the latest version of Nsight Systems (catching up to my driver version, which the older version complained was too recent for it), I am no longer seeing the GPU going idle during rendering on Windows, which is good news. However, I am still troubled by those long "reset queues" times you're seeing, @pierremoreau...
Here are the results from the same system, also running CUDA 11.2 and OptiX 7.2, but on Linux (and it rendered in 7-8s):
Wavefront Kernel Profile:
Generate Camera rays 128 launches 116.32 ms / 1.6% (avg 0.909, min 0.726, max 1.233)
Reset queues before tracing rays 12928 launches 62.46 ms / 0.8% (avg 0.005, min 0.003, max 0.243)
Generate ray samples - HaltonSampler 12928 launches 381.84 ms / 5.1% (avg 0.030, min 0.015, max 0.554)
Tracing closest hit rays 12928 launches 1665.66 ms / 22.3% (avg 0.129, min 0.044, max 13.206)
Sample medium interaction 12928 launches 300.89 ms / 4.0% (avg 0.023, min 0.015, max 0.340)
Sample direct/indirect - Henyey Greenstein 12800 launches 217.17 ms / 2.9% (avg 0.017, min 0.015, max 0.332)
Handle emitters hit by indirect rays 12928 launches 271.53 ms / 3.6% (avg 0.021, min 0.015, max 0.370)
CoatedDiffuseMaterial + BxDF Eval (Basic tex) 12800 launches 1400.24 ms / 18.7% (avg 0.109, min 0.015, max 2.952)
ConductorMaterial + BxDF Eval (Basic tex) 12800 launches 623.56 ms / 8.3% (avg 0.049, min 0.016, max 0.664)
DielectricMaterial + BxDF Eval (Basic tex) 12800 launches 280.92 ms / 3.8% (avg 0.022, min 0.015, max 0.368)
DiffuseMaterial + BxDF Eval (Basic tex) 12800 launches 988.27 ms / 13.2% (avg 0.077, min 0.015, max 2.280)
Tracing shadow Tr rays 12800 launches 1010.32 ms / 13.5% (avg 0.079, min 0.026, max 1.896)
Reset shadowRayQueue 12800 launches 61.77 ms / 0.8% (avg 0.005, min 0.003, max 0.245)
Update indirect ray stats 12800 launches 62.12 ms / 0.8% (avg 0.005, min 0.004, max 0.223)
Update Film 128 launches 32.88 ms / 0.4% (avg 0.257, min 0.241, max 0.556)
Other 256 launches 1.15 ms / 0.0% (avg 0.004)
I ran with the 1080 Ti unplugged for this run. I also did a run on Windows without the 1080 Ti, but the numbers were quite close to the ones I got this morning.
It looks like your Linux 3080 numbers are generally 10-15% slower than my Linux 3090 numbers, so that's good as far as that being roughly the difference I'd expect. So it seems we are left with just Windows still being off.
Would a Nsight Systems trace on Windows help, or something else?
Sure, that'd be interesting to take a look at.
I'll try to gather one over the weekend.
How do you make an Nsight Systems trace? I tried using the same setup I had in the past, but the profiling stops 196ms after it starts (apparently because the last profiled process has exited), resulting in no CUDA events being collected at all (and very few events overall). The only thing I can see in the timeline view is that all CPU threads seem to be waiting on some user request (see screenshot below); from the logs I can see that the rendering did start, so it should not be an issue with my command-line arguments.
