
Dr.Jit compiler failure: "Disk cache database error"

Open sagesimhon opened this issue 2 years ago • 2 comments

Getting this issue:

Critical Dr.Jit compiler failure: jit_optix_check(): API error 7012 (OPTIX_ERROR_DISK_CACHE_DATABASE_ERROR): "Disk cache database error" in /project/ext/drjit-core/src/optix_core.cpp:71.

I am running multiple Mitsuba rendering tasks on the same machine, spread across multiple graphics cards. The error above appears upon initialization. Any ideas?

System configuration

System information:

OS: Ubuntu 22.04.3 LTS
CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
GPU: 8 × Tesla V100-SXM2-32GB
Python: 3.11.4 (main, Jul 5 2023, 13:45:01) [GCC 11.2.0]
NVidia driver: 535.104.05
LLVM: 14.0.6

Dr.Jit: 0.4.2
Mitsuba: 3.3.0
Is custom build? False
Compiled with: GNU 10.2.1
Variants: scalar_rgb, scalar_spectral, cuda_ad_rgb, llvm_ad_rgb

Description

I am running multiple Mitsuba rendering tasks on the same machine, spread across multiple graphics cards. The error above appears upon initialization. Any ideas?

Steps to reproduce

  1. Pull the code from https://github.com/sagesimhon/totem_plus.
  2. Update run_generation_machine_distributed.sh to specify the hostnames and number of GPUs for testing, then run it.

sagesimhon, Sep 04 '23 20:09

Hi @sagesimhon

I can't reproduce this, but I only have a single GPU. I wouldn't be surprised if OptiX's cache mechanism were device-dependent and threw an error whenever two different devices tried to access the same cache. (Although the GPUs are all identical here :shrug:).

We've recently made it easier to increase the OptiX debug level. If you also bump the logging level, you might get some more information from OptiX as to why it's failing...
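For the logging half of that suggestion, a minimal sketch might look like the following. It assumes Dr.Jit's Python bindings expose `set_log_level` and the `LogLevel` enum; the helper name `enable_verbose_drjit_logging` is made up for illustration. The thread does not name the setting that controls the OptiX debug level, so that part is not shown.

```python
def enable_verbose_drjit_logging():
    """Raise Dr.Jit's log level to Debug.

    Returns False when Dr.Jit is not installed, so the script still runs
    on machines without it. (Hypothetical helper, not part of Dr.Jit.)
    """
    try:
        import drjit as dr
    except ImportError:
        return False
    # More verbose output from the JIT backends, including OptiX-related calls.
    dr.set_log_level(dr.LogLevel.Debug)
    return True


# Call this before rendering so that initialization messages are captured.
verbose = enable_verbose_drjit_logging()
```

Calling it at the very top of the render script gives the best chance of seeing messages emitted while OptiX initializes.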

njroussel, Sep 06 '23 15:09

Thanks, I'll try that.

sagesimhon, Sep 08 '23 02:09

I ran into this issue on a very similar multi-GPU setup. I was able to work around it by explicitly setting the OPTIX_CACHE_PATH environment variable to /tmp. I don't think the issue is related to there being multiple GPUs; it might be that the environment on these systems is more locked down than a regular Linux install.
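A sketch of this workaround from Python, under a few assumptions: the variable has to be set before Mitsuba or Dr.Jit initializes OptiX, and the helper name `use_private_optix_cache` is hypothetical. Unlike the workaround described above, which points the cache at /tmp itself, this variant creates a fresh subdirectory per process inside the system temp directory (/tmp on most Linux installs), so concurrent jobs do not share one cache database. Whether that separation is actually needed is not established in this thread.

```python
import os
import tempfile


def use_private_optix_cache(base_dir=None):
    """Point OptiX at a fresh, writable cache directory for this process.

    Hypothetical helper. Call it before the first CUDA/OptiX
    initialization, i.e. before selecting a cuda_* Mitsuba variant.
    """
    if base_dir is None:
        base_dir = tempfile.gettempdir()  # usually /tmp on Linux
    # mkdtemp creates a new directory accessible only by the current user.
    cache_dir = tempfile.mkdtemp(prefix="optix_cache_", dir=base_dir)
    os.environ["OPTIX_CACHE_PATH"] = cache_dir
    return cache_dir


cache_dir = use_private_optix_cache()

# Only now import Mitsuba and select a CUDA variant, for example:
# import mitsuba as mi
# mi.set_variant("cuda_ad_rgb")
```

The trade-off is that a fresh directory starts with an empty cache, so every process recompiles its kernels once. Exporting `OPTIX_CACHE_PATH=/tmp` in the launch script, as described above, keeps the cache across runs.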

dvicini, Jan 31 '24 08:01

Closing this issue due to inactivity.

If anyone has more to contribute to this discussion, please feel free to still comment on the current issue.

njroussel, Feb 19 '24 09:02