OpenSubdiv

Building with high parallelism and CUDA support results in sporadic build failures

Open amarshall opened this issue 2 years ago • 5 comments

With 48 build threads, 19 of 50 sequential builds failed (a 38% failure rate). I am building via the nixpkgs derivation, but I don’t see any reason why the failure would be specific to that build environment. Building without CUDA produced no failures in 50 runs.
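
For anyone who wants to repeat this kind of measurement, the following is a minimal sketch rather than the exact procedure used for the numbers above. `failure_rate` is a hypothetical helper that runs an arbitrary command a fixed number of times and counts non-zero exit codes; the demo command is a stand-in and would need to be replaced by whatever performs one clean build.

```python
import subprocess
import sys


def failure_rate(cmd, runs):
    """Run `cmd` sequentially `runs` times; return (failures, fraction failed)."""
    failures = 0
    for _ in range(runs):
        # A non-zero exit status is counted as a failed build.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures, failures / runs


# Stand-in command that always succeeds (an assumption for illustration);
# substitute the real clean-build command here.
demo_cmd = [sys.executable, "-c", "raise SystemExit(0)"]
failed, rate = failure_rate(demo_cmd, runs=3)
print(f"{failed} of 3 builds failed ({rate:.0%})")
```

Each run should start from a clean build directory, otherwise a failed run can leave partial outputs behind that affect the next one.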

My guess is that there’s an implicit dependency somewhere. I spent a little time trying to find it but did not succeed (I’m not very proficient with CMake).

I have seen at least two different failures:

CMake Error at /nix/store/0dv0ylafnx7cdajyv9ahbpqrniblixq1-cmake-3.26.4/share/cmake-3.26/Modules/FindCUDA/make2cmake.cmake:48 (file):
  file failed to open for reading (No such file or directory):

    /build/source/build/opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/osd_static_gpu_generated_cudaKernel.cu.o.NVCC-depend


CMake Error at osd_static_gpu_generated_cudaKernel.cu.o.Release.cmake:236 (message):
  Error generating
  /build/source/build/opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/./osd_static_gpu_generated_cudaKernel.cu.o


make[2]: *** [opensubdiv/CMakeFiles/osd_dynamic_gpu.dir/build.make:77: opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/osd_static_gpu_generated_cudaKernel.cu.o] Error 1

and

Error copying file (if different) from "/build/source/build/opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/osd_static_gpu_generated_cudaKernel.cu.o.depend.tmp" to "/build/source/build/opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/osd_static_gpu_generated_cudaKernel.cu.o.depend".
CMake Error at osd_static_gpu_generated_cudaKernel.cu.o.Release.cmake:246 (message):
  Error generating
  /build/source/build/opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/./osd_static_gpu_generated_cudaKernel.cu.o


make[2]: *** [opensubdiv/CMakeFiles/osd_dynamic_gpu.dir/build.make:77: opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/osd_static_gpu_generated_cudaKernel.cu.o] Error 1
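
One detail appears in both logs: the failing rule is reported from `osd_dynamic_gpu.dir/build.make`, while the object it builds lives under `osd_static_gpu.dir`. That would be consistent with two targets each carrying a rule for the same generated file, which `make -j` could then run concurrently. This is a hypothesis, not a confirmed diagnosis. A minimal sketch for spotting such lines in a saved build log follows; `cross_target_errors` and the regular expressions are illustrative assumptions based only on the log format shown above.

```python
import re

# Matches GNU make error lines of the form seen in the logs above:
#   make[2]: *** [<dir>/CMakeFiles/<owner>.dir/build.make:<n>: <output>] Error <k>
ERROR_LINE = re.compile(
    r"\*\*\* \[\S*CMakeFiles/(?P<owner>[^/]+)\.dir/build\.make:\d+: "
    r"(?P<output>\S+)\] Error \d+"
)
# Extracts the target directory that the output file belongs to.
OUTPUT_TARGET = re.compile(r"CMakeFiles/(?P<target>[^/]+)\.dir/")


def cross_target_errors(log_text):
    """Return (owner, output_target, output) for failed rules whose output
    sits in a different target's directory than the build.make that ran it."""
    hits = []
    for match in ERROR_LINE.finditer(log_text):
        out = OUTPUT_TARGET.search(match.group("output"))
        if out and out.group("target") != match.group("owner"):
            hits.append(
                (match.group("owner"), out.group("target"), match.group("output"))
            )
    return hits


# The error line from the logs above, used as sample input.
sample = (
    "make[2]: *** [opensubdiv/CMakeFiles/osd_dynamic_gpu.dir/build.make:77: "
    "opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/"
    "osd_static_gpu_generated_cudaKernel.cu.o] Error 1"
)
for owner, target, output in cross_target_errors(sample):
    print(f"{owner} failed while building a file under {target}: {output}")
```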

amarshall, Jul 24 '23 18:07

Filed as internal issue #OSD-426

davidgyu, Jul 25 '23 15:07

Interesting. We haven't seen that before. Can you tell us more about your system configuration: OS, Compiler, GPU, Driver version, CUDA version?

davidgyu, Jul 25 '23 16:07

Hi! Thanks for the reply.

  • OS is NixOS @ https://github.com/NixOS/nixpkgs/commit/9ca785644d067445a4aa749902b29ccef61f7476 (Linux Kernel 6.1)
  • OpenSubdiv src @ v3.5.0
  • GCC 12.3.0 (note that -DCUDA_HOST_COMPILER points to a different GCC, 11.4.0), CMake 3.26.4
  • CUDA toolkit 11.8.0
  • CPU is AMD 3960X (24-core, 48-threads), 192 GB RAM
  • GPU is 3080 Ti with driver 535.86.05 (however I think this should not matter, as I don’t believe the GPU is used during build)
Log output of configure stage + build flags

Note that I have manually wrapped the cmake flags to make them easier to read.

@nix { "action": "setPhase", "phase": "configurePhase" }
configuring
fixing cmake files...
cmake flags:
  -DCMAKE_FIND_USE_SYSTEM_PACKAGE_REGISTRY=OFF
  -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF
  -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON
  -DCMAKE_BUILD_TYPE=Release
  -DBUILD_TESTING=OFF
  -DCMAKE_INSTALL_LOCALEDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/share/locale
  -DCMAKE_INSTALL_LIBEXECDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/libexec
  -DCMAKE_INSTALL_LIBDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/lib
  -DCMAKE_INSTALL_DOCDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/share/doc/OpenSubdiv
  -DCMAKE_INSTALL_INFODIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/share/info
  -DCMAKE_INSTALL_MANDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/share/man
  -DCMAKE_INSTALL_OLDINCLUDEDIR=/nix/store/1np3p9y42nv1m06ywspgqj20r5p41xla-opensubdiv-3.5.0-dev/include
  -DCMAKE_INSTALL_INCLUDEDIR=/nix/store/1np3p9y42nv1m06ywspgqj20r5p41xla-opensubdiv-3.5.0-dev/include
  -DCMAKE_INSTALL_SBINDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/sbin
  -DCMAKE_INSTALL_BINDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/bin
  -DCMAKE_INSTALL_NAME_DIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/lib
  -DCMAKE_POLICY_DEFAULT_CMP0025=NEW
  -DCMAKE_OSX_SYSROOT=
  -DCMAKE_FIND_FRAMEWORK=LAST
  -DCMAKE_STRIP=/nix/store/x7n44lfys59k5ajj9w1fkxw5391cnn5v-gcc-wrapper-12.3.0/bin/strip
  -DCMAKE_RANLIB=/nix/store/x7n44lfys59k5ajj9w1fkxw5391cnn5v-gcc-wrapper-12.3.0/bin/ranlib
  -DCMAKE_AR=/nix/store/x7n44lfys59k5ajj9w1fkxw5391cnn5v-gcc-wrapper-12.3.0/bin/ar
  -DCMAKE_C_COMPILER=gcc
  -DCMAKE_CXX_COMPILER=g++
  -DCMAKE_INSTALL_PREFIX=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0
  -DNO_TUTORIALS=1
  -DNO_REGRESSION=1
  -DNO_EXAMPLES=1
  -DNO_METAL=1
  -DGLEW_INCLUDE_DIR=/nix/store/55n26bd7l2jdxj8fkh688nrv290d3hp8-glew-2.2.0-dev/include
  -DGLEW_LIBRARY=/nix/store/55n26bd7l2jdxj8fkh688nrv290d3hp8-glew-2.2.0-dev/lib
  -DOSD_CUDA_NVCC_FLAGS=--gpu-architecture=compute_37
  -DCUDA_HOST_COMPILER=/nix/store/m3lj9k2f39yplgr81pv9j1p13p3mq0pz-gcc-wrapper-11.4.0/bin/cc
  -DNO_OPENCL=1
  -DCUDA_TOOLKIT_ROOT_DIR=/nix/store/vxw61j9ff7d5jdq2cwy1bh4q5j82jvy5-cudatoolkit-11.8.0
  -DCUDA_HOST_COMPILER=/nix/store/m3lj9k2f39yplgr81pv9j1p13p3mq0pz-gcc-wrapper-11.4.0/bin
  -DCMAKE_CUDA_HOST_COMPILER=/nix/store/m3lj9k2f39yplgr81pv9j1p13p3mq0pz-gcc-wrapper-11.4.0/bin
-- The C compiler identification is GNU 12.3.0
-- The CXX compiler identification is GNU 12.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /nix/store/x7n44lfys59k5ajj9w1fkxw5391cnn5v-gcc-wrapper-12.3.0/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /nix/store/x7n44lfys59k5ajj9w1fkxw5391cnn5v-gcc-wrapper-12.3.0/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Compiling OpenSubdiv version v3_5_0
-- Using cmake version 3.26.4
-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5")  
-- Could NOT find TBB (missing: TBB_INCLUDE_DIR TBB_LIBRARIES) (Required is at least version "4.0")
-- Found OpenGL: /nix/store/xibw0p5bj2z3a566mannk3vflb9f5fph-libGL-1.6.0/lib/libOpenGL.so   
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Found CUDA: /nix/store/vxw61j9ff7d5jdq2cwy1bh4q5j82jvy5-cudatoolkit-11.8.0 (found suitable version "11.8", minimum required is "4.0") 
-- Found X11: /nix/store/gz38plw089ri9k2lh7gzhh58ydhb3rv1-xorgproto-2023.2/include   
-- Looking for XOpenDisplay in /nix/store/igp21718s3sa932z7baqnhlc72v0zl0z-libX11-1.8.6/lib/libX11.so;/nix/store/4s3wrg560496dx3qx8gnvvjqz4hc9222-libXext-1.3.5/lib/libXext.so
-- Looking for XOpenDisplay in /nix/store/igp21718s3sa932z7baqnhlc72v0zl0z-libX11-1.8.6/lib/libX11.so;/nix/store/4s3wrg560496dx3qx8gnvvjqz4hc9222-libXext-1.3.5/lib/libXext.so - found
-- Looking for gethostbyname
-- Looking for gethostbyname - found
-- Looking for connect
-- Looking for connect - found
-- Looking for remove
-- Looking for remove - found
-- Looking for shmat
-- Looking for shmat - found
-- Could NOT find GLFW (missing: GLFW_INCLUDE_DIR GLFW_LIBRARIES) (Required is at least version "3.0.0")
-- Could NOT find PTex (missing: PTEX_INCLUDE_DIR PTEX_LIBRARY) (Required is at least version "2.0")
-- Could NOT find ZLIB (missing: ZLIB_LIBRARY ZLIB_INCLUDE_DIR) (Required is at least version "1.2")
-- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE) (Required is at least version "1.8.4")
-- Could NOT find Docutils (missing: RST2HTML_EXECUTABLE DOCUTILS_VERSION) (Required is at least version "0.9")
-- Found Python: /nix/store/9c03r86hcdn43dm3hsgjirifvyzfkhwh-python3-3.10.12/bin/python3.10 (found version "3.10.12") found components: Interpreter 
CMake Warning at CMakeLists.txt:430 (message):
  TBB was not found : support for TBB parallel compute kernels will be
  disabled in Osd.  If your compiler supports TBB directives, please refer to
  the FindTBB.cmake shared module in your cmake installation.


CMake Warning at CMakeLists.txt:619 (message):
  Ptex was not found : the OpenSubdiv Ptex example will not be available.  If
  you do have Ptex installed and see this message, please add your Ptex path
  to FindPTex.cmake in /build/source/cmake or set it through the
  PTEX_LOCATION cmake command line argument or environment variable.


CMake Warning at documentation/CMakeLists.txt:52 (message):
  Doxyen was not found : support for Doxygen automated API documentation is
  disabled.


-- Configuring done (3.6s)
-- Generating done (0.0s)
CMake Warning:
  Manually-specified variables were not used by the project:

    BUILD_TESTING
    CMAKE_EXPORT_NO_PACKAGE_REGISTRY
    CMAKE_POLICY_DEFAULT_CMP0025
    GLEW_LIBRARY


-- Build files have been written to: /build/source/build
cmake: enabled parallel building
cmake: enabled parallel installing
@nix { "action": "setPhase", "phase": "buildPhase" }
building
build flags: -j48 SHELL=/nix/store/a7f7xfp9wyghf44yv6l6fv9dfw492hd3-bash-5.2-p15/bin/bash

(Remainder of logs omitted)

amarshall, Jul 25 '23 17:07

Thanks for the additional information!

davidgyu, Jul 27 '23 19:07

I just hit this failure when building nixpkgs. The build succeeded on retry. Just making it known that the workaround is not a silver bullet.

bonsairobo, Aug 20 '24 03:08