Fail train_full_pipeline after multiple hours

Open jongeorge1999 opened this issue 1 year ago • 2 comments

I cannot get train_full_pipeline to run to completion. After several hours it fails during mesh texturing with the following error stack:

Loading Vanilla 3DGS model config output/vanilla_gs/lh_5fps/...
Found image extension .png
Vanilla 3DGS Loaded.
211 training images detected.
The model has been trained for 7000 steps.
0.854081 M gaussians detected.
Binding radiance cloud to surface mesh...
Building UV map done.
Traceback (most recent call last):
  File "/root/miniconda3/envs/sugar/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/root/miniconda3/envs/sugar/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/c/users/george/desktop/sugar/train.py", line 197, in <module>
    refined_mesh_path = extract_mesh_and_texture_from_refined_sugar(refined_mesh_args)
  File "/mnt/c/users/george/desktop/sugar/sugar_extractors/refined_mesh.py", line 195, in extract_mesh_and_texture_from_refined_sugar
    textured_mesh = compute_textured_mesh_for_sugar_mesh(
  File "/root/miniconda3/envs/sugar/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/c/users/george/desktop/sugar/sugar_extractors/texture.py", line 104, in compute_textured_mesh_for_sugar_mesh
    rasterizer = MeshRasterizer(
  File "/mnt/c/users/george/desktop/sugar/sugar_utils/mesh_rasterization.py", line 102, in __init__
    self.gl_context = dr.RasterizeGLContext()
  File "/root/miniconda3/envs/sugar/lib/python3.9/site-packages/nvdiffrast/torch/ops.py", line 228, in __init__
    self.cpp_wrapper = _get_plugin(gl=True).RasterizeGLStateWrapper(output_db, mode == 'automatic', cuda_device_idx)
  File "/root/miniconda3/envs/sugar/lib/python3.9/site-packages/nvdiffrast/torch/ops.py", line 125, in _get_plugin
    torch.utils.cpp_extension.load(name=plugin_name, sources=source_paths, extra_cflags=common_opts+cc_opts, extra_cuda_cflags=common_opts+['-lineinfo'], extra_ldflags=ldflags, with_cuda=True, verbose=False)
  File "/root/miniconda3/envs/sugar/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/root/miniconda3/envs/sugar/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/root/miniconda3/envs/sugar/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/root/miniconda3/envs/sugar/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'nvdiffrast_plugin_gl': [1/2] c++ -MMD -MF common.o.d -DTORCH_EXTENSION_NAME=nvdiffrast_plugin_gl -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/envs/sugar/lib/python3.9/site-packages/torch/include -isystem /root/miniconda3/envs/sugar/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/envs/sugar/lib/python3.9/site-packages/torch/include/TH -isystem /root/miniconda3/envs/sugar/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/miniconda3/envs/sugar/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -DNVDR_TORCH -c /root/miniconda3/envs/sugar/lib/python3.9/site-packages/nvdiffrast/common/common.cpp -o common.o
FAILED: common.o
c++ -MMD -MF common.o.d -DTORCH_EXTENSION_NAME=nvdiffrast_plugin_gl -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/envs/sugar/lib/python3.9/site-packages/torch/include -isystem /root/miniconda3/envs/sugar/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/envs/sugar/lib/python3.9/site-packages/torch/include/TH -isystem /root/miniconda3/envs/sugar/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/miniconda3/envs/sugar/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -DNVDR_TORCH -c /root/miniconda3/envs/sugar/lib/python3.9/site-packages/nvdiffrast/common/common.cpp -o common.o
In file included from /usr/local/cuda/include/cuda_runtime.h:83,
                 from /root/miniconda3/envs/sugar/lib/python3.9/site-packages/nvdiffrast/common/common.cpp:9:
/usr/local/cuda/include/crt/host_config.h:1:1: error: stray ‘\’ in program
    1 | \/*
      | ^
ninja: build stopped: subcommand failed.

Can anybody read into this or give me advice on how to proceed?

I'm running an RTX 4080 on Windows 11 through WSL2.

jongeorge1999, Nov 30 '24 20:11
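
The compiler error in the log above points at line 1, column 1 of /usr/local/cuda/include/crt/host_config.h, where the file apparently begins with `\/*` instead of `/*`. That suggests the header itself may have been altered or corrupted, rather than nvdiffrast being at fault. The following is a minimal diagnostic sketch in Python, not a confirmed fix: the helper name `first_line_looks_valid` is hypothetical, and the check is only a heuristic that assumes the header should open with a comment or a preprocessor directive.

```python
from pathlib import Path

# Path copied from the compiler error in the log; adjust it for your CUDA install.
HEADER = Path("/usr/local/cuda/include/crt/host_config.h")


def first_line_looks_valid(text: str) -> bool:
    """Heuristic: the first non-blank line should open a comment or a directive.

    This is an assumption about how the header normally begins, not a full
    C/C++ syntax check.
    """
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped.startswith(("/*", "//", "#"))
    return False  # an empty file is not a valid header either


if __name__ == "__main__":
    if not HEADER.exists():
        print(f"{HEADER} not found; check where the CUDA toolkit is installed")
    else:
        text = HEADER.read_text(errors="replace")
        lines = text.splitlines()
        first = lines[0] if lines else ""
        print(f"first line: {first!r}")
        if first_line_looks_valid(text):
            print("header start looks normal")
        else:
            print("header start looks corrupted; consider reinstalling the CUDA toolkit headers")
```

If the script reports a corrupted start, comparing the file against a fresh copy from the same CUDA toolkit version would show whether only that leading character is wrong.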

@jongeorge1999 @Anttwo Hi, I ran into the same problem. At first I got an error saying that ninja was missing (strangely, the README does not list ninja as a requirement), but after installing ninja I hit the same error as you, which has caused me a lot of trouble. Have you managed to solve it?

qiuqingheng, Mar 02 '25 15:03

Same here, but I have a workaround. The problem is associated with nvdiffrast. I have no idea why it is not working, but the workaround is to uninstall it: pip uninstall nvdiffrast. Nvdiffrast is supposed to make mesh creation faster, but it is not required. The pipeline works without it.

Coremar2, Mar 20 '25 17:03
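
For anyone trying the workaround above, it may help to confirm that nvdiffrast is really gone from the active environment (here the `sugar` conda env) before rerunning the pipeline. This is a small sketch using only the Python standard library. The helper name `is_installed` is hypothetical, and whether SuGaR falls back cleanly without nvdiffrast is taken from the comment above, not verified here.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Run this with the same interpreter that runs the SuGaR pipeline.
    if is_installed("nvdiffrast"):
        print("nvdiffrast is still importable; the uninstall did not affect this environment")
    else:
        print("nvdiffrast is not installed in this environment")
```

Run it with the same Python interpreter that launches train_full_pipeline, since pip may otherwise have modified a different environment.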