
Hangs after computing MSE and PSNR

Open hzhshok opened this issue 2 years ago • 8 comments

Hello, recently I put together my own test data to check which resolution and parameters are practical on my desktop. I used a resolution of 1264x1264, and the last run hung, even after 24 hours; it held a roughly constant ~7.4 GB of GPU memory the whole time. So, is my data too rough (MSE ≈ 0.2591, final img_loss = 1.109134), or is there some other reason?

Could someone share a suggestion on how to handle my high-resolution data, e.g. the relationship between GPU memory and the maximum usable resolution and scale for an 11 GB GPU?

In addition: my images are actually very high resolution (3000x4000), but for my GPU (3080, 11 GB) I scaled them down to 1264x1264 after a few attempts.

  • Hardware: desktop RTX 3080 (11 GB), CUDA 11.3.
  • Software: Ubuntu 20.04.
  • GPU memory usage during the hang: constant, approximately 7.x GB.
  • Console output and strace log: see below.

The console output:

    python3 train.py --config ./configs/nerf_handong.json
    Config / Flags:
    config ./configs/nerf_handong.json
    iter 2000
    batch 1
    spp 1
    layers 4
    train_res [1264, 1264]
    display_res [1264, 1264]
    texture_res [2048, 2048]
    display_interval 0
    save_interval 100
    learning_rate [0.03, 0.01]
    min_roughness 0.08
    custom_mip False
    random_textures True
    background white
    loss logl1
    out_dir out/nerf_handong
    ref_mesh data/nerf_synthetic/handong
    base_mesh None
    validate True
    mtl_override None
    dmtet_grid 128
    mesh_scale 2.75
    env_scale 1.0
    envmap None
    display [{'latlong': True}, {'bsdf': 'kd'}, {'bsdf': 'ks'}, {'bsdf': 'normal'}]
    camera_space_light False
    lock_light False
    lock_pos False
    sdf_regularizer 0.2
    laplace relative
    laplace_scale 3000
    pre_load True
    kd_min [0.0, 0.0, 0.0, 0.0]
    kd_max [1.0, 1.0, 1.0, 1.0]
    ks_min [0, 0.1, 0.0]
    ks_max [1.0, 1.0, 1.0]
    nrm_min [-1.0, -1.0, 0.0]
    nrm_max [1.0, 1.0, 1.0]
    cam_near_far [0.1, 1000.0]
    learn_light True
    local_rank 0
    multi_gpu False

    DatasetNERF: 28 images with shape [1264, 1264]
    DatasetNERF: 28 images with shape [1264, 1264]
    Encoder output: 32 dims
    Using /home/xx/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
    Detected CUDA files, patching ldflags
    Emitting ninja build file /home/xx/.cache/torch_extensions/py38_cu113/renderutils_plugin/build.ninja...
    Building extension module renderutils_plugin...
    Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
    ninja: no work to do.
    Loading extension module renderutils_plugin...
    iter=    0, img_loss=1.166577, reg_loss=0.334082, lr=0.02999, time=468.3 ms, rem=15.61 m
    ...
    iter= 1990, img_loss=0.936701, reg_loss=0.014961, lr=0.01199, time=484.8 ms, rem=4.85 s
    iter= 2000, img_loss=1.109134, reg_loss=0.014806, lr=0.01194, time=503.9 ms, rem=0.00 s
    Running validation
    MSE, PSNR
    0.25911513, 6.480
    ^C^CKilled

The strace log:

    sched_yield() = 0
    sched_yield() = 0
    sched_yield() = 0
    ...
    sched_yield() = 0
    sched_yield() = 0

Thanks for your contribution!

Regards

hzhshok avatar Apr 02 '22 02:04 hzhshok

After the first pass, we run xatlas to create a UV parameterization on the triangle mesh. If the first pass failed to create a reasonable mesh, this step can take quite some time or even fail. How does the mesh look in your case at the end of the first pass?

For memory consumption, you can log the usage with nvidia-smi --query-gpu=memory.used --format=csv -lms 100 while training runs to get a feel for it. Memory is a function of image resolution, batch size, and whether depth peeling is enabled. We ran the results in the paper on GPUs with 32+ GB of memory, but it should run on lower-spec GPUs if you decrease the rendering resolution and/or batch size.
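If it is easier to inspect afterwards, the same query can be wrapped in a small polling script that writes timestamped samples to a CSV. This is only a sketch, not part of the repo: it assumes nvidia-smi is on the PATH, and the interval and output file name are arbitrary.

    # poll_gpu_mem.py -- log GPU memory usage to a CSV while training runs in another shell.
    # Sketch only: assumes nvidia-smi is on the PATH; interval and file name are arbitrary.
    import datetime
    import subprocess
    import time

    INTERVAL_S = 0.5
    OUTFILE = "gpu_mem_log.csv"

    with open(OUTFILE, "w") as f:
        f.write("timestamp,memory_used\n")
        while True:
            used = subprocess.run(
                ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
                capture_output=True, text=True, check=True,
            ).stdout.strip()
            f.write(f"{datetime.datetime.now().isoformat()},{used}\n")
            f.flush()
            time.sleep(INTERVAL_S)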

jmunkberg avatar Apr 04 '22 05:04 jmunkberg

Thanks @jmunkberg!

The reason I raised this issue is that plain 'nvidia-smi' (without parameters) showed only 7.4 GB in use out of the 11 GB available, so I suspected it is not a memory issue.

Sorry, I did check the related output, including the mesh, at one point, but I no longer remember what it looked like, and that host has since been destroyed, so I can only continue tracking this after setting up a new environment.

In addition, every time train.py failed, the GPU was left in a hung state (nvidia-smi was blocked and gave no response at all). I understand why the training allocates GPU memory so aggressively that other processes can end up starved, but:

Does the team have any plans to optimize the memory allocation strategy of this feature?

Regards

hzhshok avatar Apr 10 '22 07:04 hzhshok

Hello @hzhshok ,

Looking at the error metrics:

MSE, PSNR 0.25911513, 6.480

Those are extremely large errors, so I assume the first pass did not train properly. What do the images look like in the out/nerf_handong folder (or the folder for your current experiment)? If the reconstruction succeeded, I would expect a PSNR of 25 dB or higher. If the reconstruction fails, it is very hard to create a uv-parameterization (it is hard to uv-map a triangle soup), and xatlas may fail or hang.
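For reference, with pixel values in [0, 1], PSNR follows directly from the MSE. A quick sketch of the conversion (the repo's validation code may average per image before converting, so treat these numbers as approximate):

    import math

    def psnr_from_mse(mse, max_val=1.0):
        # PSNR = 10 * log10(max_val^2 / MSE); higher is better.
        return 10.0 * math.log10(max_val * max_val / mse)

    print(psnr_from_mse(0.259))  # ~5.9 dB: essentially a failed reconstruction
    print(psnr_from_mse(0.003))  # ~25 dB: roughly the level expected on success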

I suspect something else is wrong already in the first pass. A few things that can affect quality:

  • Is the lighting setup constant in all training data?
  • Do you have high quality foreground segmentation masks?
  • Are you sure that the poses are correct and that the pose indices and corresponding images match?
  • Do the training images contain substantial motion or defocus blur?

Also, just to verify, is the example from the readme python train.py --config configs/bob.json working without issues on your setup?

jmunkberg avatar Apr 13 '22 11:04 jmunkberg

Hello, thanks @jmunkberg! Yes, something is probably wrong in the first pass. I will check how training behaves with properly labeled (masked) images, using my current images as the baseline.

  • Is the lighting setup constant in all training data? -- No, some of the images have strong light, because they were captured outdoors.
  • Do you have high quality foreground segmentation masks? -- Do you mean images labeled against a clean background, like the examples (the chair, etc.)? This time I just used wild outdoor images without labeling them, to check whether this can work with non-constant lighting and relatively uniform colors in the images; I will check the effect with high-quality foreground masks (see the sketch below). In addition, I mainly wanted to check the quality of texturing a mesh obtained from a traditional SfM tool.
  • Are you sure that the poses are correct and that the pose indices and corresponding images match? -- Yes, they should be; COLMAP produced a result, although the quality was not great.
  • Do the training images contain substantial motion or defocus blur? -- Not much blur; the images were resized from the original 3000x4000 down to 1264x1264, which I think should have some impact, but...
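As a quick way to look at the mask question above: the NeRF-synthetic-style loader reads RGBA images and, as far as I understand, uses the alpha channel as the foreground mask, so a small script can flag frames that lack an alpha channel or have implausible foreground coverage. This is only a sketch; the directory path and thresholds are placeholders and need adjusting to the actual dataset layout.

    # check_masks.py -- sketch: flag training PNGs that lack an alpha mask or have an
    # implausible foreground coverage. IMAGE_DIR and the thresholds are placeholders.
    import glob
    import os

    import numpy as np
    from PIL import Image

    IMAGE_DIR = "data/nerf_synthetic/building/train"

    for path in sorted(glob.glob(os.path.join(IMAGE_DIR, "*.png"))):
        img = np.asarray(Image.open(path))
        if img.ndim != 3 or img.shape[2] != 4:
            print(f"{path}: no alpha channel, shape={img.shape}")
            continue
        coverage = (img[..., 3] > 127).mean()  # fraction of pixels marked foreground
        if coverage < 0.01 or coverage > 0.99:
            print(f"{path}: suspicious mask, foreground coverage={coverage:.3f}")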

Regards

hzhshok avatar Apr 14 '22 13:04 hzhshok

Hello, I have now put together another dataset (a building) to try this feature, and labeled the samples to remove the background. It seems to pass the first training/mesh pass, but it fails in the second pass, so please give me a suggestion, thanks!

The questions:

a. Why does it break off during the run? See the console log below; the failure is:

    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    RuntimeError: Function texture2d_mipBackward returned an invalid gradient at index 0 - got [1, 4, 4, 3] but expected shape compatible with [1, 5, 5, 3]

Maybe the MSE/PSNR are not good enough; if so, could you please suggest how to improve the training images, or are they simply not suitable?

b. Why does the result have serious blur? See the intermediate pictures and the two original image samples.

c. How can the final result be improved? Better input training images, or something else?

[Intermediate renders from pass 1: img_dmtet_pass1_000000, img_dmtet_pass1_000041, img_dmtet_pass1_000045, img_dmtet_pass1_000079, img_dmtet_pass1_000080]

[Original image samples: IMG_4287, IMG_4332]

Hardware: RTX 3080 (24 GB), Windows 11.

Samples (see the two attached pictures for examples):

a. 50 images.
b. Resolution: 2456 (width) x 1638 (height).
c. Transform parameters: --aabb_scale 2 for COLMAP.

The config:

{ "ref_mesh": "data/nerf_synthetic/building", "random_textures": true, "iter": 8000, "save_interval": 100, "texture_res": [5120,5120], "train_res": [1638, 2456], "batch": 1, "learning_rate": [0.03, 0.0001], "ks_min" : [0, 0.08, 0.0], "dmtet_grid" : 128, "mesh_scale" : 5, "laplace_scale" : 3000, "display": [{"latlong" : true}, {"bsdf" : "kd"}, {"bsdf" : "ks"}, {"bsdf" : "normal"}], "layers" : 4, "background" : "white", "out_dir": "nerf_building" }

The console log:

    iter= 8000, img_loss=0.061592, reg_loss=0.016066, lr=0.00075, time=499.0 ms, rem=0.00 s
    Running validation
    MSE, PSNR
    0.02140407, 17.038
    Base mesh has 214359 triangles and 105082 vertices.
    Writing mesh: out/nerf_building\dmtet_mesh/mesh.obj
        writing 105082 vertices
        writing 224020 texcoords
        writing 105082 normals
        writing 214359 faces
    Writing material: out/nerf_building\dmtet_mesh/mesh.mtl
    Done exporting mesh
    Traceback (most recent call last):
      File "D:\zhansheng\proj\windows\nvdiffrec\train.py", line 620, in <module>
        geometry, mat = optimize_mesh(glctx, geometry, base_mesh.material, lgt, dataset_train, dataset_validate, FLAGS,
      File "D:\zhansheng\proj\windows\nvdiffrec\train.py", line 428, in optimize_mesh
        total_loss.backward()
      File "C:\Users\jinshui\anaconda3\envs\dmodel\lib\site-packages\torch\_tensor.py", line 363, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "C:\Users\jinshui\anaconda3\envs\dmodel\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
        Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    RuntimeError: Function texture2d_mipBackward returned an invalid gradient at index 0 - got [1, 4, 4, 3] but expected shape compatible with [1, 5, 5, 3]

Regards

hzhshok avatar May 10 '22 01:05 hzhshok

@hzhshok I ran into the same error about "texture2d_mipBackward returned an invalid gradient"; in my case it is "got [1,2,2,3] but expected [1,3,3,3]". Did you solve this issue, or did you find its cause?

@jmunkberg thanks for your great work, by the way. What would you suggest could cause this issue? Bad segmentation, or something else?

ZirongChan avatar Aug 03 '22 11:08 ZirongChan

That is an error from nvdiffrast. I would try to use power-of-two resolutions on the textures and training, e.g.,

    "texture_res": [ 1024, 1024 ],
    "train_res": [1024, 1024],

In case the texture2d_mipBackward is not stable for all (non-pow2) resolutions.
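If it helps, a tiny helper to snap a resolution down to the nearest power of two before editing the config. This is just a sketch; rounding up instead may be preferable if there is memory headroom.

    def floor_pow2(n: int) -> int:
        # Largest power of two <= n, e.g. 1638 -> 1024 and 2456 -> 2048.
        return 1 << (int(n).bit_length() - 1)

    print([floor_pow2(v) for v in (1638, 2456)])  # [1024, 2048]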

jmunkberg avatar Aug 03 '22 11:08 jmunkberg

@ZirongChan, I am not sure whether it is caused by memory. I used images close to 2k x 2k, which cost a lot of memory on my single 24 GB GPU, so I used the strategy of splitting images into smaller blocks that @jmunkberg described in another issue; at least the error did not happen again with that approach.
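For what it's worth, one possible way to do that kind of block splitting, purely as an illustration: the tile size and file names below are placeholders, this is not necessarily the exact strategy referenced above, and note that cropping training views also means the camera intrinsics have to be adjusted per crop, which this sketch does not do.

    # Sketch: cut an image into fixed-size square tiles, dropping any remainder at
    # the right/bottom edges. Tile size and file names are placeholders.
    from PIL import Image

    def split_into_tiles(path, tile=1024):
        img = Image.open(path)
        w, h = img.size
        tiles = []
        for top in range(0, h - tile + 1, tile):
            for left in range(0, w - tile + 1, tile):
                tiles.append(img.crop((left, top, left + tile, top + tile)))
        return tiles

    for i, t in enumerate(split_into_tiles("IMG_4287.jpg")):  # hypothetical input file
        t.save(f"IMG_4287_tile{i:02d}.png")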

Regards

hzhshok avatar Aug 10 '22 10:08 hzhshok