Resource exhausted when running test example
After fixing the first crash (by creating the missing summaries folder), I get a new crash (still when running python run_nerf.py --config config_fern.txt):
2020-11-04 01:20:46.313269: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at concat_op.cc:153 : Resource exhausted: OOM when allocating tensor with shape[4194304,90] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "D:/third-part/nerf/run_nerf.py", line 928, in <module>
train()
File "D:/third-part/nerf/run_nerf.py", line 893, in train
**render_kwargs_test)
File "D:/third-part/nerf/run_nerf.py", line 328, in render
all_ret = batchify_rays(rays, chunk, **kwargs)
File "D:/third-part/nerf/run_nerf.py", line 250, in batchify_rays
ret = render_rays(rays_flat[i:i+chunk], **kwargs)
File "D:/third-part/nerf/run_nerf.py", line 227, in render_rays
raw = network_query_fn(pts, viewdirs, run_fn)
File "D:/third-part/nerf/run_nerf.py", line 410, in network_query_fn
netchunk=args.netchunk)
File "D:/third-part/nerf/run_nerf.py", line 40, in run_network
embedded = tf.concat([embedded, embedded_dirs], -1)
File "D:\ProgramData\Miniconda3\envs\nerf\lib\site-packages\tensorflow_core\python\util\dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "D:\ProgramData\Miniconda3\envs\nerf\lib\site-packages\tensorflow_core\python\ops\array_ops.py", line 1420, in concat
return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
File "D:\ProgramData\Miniconda3\envs\nerf\lib\site-packages\tensorflow_core\python\ops\gen_array_ops.py", line 1249, in concat_v2
_six.raise_from(_core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4194304,90] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:ConcatV2] name: concat
It looks like your GPU is running out of memory. The code config is for a V100, which has 16 GB of memory. If you are running on a card with less, try decreasing the batch size (N_rand in the config file).
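For example, a minimal sketch of that change in config_fern.txt (assuming your copy still has the stock value of 1024, as mentioned later in this thread; halve it again if the OOM persists):

N_rand = 512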
I decreased N_rand to 300 but I'm still getting the same error message. The shape of the tensor that the script attempts to create is still shape[4194304,90], so it doesn't seem to have had any effect.
Ahh, it looks like your entire dataset doesn't fit on the GPU. Try adding --no_batching; this will load one image at a time during training. The results will be worse, but it requires less memory.
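For instance (a usage sketch; --no_batching is a boolean flag, so it takes no value):

python run_nerf.py --config config_fern.txt --no_batching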
It doesn't do the trick. When setting N_rand to 1 and using the --no_batching option, I'm still getting
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4194304,90] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:ConcatV2] name: concat
I'm using a somewhat old GPU, though: a GTX 1060 with 6 GB of memory.
However, if I also decrease N_samples and N_importance to 4, I get a different error:
Traceback (most recent call last):
File "D:/third-part/nerf/run_nerf.py", line 928, in <module>
train()
File "D:/third-part/nerf/run_nerf.py", line 901, in train
imageio.imwrite(os.path.join(testimgdir, '{:06d}.png'.format(i)), to8b(rgb))
File "D:\ProgramData\Miniconda3\envs\nerf\lib\site-packages\imageio\core\functions.py", line 303, in imwrite
writer = get_writer(uri, format, "i", **kwargs)
File "D:\ProgramData\Miniconda3\envs\nerf\lib\site-packages\imageio\core\functions.py", line 217, in get_writer
request = Request(uri, "w" + mode, **kwargs)
File "D:\ProgramData\Miniconda3\envs\nerf\lib\site-packages\imageio\core\request.py", line 124, in __init__
self._parse_uri(uri)
File "D:\ProgramData\Miniconda3\envs\nerf\lib\site-packages\imageio\core\request.py", line 265, in _parse_uri
raise FileNotFoundError("The directory %r does not exist" % dn)
FileNotFoundError: The directory 'D:\\third-part\\nerf\\logs\\fern_test\\tboard_val_imgs' does not exist
Each time, though, the script seems to run a number of iterations before the error occurs, as I also get the following output first:
fern_test 1 11.88098 0.1640698 1
iter time 0.73457
fern_test 2 22.033655 0.020605285 2
iter time 0.09175
fern_test 3 22.488821 0.1400678 3
iter time 0.08976
fern_test 4 30.131605 0.08863868 4
iter time 0.08777
fern_test 5 6.055036 0.45202515 5
iter time 0.09375
fern_test 6 8.739907 0.22000109 6
iter time 0.07879
fern_test 7 17.963327 0.052584127 7
iter time 0.07480
fern_test 8 22.69207 0.010006983 8
iter time 0.07879
fern_test 9 21.909454 0.022964481 9
iter time 0.08178
fern_test 100 15.195176 0.06504251 100
iter time 0.07651
fern_test 200 26.161882 0.36329496 200
iter time 0.07482
fern_test 300 25.880293 0.006590659 300
iter time 0.07441
fern_test 400 19.069527 0.018049838 400
iter time 0.08732
fern_test 500 24.096317 0.008710936 500
iter time 0.07558
I created the directory tboard_val_imgs manually, and now the script seems to run and populate the folder with some images. The ResourceExhaustedError seems to occur only when the script tries to generate an output image, as it runs to fern_test 500 before crashing. It turned out to be sufficient to just set N_samples and N_importance to 50 (I haven't checked the exact limits) and leave the other settings unchanged (and not use the --no_batching option). However, I'm using a batch size of 300 now, since on my GPU each batch seems to take only about a third of the time compared to when it is 1024.
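As an aside, a more robust fix than creating tboard_val_imgs by hand is to have the script create it right before the write that failed in the traceback above (a minimal sketch; testimgdir, i, to8b and rgb are the existing names from run_nerf.py's train(), as shown in the traceback):

import os
# Make sure the validation-image folder exists before writing,
# so imageio.imwrite does not raise FileNotFoundError.
os.makedirs(testimgdir, exist_ok=True)
imageio.imwrite(os.path.join(testimgdir, '{:06d}.png'.format(i)), to8b(rgb))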
Ok, if it is breaking when rendering the test example, decrease the value for --chunk.
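For example (a usage sketch; assuming the upstream default of --chunk = 1024*32 = 32768 rays processed in parallel, a quarter or an eighth of that is a reasonable starting point on a 6 GB card):

python run_nerf.py --config config_fern.txt --chunk 8192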
Reducing --chunk instead of decreasing N_samples and N_importance (by the same proportional amount) indeed seems to work too. Is that perhaps preferable?
Yes, reducing chunk is preferred. It does not change the output of the training (unlike N_samples and N_importance); it just makes the evaluations run slower.
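For intuition, assuming the upstream defaults (chunk = 1024*32 = 32768 rays, N_samples + N_importance = 64 + 64 = 128 samples per ray, and the default positional-encoding sizes of 63 + 27 = 90 features), the OOM tensor above has exactly 32768 x 128 = 4,194,304 rows and 90 columns, which is why changing N_rand never affected its shape: only --chunk, N_samples and N_importance do. The sketch below (not code from the repository) illustrates why chunking does not change the result: splitting the flat ray batch, evaluating each piece, and concatenating is equivalent to one big pass; it only bounds peak memory.

import numpy as np

def render_in_chunks(render_fn, rays_flat, chunk):
    # Evaluate render_fn on at most `chunk` rays at a time and stitch the
    # results back together; the output is identical to render_fn(rays_flat).
    outputs = [render_fn(rays_flat[i:i + chunk])
               for i in range(0, rays_flat.shape[0], chunk)]
    return np.concatenate(outputs, 0)

# Toy stand-in for the network: any per-ray function behaves the same way.
render_fn = lambda rays: rays * 2.0

rays = np.random.rand(100_000, 8).astype(np.float32)
assert np.allclose(render_fn(rays), render_in_chunks(render_fn, rays, 8192))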