
Resource exhausted when running test example

Open krikru opened this issue 3 years ago • 9 comments

After fixing the first crash (by creating the missing summaries folder), I get a new crash (still when running python run_nerf.py --config config_fern.txt):

2020-11-04 01:20:46.313269: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at concat_op.cc:153 : Resource exhausted: OOM when allocating tensor with shape[4194304,90] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "D:/third-part/nerf/run_nerf.py", line 928, in <module>
    train()
  File "D:/third-part/nerf/run_nerf.py", line 893, in train
    **render_kwargs_test)
  File "D:/third-part/nerf/run_nerf.py", line 328, in render
    all_ret = batchify_rays(rays, chunk, **kwargs)
  File "D:/third-part/nerf/run_nerf.py", line 250, in batchify_rays
    ret = render_rays(rays_flat[i:i+chunk], **kwargs)
  File "D:/third-part/nerf/run_nerf.py", line 227, in render_rays
    raw = network_query_fn(pts, viewdirs, run_fn)
  File "D:/third-part/nerf/run_nerf.py", line 410, in network_query_fn
    netchunk=args.netchunk)
  File "D:/third-part/nerf/run_nerf.py", line 40, in run_network
    embedded = tf.concat([embedded, embedded_dirs], -1)
  File "D:\ProgramData\Miniconda3\envs\nerf\lib\site-packages\tensorflow_core\python\util\dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "D:\ProgramData\Miniconda3\envs\nerf\lib\site-packages\tensorflow_core\python\ops\array_ops.py", line 1420, in concat
    return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
  File "D:\ProgramData\Miniconda3\envs\nerf\lib\site-packages\tensorflow_core\python\ops\gen_array_ops.py", line 1249, in concat_v2
    _six.raise_from(_core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4194304,90] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:ConcatV2] name: concat

krikru · Nov 04 '20

It looks like your GPU is running out of memory. The provided config is for a V100, which has 16 GB of memory. If you are running on a card with less, try decreasing the batch size (N_rand in the config file).

tancik · Nov 05 '20
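
A note on where this lives: N_rand is set in the config file passed with --config. Assuming config_fern.txt uses the repo's usual key = value layout, with defaults of roughly 1024 / 64 / 64 (from memory; check your copy), lowering the batch size looks like:

# excerpt from config_fern.txt
# N_rand is the number of random rays per training batch
N_rand = 512
N_samples = 64
N_importance = 64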

I decreased N_rand to 300 but I'm still getting the same error message. The shape of the tensor that the script attempts to create is still shape[4194304,90], so it doesn't seem to have had any effect.

krikru · Nov 11 '20

Ahh, it looks like your entire dataset doesn't fit on the GPU. Try adding --no_batching; this will load one image at a time during training. The results will be worse, but it requires less memory.

tancik · Nov 11 '20
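
For example, with the command from this thread and the flag added:

python run_nerf.py --config config_fern.txt --no_batching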

It doesn't do the trick. When setting N_rand to 1 and using the --no_batching option, I'm still getting

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4194304,90] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:ConcatV2] name: concat

I'm using a somewhat old GPU, though: a GTX 1060 with 6 GB of memory.

However, if I also decrease N_samples and N_importance to 4, I get a different error:

Traceback (most recent call last):
  File "D:/third-part/nerf/run_nerf.py", line 928, in <module>
    train()
  File "D:/third-part/nerf/run_nerf.py", line 901, in train
    imageio.imwrite(os.path.join(testimgdir, '{:06d}.png'.format(i)), to8b(rgb))
  File "D:\ProgramData\Miniconda3\envs\nerf\lib\site-packages\imageio\core\functions.py", line 303, in imwrite
    writer = get_writer(uri, format, "i", **kwargs)
  File "D:\ProgramData\Miniconda3\envs\nerf\lib\site-packages\imageio\core\functions.py", line 217, in get_writer
    request = Request(uri, "w" + mode, **kwargs)
  File "D:\ProgramData\Miniconda3\envs\nerf\lib\site-packages\imageio\core\request.py", line 124, in __init__
    self._parse_uri(uri)
  File "D:\ProgramData\Miniconda3\envs\nerf\lib\site-packages\imageio\core\request.py", line 265, in _parse_uri
    raise FileNotFoundError("The directory %r does not exist" % dn)
FileNotFoundError: The directory 'D:\\third-part\\nerf\\logs\\fern_test\\tboard_val_imgs' does not exist

It does seem to run a number of iterations before the error occurs each time, though, since before the error I also get the following output:

fern_test 1 11.88098 0.1640698 1
iter time 0.73457
fern_test 2 22.033655 0.020605285 2
iter time 0.09175
fern_test 3 22.488821 0.1400678 3
iter time 0.08976
fern_test 4 30.131605 0.08863868 4
iter time 0.08777
fern_test 5 6.055036 0.45202515 5
iter time 0.09375
fern_test 6 8.739907 0.22000109 6
iter time 0.07879
fern_test 7 17.963327 0.052584127 7
iter time 0.07480
fern_test 8 22.69207 0.010006983 8
iter time 0.07879
fern_test 9 21.909454 0.022964481 9
iter time 0.08178
fern_test 100 15.195176 0.06504251 100
iter time 0.07651
fern_test 200 26.161882 0.36329496 200
iter time 0.07482
fern_test 300 25.880293 0.006590659 300
iter time 0.07441
fern_test 400 19.069527 0.018049838 400
iter time 0.08732
fern_test 500 24.096317 0.008710936 500
iter time 0.07558

krikru · Nov 11 '20

I created the directory tboard_val_imgs manually, and now the script seems to run and populate the folder with some images.

krikru · Nov 11 '20
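
One way to avoid creating these folders by hand is to make them before the script writes to them. A minimal sketch, with the paths reconstructed from the errors in this thread (not the repo's own code; adjust basedir and expname to your config):

import os

# Pre-create the output folders the script expects, so imageio.imwrite and
# the summary writer don't fail with FileNotFoundError on the first run.
basedir, expname = './logs', 'fern_test'
os.makedirs(os.path.join(basedir, expname, 'tboard_val_imgs'), exist_ok=True)
# The 'summaries' folder mentioned at the top of the thread; its exact
# location relative to basedir is a guess.
os.makedirs(os.path.join(basedir, 'summaries', expname), exist_ok=True)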

The ResourceExhaustedError seems to occur only when the script tries to generate an output image, as it runs to fern_test 500 before crashing. It turned out to be sufficient to just set N_samples and N_importance to 50 (I haven't checked the exact limits) and leave the other settings unchanged (and not use the --no_batching option). However, I'm using a batch size of 300 now since on my GPU each batch seems to take only about a third of the time compared to when it is 1024.

krikru · Nov 11 '20
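
For the record, the settings reported as working on the 6 GB card above amount to roughly this in the config file (though, as the following comments note, lowering --chunk is the better knob since it does not change the results):

N_rand = 300
N_samples = 50
N_importance = 50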

Ok, if it is breaking when rendering the test example, decrease the value of --chunk.

tancik · Nov 11 '20
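
For example (the default is 1024*32 = 32768 rays per evaluation chunk, if I remember the argument defaults correctly, so halving or quartering it is a reasonable first step):

python run_nerf.py --config config_fern.txt --chunk 8192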

Reducing --chunk instead of decreasing N_samples and N_importance (by the same proportion) indeed seems to work too. Is that preferable?

krikru · Nov 15 '20

Yes, reducing chunk is preferred. It does not change the output of the training (unlike N_samples and N_importance); it just runs slower during evaluation.

tancik · Nov 16 '20
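
This also explains the tensor shape in the error. Assuming the config's defaults of N_samples = 64 and N_importance = 64, the test-time render embeds chunk * (N_samples + N_importance) = 32768 * 128 = 4,194,304 points at once, each with a 90-dimensional encoding (63 for position plus 27 for view direction with the default encoder settings), which matches shape[4194304,90] above; N_rand only sizes the training batches, which is why changing it had no effect here. The chunked evaluation that --chunk controls follows roughly the pattern below (a simplified NumPy sketch of the idea, not the repo's TensorFlow code):

import numpy as np

def batchify(fn, chunk):
    """Evaluate fn over the first axis in minibatches of `chunk` rows.

    Concatenating the per-chunk results gives exactly the same output as
    calling fn on everything at once; only the peak memory per call shrinks,
    which is why lowering --chunk slows evaluation without changing results.
    """
    def ret(inputs):
        return np.concatenate(
            [fn(inputs[i:i + chunk]) for i in range(0, inputs.shape[0], chunk)],
            axis=0)
    return ret

# Toy usage: a stand-in "network" applied to 100000 rows, 8192 rows at a time.
net = lambda x: 2.0 * x
out = batchify(net, chunk=8192)(np.ones((100000, 90), dtype=np.float32))
assert out.shape == (100000, 90)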