
RAM

Open MaksymTymkovych opened this issue 2 years ago • 0 comments

Can you share the minimum hardware requirements?

I ran the test sample with:

python main.py --phase test --up_ratio 4 --pretrained PUGeo_x4/model/model-final --eval_xyz test_5000

I got:

 tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2022-04-28 14:56:10.459296: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2022-04-28 14:56:10.466603: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
  0%|          | 0/57 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 299, in <module>
    main(FLAGS)
  File "main.py", line 155, in main
    eval_shapes(arg, sess, ops, arg.up_ratio, arg.eval_xyz)
  File "main.py", line 266, in eval_shapes
    input_sparse_xyz_list, gen_dense_xyz_list, gen_dense_normal_list, gen_sparse_normal_list = eval_patches(normalize_sparse_xyz, sess, arg, ops)
  File "main.py", line 245, in eval_patches
    gen_dense_xyz, gen_dense_normal, gen_sparse_normal = eval_per_patch(input_sparse_xyz, sess, arg, ops)
  File "main.py", line 219, in eval_per_patch
    ops['input_r_pl']: np.ones([arg.batch_size], dtype='f')
  File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node generator/transform_net1/tconv1/Conv2D (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:221) ]]
	 [[Squeeze/_439]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node generator/transform_net1/tconv1/Conv2D (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:221) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node generator/transform_net1/tconv1/Conv2D:
 generator/concat (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:778)	
 generator/transform_net1/tconv1/weights/read (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:23)

Input Source operations connected to node generator/transform_net1/tconv1/Conv2D:
 generator/concat (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:778)	
 generator/transform_net1/tconv1/weights/read (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:23)

Original stack trace for u'generator/transform_net1/tconv1/Conv2D':
  File "main.py", line 299, in <module>
    main(FLAGS)
  File "main.py", line 80, in main
    gen_dense_xyz, gen_dense_normal, gen_sparse_normal = upsample_model.get_model(input_sparse_xyz_pl, arg.up_ratio, training_pl, knn=30, bradius=input_r_pl, scope='generator')
  File "/media/maxim/information-60/PUGeo/model/model_pugeo.py", line 21, in get_model
    transform = input_transform_net(edge_feature, is_training, bn_decay, K=3)
  File "/media/maxim/information-60/PUGeo/utils/transform_nets.py", line 20, in input_transform_net
    scope='tconv1', bn_decay=bn_decay, is_dist=is_dist)
  File "/media/maxim/information-60/PUGeo/utils/tf_util.py", line 221, in conv2d
    padding=padding)
  File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/ops/nn_ops.py", line 1953, in conv2d
    name=name)
  File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

The whole output is the following:

python main.py --phase test --up_ratio 4 --pretrained PUGeo_x4/model/model-final --eval_xyz test_5000
Namespace(batch_size=8, eval_xyz='test_5000', gpu='0', jitter_max=0.03, jitter_sigma=0.01, learning_rate=0.001, log_dir='PUGeo_x4', max_epoch=400, model='model_pugeo', num_point=256, num_shape_point=5000, patch_num_ratio=3, phase='test', pretrained='PUGeo_x4/model/model-final', reg_normal1=0.1, reg_normal2=0.1, up_ratio=4)
WARNING:tensorflow:From main.py:68: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /media/maxim/information-60/PUGeo/model/model_pugeo.py:11: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /media/maxim/information-60/PUGeo/model/model_pugeo.py:11: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

WARNING:tensorflow:From /media/maxim/information-60/PUGeo/utils/tf_util.py:715: calling reduce_sum_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /media/maxim/information-60/PUGeo/utils/tf_util.py:23: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /media/maxim/information-60/PUGeo/utils/tf_util.py:50: The name tf.add_to_collection is deprecated. Please use tf.compat.v1.add_to_collection instead.

WARNING:tensorflow:From /media/maxim/information-60/PUGeo/utils/transform_nets.py:26: calling reduce_max_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /media/maxim/information-60/PUGeo/utils/tf_util.py:435: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

WARNING:tensorflow:Variable += will be deprecated. Use variable.assign_add if you want assignment to the variable value or 'x = x + y' if you want a new python Tensor object.
WARNING:tensorflow:From /media/maxim/information-60/PUGeo/utils/tf_util.py:693: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /media/maxim/information-60/PUGeo/utils/loss.py:53: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From main.py:91: The name tf.losses.get_regularization_loss is deprecated. Please use tf.compat.v1.losses.get_regularization_loss instead.

WARNING:tensorflow:From main.py:102: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

2022-04-28 14:56:07.739965: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2022-04-28 14:56:07.763486: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3699850000 Hz
2022-04-28 14:56:07.763959: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560763f604f0 executing computations on platform Host. Devices:
2022-04-28 14:56:07.763981: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2022-04-28 14:56:07.765599: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2022-04-28 14:56:07.769267: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-28 14:56:07.769502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: NVIDIA GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.815
pciBusID: 0000:01:00.0
2022-04-28 14:56:07.769544: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2022-04-28 14:56:07.771144: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2022-04-28 14:56:07.772410: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2022-04-28 14:56:07.772878: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2022-04-28 14:56:07.774220: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2022-04-28 14:56:07.775477: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2022-04-28 14:56:07.778371: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2022-04-28 14:56:07.778510: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-28 14:56:07.778686: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-28 14:56:07.778794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2022-04-28 14:56:07.778830: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2022-04-28 14:56:07.932745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-04-28 14:56:07.932774: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2022-04-28 14:56:07.932780: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2022-04-28 14:56:07.932946: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-28 14:56:07.933089: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-28 14:56:07.933201: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-28 14:56:07.933292: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6366 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
2022-04-28 14:56:07.934438: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560764a9dca0 executing computations on platform CUDA. Devices:
2022-04-28 14:56:07.934450: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): NVIDIA GeForce RTX 2070, Compute Capability 7.5
2022-04-28 14:56:08.446546: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
WARNING:tensorflow:From main.py:137: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
  0%|          | 0/57 [00:00<?, ?it/s]2022-04-28 14:56:10.067236: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2022-04-28 14:56:10.216326: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2022-04-28 14:56:10.459296: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2022-04-28 14:56:10.466603: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
  0%|          | 0/57 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 299, in <module>
    main(FLAGS)
  File "main.py", line 155, in main
    eval_shapes(arg, sess, ops, arg.up_ratio, arg.eval_xyz)
  File "main.py", line 266, in eval_shapes
    input_sparse_xyz_list, gen_dense_xyz_list, gen_dense_normal_list, gen_sparse_normal_list = eval_patches(normalize_sparse_xyz, sess, arg, ops)
  File "main.py", line 245, in eval_patches
    gen_dense_xyz, gen_dense_normal, gen_sparse_normal = eval_per_patch(input_sparse_xyz, sess, arg, ops)
  File "main.py", line 219, in eval_per_patch
    ops['input_r_pl']: np.ones([arg.batch_size], dtype='f')
  File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node generator/transform_net1/tconv1/Conv2D (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:221) ]]
	 [[Squeeze/_439]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node generator/transform_net1/tconv1/Conv2D (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:221) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node generator/transform_net1/tconv1/Conv2D:
 generator/concat (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:778)	
 generator/transform_net1/tconv1/weights/read (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:23)

Input Source operations connected to node generator/transform_net1/tconv1/Conv2D:
 generator/concat (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:778)	
 generator/transform_net1/tconv1/weights/read (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:23)

Original stack trace for u'generator/transform_net1/tconv1/Conv2D':
  File "main.py", line 299, in <module>
    main(FLAGS)
  File "main.py", line 80, in main
    gen_dense_xyz, gen_dense_normal, gen_sparse_normal = upsample_model.get_model(input_sparse_xyz_pl, arg.up_ratio, training_pl, knn=30, bradius=input_r_pl, scope='generator')
  File "/media/maxim/information-60/PUGeo/model/model_pugeo.py", line 21, in get_model
    transform = input_transform_net(edge_feature, is_training, bn_decay, K=3)
  File "/media/maxim/information-60/PUGeo/utils/transform_nets.py", line 20, in input_transform_net
    scope='tconv1', bn_decay=bn_decay, is_dist=is_dist)
  File "/media/maxim/information-60/PUGeo/utils/tf_util.py", line 221, in conv2d
    padding=padding)
  File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/ops/nn_ops.py", line 1953, in conv2d
    name=name)
  File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

After some googling, I figured out that this can be caused by a lack of GPU memory. Here is the maximum GPU memory consumption during the run:

nvidia-smi
Thu Apr 28 14:56:10 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| 25%   32C    P2    45W / 215W |   7758MiB /  8192MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1607      G   /usr/lib/xorg/Xorg                 18MiB |
|    0   N/A  N/A      1748      G   /usr/bin/gnome-shell               68MiB |
|    0   N/A  N/A      2497      G   /usr/lib/xorg/Xorg                407MiB |
|    0   N/A  N/A      2621      G   /usr/bin/gnome-shell               71MiB |
|    0   N/A  N/A      2993      G   ...gAAAAAAAAA --shared-files        9MiB |
|    0   N/A  N/A      3132      G   ...oken=15871595316042295885        7MiB |
|    0   N/A  N/A      3321      G   ...569280287605370747,131072      409MiB |
|    0   N/A  N/A      3905      G   ...RendererForSitePerProcess       18MiB |
|    0   N/A  N/A     18463      C   python                           6571MiB |
|    0   N/A  N/A     25101    C+G   colmap                            165MiB |
+-----------------------------------------------------------------------------+
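
To put the table above in numbers, here is a minimal Python sketch that parses the process rows copied from the nvidia-smi output and compares them with the reported 7758MiB / 8192MiB. The regular expression and all names (ROW, parse_rows, and so on) are illustrative assumptions and are not part of PUGeo.

```python
import re

# Process rows copied verbatim from the nvidia-smi table above.
PROCESS_ROWS = """\
|    0   N/A  N/A      1607      G   /usr/lib/xorg/Xorg                 18MiB |
|    0   N/A  N/A      1748      G   /usr/bin/gnome-shell               68MiB |
|    0   N/A  N/A      2497      G   /usr/lib/xorg/Xorg                407MiB |
|    0   N/A  N/A      2621      G   /usr/bin/gnome-shell               71MiB |
|    0   N/A  N/A      2993      G   ...gAAAAAAAAA --shared-files        9MiB |
|    0   N/A  N/A      3132      G   ...oken=15871595316042295885        7MiB |
|    0   N/A  N/A      3321      G   ...569280287605370747,131072      409MiB |
|    0   N/A  N/A      3905      G   ...RendererForSitePerProcess       18MiB |
|    0   N/A  N/A     18463      C   python                           6571MiB |
|    0   N/A  N/A     25101    C+G   colmap                            165MiB |
"""

# Totals read from the "Memory-Usage" cell of the same table.
USED_MIB = 7758
TOTAL_MIB = 8192

# Columns: GPU, GI ID, CI ID, PID, Type, Process name, GPU memory usage.
ROW = re.compile(r"^\|\s+\d+\s+\S+\s+\S+\s+(\d+)\s+(\S+)\s+(.+?)\s+(\d+)MiB \|$")


def parse_rows(text):
    """Return a list of (pid, type, name, mib) tuples, one per process row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            pid, kind, name, mib = match.groups()
            rows.append((int(pid), kind, name, int(mib)))
    return rows


rows = parse_rows(PROCESS_ROWS)
per_process_mib = sum(mib for _, _, _, mib in rows)
python_mib = sum(mib for _, _, name, mib in rows if name == "python")
other_mib = per_process_mib - python_mib
free_mib = TOTAL_MIB - USED_MIB

print("listed processes: %d MiB" % per_process_mib)  # 7743
print("python process:   %d MiB" % python_mib)       # 6571
print("other processes:  %d MiB" % other_mib)        # 1172
print("free on the GPU:  %d MiB" % free_mib)         # 434
```

So the python process holds 6571 MiB, the desktop and colmap hold about 1172 MiB, and only 434 MiB of the 8192 MiB remain free. That is consistent with the memory explanation, but it does not prove it.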

How much video memory do I need? Which device did you test on? I have an NVIDIA GeForce RTX 2070.
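
A commonly reported workaround for "Could not create cudnn handle" on a nearly full GPU is to let TensorFlow allocate GPU memory on demand instead of reserving most of it at startup. In TensorFlow 1.x this corresponds to setting gpu_options.allow_growth = True on the tf.ConfigProto passed to tf.Session. The sketch below instead sets the TF_FORCE_GPU_ALLOW_GROWTH environment variable so that main.py stays unchanged. It assumes the installed TensorFlow build honors that variable, which is not verified for this environment, and build_env and run_test are hypothetical helper names. It may well not resolve the error.

```python
import os
import subprocess
import sys

# Arguments copied from the test command in this report.
TEST_COMMAND = [
    "main.py", "--phase", "test", "--up_ratio", "4",
    "--pretrained", "PUGeo_x4/model/model-final",
    "--eval_xyz", "test_5000",
]


def build_env(base_env=None):
    """Copy the environment and request on-demand GPU memory allocation.

    Assumption: the TensorFlow build in use reads TF_FORCE_GPU_ALLOW_GROWTH.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env


def run_test(python=sys.executable):
    """Run the test command from the repository root and return its exit code."""
    return subprocess.call([python] + TEST_COMMAND, env=build_env())
```

Calling run_test() from the PUGeo directory would rerun the same test with the variable set. Closing other GPU applications (the browser, colmap) before the run is another thing that might be worth trying.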

MaksymTymkovych, Apr 28 '22 12:04