PUGeo
RAM
Can you share the minimum hardware requirements?
Running the test sample with
python main.py --phase test --up_ratio 4 --pretrained PUGeo_x4/model/model-final --eval_xyz test_5000
I got:
tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2022-04-28 14:56:10.459296: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2022-04-28 14:56:10.466603: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
0%| | 0/57 [00:00<?, ?it/s]
Traceback (most recent call last):
File "main.py", line 299, in <module>
main(FLAGS)
File "main.py", line 155, in main
eval_shapes(arg, sess, ops, arg.up_ratio, arg.eval_xyz)
File "main.py", line 266, in eval_shapes
input_sparse_xyz_list, gen_dense_xyz_list, gen_dense_normal_list, gen_sparse_normal_list = eval_patches(normalize_sparse_xyz, sess, arg, ops)
File "main.py", line 245, in eval_patches
gen_dense_xyz, gen_dense_normal, gen_sparse_normal = eval_per_patch(input_sparse_xyz, sess, arg, ops)
File "main.py", line 219, in eval_per_patch
ops['input_r_pl']: np.ones([arg.batch_size], dtype='f')
File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node generator/transform_net1/tconv1/Conv2D (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:221) ]]
[[Squeeze/_439]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node generator/transform_net1/tconv1/Conv2D (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:221) ]]
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node generator/transform_net1/tconv1/Conv2D:
generator/concat (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:778)
generator/transform_net1/tconv1/weights/read (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:23)
Input Source operations connected to node generator/transform_net1/tconv1/Conv2D:
generator/concat (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:778)
generator/transform_net1/tconv1/weights/read (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:23)
Original stack trace for u'generator/transform_net1/tconv1/Conv2D':
File "main.py", line 299, in <module>
main(FLAGS)
File "main.py", line 80, in main
gen_dense_xyz, gen_dense_normal, gen_sparse_normal = upsample_model.get_model(input_sparse_xyz_pl, arg.up_ratio, training_pl, knn=30, bradius=input_r_pl, scope='generator')
File "/media/maxim/information-60/PUGeo/model/model_pugeo.py", line 21, in get_model
transform = input_transform_net(edge_feature, is_training, bn_decay, K=3)
File "/media/maxim/information-60/PUGeo/utils/transform_nets.py", line 20, in input_transform_net
scope='tconv1', bn_decay=bn_decay, is_dist=is_dist)
File "/media/maxim/information-60/PUGeo/utils/tf_util.py", line 221, in conv2d
padding=padding)
File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/ops/nn_ops.py", line 1953, in conv2d
name=name)
File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
Whole output is the following:
python main.py --phase test --up_ratio 4 --pretrained PUGeo_x4/model/model-final --eval_xyz test_5000
Namespace(batch_size=8, eval_xyz='test_5000', gpu='0', jitter_max=0.03, jitter_sigma=0.01, learning_rate=0.001, log_dir='PUGeo_x4', max_epoch=400, model='model_pugeo', num_point=256, num_shape_point=5000, patch_num_ratio=3, phase='test', pretrained='PUGeo_x4/model/model-final', reg_normal1=0.1, reg_normal2=0.1, up_ratio=4)
WARNING:tensorflow:From main.py:68: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING:tensorflow:From /media/maxim/information-60/PUGeo/model/model_pugeo.py:11: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
WARNING:tensorflow:From /media/maxim/information-60/PUGeo/model/model_pugeo.py:11: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.
WARNING:tensorflow:From /media/maxim/information-60/PUGeo/utils/tf_util.py:715: calling reduce_sum_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
WARNING:tensorflow:From /media/maxim/information-60/PUGeo/utils/tf_util.py:23: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.
WARNING:tensorflow:From /media/maxim/information-60/PUGeo/utils/tf_util.py:50: The name tf.add_to_collection is deprecated. Please use tf.compat.v1.add_to_collection instead.
WARNING:tensorflow:From /media/maxim/information-60/PUGeo/utils/transform_nets.py:26: calling reduce_max_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /media/maxim/information-60/PUGeo/utils/tf_util.py:435: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.
WARNING:tensorflow:Variable += will be deprecated. Use variable.assign_add if you want assignment to the variable value or 'x = x + y' if you want a new python Tensor object.
WARNING:tensorflow:From /media/maxim/information-60/PUGeo/utils/tf_util.py:693: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /media/maxim/information-60/PUGeo/utils/loss.py:53: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From main.py:91: The name tf.losses.get_regularization_loss is deprecated. Please use tf.compat.v1.losses.get_regularization_loss instead.
WARNING:tensorflow:From main.py:102: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.
2022-04-28 14:56:07.739965: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2022-04-28 14:56:07.763486: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3699850000 Hz
2022-04-28 14:56:07.763959: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560763f604f0 executing computations on platform Host. Devices:
2022-04-28 14:56:07.763981: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2022-04-28 14:56:07.765599: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2022-04-28 14:56:07.769267: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-28 14:56:07.769502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: NVIDIA GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.815
pciBusID: 0000:01:00.0
2022-04-28 14:56:07.769544: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2022-04-28 14:56:07.771144: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2022-04-28 14:56:07.772410: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2022-04-28 14:56:07.772878: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2022-04-28 14:56:07.774220: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2022-04-28 14:56:07.775477: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2022-04-28 14:56:07.778371: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2022-04-28 14:56:07.778510: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-28 14:56:07.778686: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-28 14:56:07.778794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2022-04-28 14:56:07.778830: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2022-04-28 14:56:07.932745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-04-28 14:56:07.932774: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2022-04-28 14:56:07.932780: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2022-04-28 14:56:07.932946: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-28 14:56:07.933089: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-28 14:56:07.933201: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-28 14:56:07.933292: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6366 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
2022-04-28 14:56:07.934438: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560764a9dca0 executing computations on platform CUDA. Devices:
2022-04-28 14:56:07.934450: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): NVIDIA GeForce RTX 2070, Compute Capability 7.5
2022-04-28 14:56:08.446546: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
WARNING:tensorflow:From main.py:137: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
WARNING:tensorflow:From /media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
0%| | 0/57 [00:00<?, ?it/s]2022-04-28 14:56:10.067236: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2022-04-28 14:56:10.216326: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2022-04-28 14:56:10.459296: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2022-04-28 14:56:10.466603: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
0%| | 0/57 [00:00<?, ?it/s]
Traceback (most recent call last):
File "main.py", line 299, in <module>
main(FLAGS)
File "main.py", line 155, in main
eval_shapes(arg, sess, ops, arg.up_ratio, arg.eval_xyz)
File "main.py", line 266, in eval_shapes
input_sparse_xyz_list, gen_dense_xyz_list, gen_dense_normal_list, gen_sparse_normal_list = eval_patches(normalize_sparse_xyz, sess, arg, ops)
File "main.py", line 245, in eval_patches
gen_dense_xyz, gen_dense_normal, gen_sparse_normal = eval_per_patch(input_sparse_xyz, sess, arg, ops)
File "main.py", line 219, in eval_per_patch
ops['input_r_pl']: np.ones([arg.batch_size], dtype='f')
File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node generator/transform_net1/tconv1/Conv2D (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:221) ]]
[[Squeeze/_439]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node generator/transform_net1/tconv1/Conv2D (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:221) ]]
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node generator/transform_net1/tconv1/Conv2D:
generator/concat (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:778)
generator/transform_net1/tconv1/weights/read (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:23)
Input Source operations connected to node generator/transform_net1/tconv1/Conv2D:
generator/concat (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:778)
generator/transform_net1/tconv1/weights/read (defined at /media/maxim/information-60/PUGeo/utils/tf_util.py:23)
Original stack trace for u'generator/transform_net1/tconv1/Conv2D':
File "main.py", line 299, in <module>
main(FLAGS)
File "main.py", line 80, in main
gen_dense_xyz, gen_dense_normal, gen_sparse_normal = upsample_model.get_model(input_sparse_xyz_pl, arg.up_ratio, training_pl, knn=30, bradius=input_r_pl, scope='generator')
File "/media/maxim/information-60/PUGeo/model/model_pugeo.py", line 21, in get_model
transform = input_transform_net(edge_feature, is_training, bn_decay, K=3)
File "/media/maxim/information-60/PUGeo/utils/transform_nets.py", line 20, in input_transform_net
scope='tconv1', bn_decay=bn_decay, is_dist=is_dist)
File "/media/maxim/information-60/PUGeo/utils/tf_util.py", line 221, in conv2d
padding=padding)
File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/ops/nn_ops.py", line 1953, in conv2d
name=name)
File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/media/maxim/information-60/envs/pugeo-net/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
After some googling, I figured out that this can be caused by a lack of GPU memory. Here is the peak GPU memory consumption during the run:
nvidia-smi
Thu Apr 28 14:56:10 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| 25% 32C P2 45W / 215W | 7758MiB / 8192MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1607 G /usr/lib/xorg/Xorg 18MiB |
| 0 N/A N/A 1748 G /usr/bin/gnome-shell 68MiB |
| 0 N/A N/A 2497 G /usr/lib/xorg/Xorg 407MiB |
| 0 N/A N/A 2621 G /usr/bin/gnome-shell 71MiB |
| 0 N/A N/A 2993 G ...gAAAAAAAAA --shared-files 9MiB |
| 0 N/A N/A 3132 G ...oken=15871595316042295885 7MiB |
| 0 N/A N/A 3321 G ...569280287605370747,131072 409MiB |
| 0 N/A N/A 3905 G ...RendererForSitePerProcess 18MiB |
| 0 N/A N/A 18463 C python 6571MiB |
| 0 N/A N/A 25101 C+G colmap 165MiB |
+-----------------------------------------------------------------------------+
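To make the numbers in the table easier to read, here is a small sketch that adds up the per-process figures. The values are copied by hand from the nvidia-smi output above; the variable names and the "(truncated name)" labels are my own placeholders. It only does arithmetic on the table and does not query the GPU.

```python
# GPU memory figures in MiB, copied by hand from the nvidia-smi table above.
TOTAL_MIB = 8192  # card capacity reported by nvidia-smi
USED_MIB = 7758   # total "Memory-Usage" reported by nvidia-smi

# (pid, process name, MiB); names truncated by nvidia-smi are left as placeholders.
processes = [
    (1607, "Xorg", 18),
    (1748, "gnome-shell", 68),
    (2497, "Xorg", 407),
    (2621, "gnome-shell", 71),
    (2993, "(truncated name)", 9),
    (3132, "(truncated name)", 7),
    (3321, "(truncated name)", 409),
    (3905, "(truncated name)", 18),
    (18463, "python", 6571),
    (25101, "colmap", 165),
]

# Memory held by the PUGeo test run versus everything else on the card.
python_mib = sum(mib for _, name, mib in processes if name == "python")
other_mib = sum(mib for _, name, mib in processes if name != "python")
free_mib = TOTAL_MIB - USED_MIB

print("python: %d MiB, other processes: %d MiB, free: %d MiB"
      % (python_mib, other_mib, free_mib))
```

So the python process holds most of the card, which matches the log line saying TensorFlow created the device with 6366 MB, while desktop processes and colmap take over 1 GiB and only a few hundred MiB stay free.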
How much video memory do I need? Which device did you test on? I have an NVIDIA GeForce RTX 2070 (8 GB).
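A workaround that is commonly suggested for `CUDNN_STATUS_INTERNAL_ERROR` when GPU memory is tight is to stop TensorFlow from reserving almost all free memory up front. The sketch below sets the `TF_FORCE_GPU_ALLOW_GROWTH` environment variable before TensorFlow is imported. This is an assumption on my side: I have not confirmed that it fixes this run, and I am assuming the installed TensorFlow 1.x build honours the variable. The helper name `enable_gpu_memory_growth` is hypothetical and not part of PUGeo.

```python
import os


def enable_gpu_memory_growth(env=os.environ):
    """Ask TensorFlow to allocate GPU memory on demand instead of up front.

    Hypothetical helper, not part of PUGeo. It must run before TensorFlow
    is imported and the GPU device is created.
    """
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env


enable_gpu_memory_growth()

# Only after this point:
#   import tensorflow as tf
#
# Alternative if the variable is ignored (assumes main.py builds its own
# tf.Session, which I have not checked):
#   config = tf.ConfigProto()
#   config.gpu_options.allow_growth = True
#   sess = tf.Session(config=config)
```

The same effect should be available from the shell by prefixing the test command with `TF_FORCE_GPU_ALLOW_GROWTH=true`. Closing other GPU consumers such as colmap and the browser before running may also help.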