ComplexGen
Trying to allocate about 5000 GiB
Hello! I'm interested in your great work and am trying to run your code. However, I'm having trouble with the following error, which says I'm trying to allocate about 5000 GiB. That number seems far too large. Do you have any idea what might cause this error, for example something related to your data or model size?
- Docker image: pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel
- GPU: GeForce RTX 3090, 24GB
root@oucyz:/workspace# scripts/
not detected /blob directory, execute locally
Utilize 1 gpus
/root/.local/lib/python3.8/site-packages/MinkowskiEngine/ UserWarning: The environment variable `OMP_NUM_THREADS` not set. MinkowskiEngine will automatically set `OMP_NUM_THREADS=16`. If you want to set `OMP_NUM_THREADS` manually, please export it on the command line before running a python script. e.g. `export OMP_NUM_THREADS=12; python`. It is recommended to set it below 24.
using instance norm
2022-11-07 09:07:10.839856: I tensorflow/core/platform/] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-07 09:07:11.057767: E tensorflow/stream_executor/cuda/] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-11-07 09:07:11.771248: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-11-07 09:07:11.771496: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-11-07 09:07:11.771546: W tensorflow/compiler/tf2tensorrt/utils/] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
not detected /blob directory, execute locally
Utilize 1 gpus
using instance norm
2022-11-07 09:07:13.650011: I tensorflow/core/platform/] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-07 09:07:13.897804: E tensorflow/stream_executor/cuda/] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-11-07 09:07:14.678678: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-11-07 09:07:14.678788: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-11-07 09:07:14.678800: W tensorflow/compiler/tf2tensorrt/utils/] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
load data from data/train_small
packed pkl folder detected, will load from packed pkl file
Successfully Loaded from 19 files:19
max number of corners in single sample: 32
2 curves at least
448 valid curves total
250 valid corners total
225 patches total
min and max points in single patch: 512 512
0 open shapes
squared curve length statistics: 448 3.682233176771585e-05 9.269843007646973 0.3276165163696903
patch area statistics: 225 0.000722008498996729 1.3858339398102544 0.1676870428241
normal is included in input signal
load data from data/train_small
packed pkl folder detected, will load from packed pkl file
Successfully Loaded from 19 files:19
max number of corners in single sample: 32
2 curves at least
448 valid curves total
250 valid corners total
225 patches total
min and max points in single patch: 512 512
0 open shapes
squared curve length statistics: 448 3.682233176771585e-05 9.269843007646973 0.3276165163696903
patch area statistics: 225 0.000722008498996729 1.3858339398102544 0.1676870428241
normal is included in input signal
number of params: 22057152 87052323
Try to restore from checkpoint
0%| | 0/5 [00:00<?, ?it/s]Start Training
train data size 19
/workspace/ NumbaWarning:
Compilation is falling back to object mode WITH looplifting enabled because Function "points2sparse_voxel" failed type inference due to: No implementation of function Function(<function norm at 0x7fb7228699d0>) found for signature:
>>> norm(array(float32, 2d, A), axis=Literal[int](1), keepdims=Literal[bool](True))
There are 2 candidate implementations:
- Of which 2 did not match due to:
Overload in function 'norm_impl': File: numba/np/ Line 2351.
With argument(s): '(array(float32, 2d, A), axis=int64, keepdims=bool)':
Rejected as the implementation raised a specific error:
TypingError: got an unexpected keyword argument 'axis'
raised from /root/.local/lib/python3.8/site-packages/numba/core/typing/
During: resolving callee type: Function(<function norm at 0x7fb7228699d0>)
During: typing of call at /workspace/ (255)
File "", line 255:
def points2sparse_voxel(points_with_normal, voxel_dim, feature_type, with_normal, pad1s):
<source elided>
voxel_coord = np.clip(np.floor(points / voxel_length).astype(np.int32), 0, voxel_dim-1)
points_normal_norm = linalg.norm(points_with_normal[:,3:], axis=1, keepdims=True)
/root/.local/lib/python3.8/site-packages/numba/core/ NumbaWarning: Function "points2sparse_voxel" was compiled in object mode without forceobj=True.
File "", line 249:
def points2sparse_voxel(points_with_normal, voxel_dim, feature_type, with_normal, pad1s):
/root/.local/lib/python3.8/site-packages/numba/core/ NumbaDeprecationWarning:
Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.
For more information visit
File "", line 249:
def points2sparse_voxel(points_with_normal, voxel_dim, feature_type, with_normal, pad1s):
0%| | 0/5 [00:05<?, ?it/s]
Traceback (most recent call last):
File "", line 4582, in <module>
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/", line 157, in start_processes
while not context.join():
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/", line 118, in join
raise Exception(msg)
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/", line 19, in _wrap
fn(i, *args)
File "/workspace/", line 4370, in pipeline_abc
patch_loss_dict, patch_matching_indices = patch_loss_criterion(patch_predictions, target_patches_list)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/workspace/", line 2532, in forward
losses.update(self.get_loss(loss, outputs, targets, indices, num_corners))
File "/workspace/", line 2495, in get_loss
return loss_map[loss](outputs, targets, indices, num_patches, **kwargs)
File "/workspace/", line 2220, in loss_geometry
loss_geom[uclose_id] = emd_by_id(target_patch_points_batch[uclose_id], src_patch_points[uclose_id], self.emd_idlist_u, points_per_patch_dim)
RuntimeError: CUDA out of memory. Tried to allocate 4966.70 GiB (GPU 0; 23.69 GiB total capacity; 9.58 GiB already allocated; 12.36 GiB free; 9.62 GiB reserved in total by PyTorch)
Hi, oucyz. I haven't encountered this issue before. Have you tried running a forward pass with the checkpoint directly by running: ./scripts/
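While debugging, it may help to check whether the tensor shapes reaching `emd_by_id` are plausible. The sketch below is only a back-of-the-envelope helper: `estimate_gib` and `elements_for_gib` are hypothetical names, not part of ComplexGen, and the example shape is an illustrative guess built from the "225 patches" and "512 points" figures in the log, not the actual layout used by the loss. It assumes dense float32 tensors (4 bytes per element).

```python
import math


def estimate_gib(shape, bytes_per_element=4):
    """Size in GiB of a dense tensor with the given shape (float32 by default)."""
    return math.prod(shape) * bytes_per_element / 2**30


def elements_for_gib(gib, bytes_per_element=4):
    """Number of elements a dense tensor of the given size in GiB would hold."""
    return gib * 2**30 / bytes_per_element


if __name__ == "__main__":
    # Illustrative shape only (assumption): one 512 x 512 pairwise-distance
    # matrix for each of the 225 patches reported in the log.
    print(f"pairwise guess: {estimate_gib((225, 512, 512)):.2f} GiB")

    # How many float32 elements the failed 4966.70 GiB request corresponds to.
    print(f"requested elements: {elements_for_gib(4966.70):.3e}")
```

If the element count implied by the failed allocation is many orders of magnitude above what the logged patch and point counts suggest, printing the `.shape` of each argument just before the `emd_by_id` call in `loss_geometry` should show which dimension is unexpectedly large.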