Segmentation fault when training the net

Open Harvey-Mei opened this issue 4 years ago • 3 comments

python train.py --batch_size 24 --experiment_name shapenet-ldif --model_directory $models --model_type "ldif" --dataset_directory $dataset
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
INFO: Making dataset...
INFO: Optimized dataset detected at ./shapenet/optimized
INFO: Mapping...
INFO: is_invalid vs lower_coords: [24, 32, 1] vs [24, 32, 3]
INFO: Post-where lower_coords: [24, 32, 3]
INFO: is_invalid vs sdf coords: [24, 32, 1] vs [24, 32, 1]
INFO: In-out image summaries have been removed.
INFO: The 0-th GPU has 22390 MB free.
INFO: TensorFlow can use up to 93.1397945511389% of the total GPU memory.
INFO: Initializing variables...
INFO: No previous checkpoint detected, training from scratch.
Fatal Python error: Segmentation fault

Thread 0x00007fd78cff9700 (most recent call first):
  File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/threading.py", line 302 in wait
  File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/queue.py", line 170 in get
  File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run
  File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007fd9b5258340 (most recent call first):
  File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1441 in _call_tf_sessionrun
  File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1349 in _run_fn
  File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1365 in _do_call
  File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1358 in _do_run
  File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1179 in _run
  File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 955 in run
  File "train.py", line 263 in main
  File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/site-packages/absl/app.py", line 258 in _run_main
  File "/home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/site-packages/absl/app.py", line 312 in run
  File "train.py", line 283 in <module>
./reproduce_shapenet_autoencoder.sh: line 50: 1295263 Segmentation fault (core dumped) python train.py --batch_size 24 --experiment_name shapenet-ldif --model_directory $models --model_type "ldif" --dataset_directory $dataset
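
The per-thread dump in this log has the shape that Python's standard `faulthandler` module prints (a `Fatal Python error` header followed by `Thread 0x… (most recent call first)` frames). How it was enabled for this run is not stated in the report. As a minimal sketch, assuming one wants the same kind of dump from another script, the module can be switched on explicitly; the helper name `dump_all_threads` is hypothetical and only there for illustration.

```python
import faulthandler
import tempfile


def dump_all_threads() -> str:
    """Return a faulthandler-style traceback of every running Python thread."""
    # faulthandler writes directly to a file descriptor, so a real file
    # (not io.StringIO) is needed to capture its output.
    with tempfile.TemporaryFile() as handle:
        faulthandler.dump_traceback(file=handle, all_threads=True)
        handle.seek(0)
        return handle.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Install handlers for fatal signals such as SIGSEGV so that a crash in
    # native code still prints the Python frames of each thread to stderr.
    faulthandler.enable()
    print(dump_all_threads())
```

The same handlers can also be enabled without touching the code by starting the interpreter with `python -X faulthandler` or by setting `PYTHONFAULTHANDLER=1`. Note that the dump only covers Python frames; the native frame where the crash happened still needs a debugger such as GDB.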

I generated the dataset from the raw ShapeNetCoreV1/03001627 models by converting each .obj file to .ply and then producing a watertight .ply file with the gaps tools. After that I used the command in reproduce_shapenet_autoencoder.sh to make the dataset, and everything completed successfully. But when I tried to train the net on that dataset, it failed with the log shown above.

BTW, my environment: Ubuntu 20.04 with an RTX 3090, CUDA 11.1, and I run the code on TensorFlow 1.15. Could you give me some advice on this issue? Thank you!
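
For reports like this it can help to paste the interpreter and package versions exactly as the running environment sees them. The following is a minimal sketch, not part of the ldif repository: the function name `describe_environment` and the default package names are assumptions, and it deliberately does not import TensorFlow or query the GPU or CUDA driver.

```python
import platform
from importlib import metadata


def describe_environment(packages=("tensorflow", "tensorflow-gpu")):
    """Collect Python, OS, and installed package versions for a bug report."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed under this name.
            versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running it inside the same conda environment used for training shows which Python version and which TensorFlow distribution are actually in use.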

Harvey-Mei • Aug 01 '21 03:08

Also, I have successfully run build_gaps.sh, gaps_is_installed.sh and build_kernel.sh. With some modifications to suit my environment, the scripts printed the expected log output and generated all the required executables.

Harvey-Mei • Aug 09 '21 12:08

Thread 100 "train.py" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff865fff700 (LWP 1307527)]
0x00007fff7ca95890 in tensorflow::data::experimental::ParallelInterleaveDatasetOp::Dataset::Iterator::EnsureWorkerThreadsStarted(tensorflow::data::IteratorContext*) () from /home/mayo/anaconda3/envs/tf-1.15/lib/python3.8/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so

I got this message when debugging with GDB.

Harvey-Mei • Aug 09 '21 13:08

I have the same problem: Segmentation fault (core dumped). I ran build_gaps.sh successfully, but I can't run build_kernel.sh because of the error "unsupported GNU version! gcc versions later than 6 are not supported!". But since that step is optional, it shouldn't affect training, right? I'm using Ubuntu 20.04 with an RTX 2080 and CUDA 11.3. The env was created from the YAML file.

susuhu • Oct 06 '22 15:10