Segmentation fault on Jetson TX1 during training
Hi,
since I could not get the pretrained models running, I tried to train my own model on a Jetson TX1. Unfortunately the training stops with a segmentation fault:
th train.lua -input_h5 coco/cocotalk.h5 -input_json coco/cocotalk.json -cnn_proto cnn_model/VGG_ILSVRC_16_layers_deploy.prototxt -cnn_model cnn_model/VGG_ILSVRC_16_layers.caffemodel -max_iters 1 -batch_size 1 -language_eval 1
DataLoader loading json file: coco/cocotalk.json
vocab size is 9567
DataLoader loading h5 file: coco/cocotalk.h5
read 123287 images of size 3x256x256
max sequence length in data is 16
assigned 113287 images to split train
assigned 5000 images to split val
assigned 5000 images to split test
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message. If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553432081
Successfully loaded cnn_model/VGG_ILSVRC_16_layers.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8: 1 1 4096 1000
converting first layer conv filters from BGR to RGB...
Segmentation fault
Sometimes the segmentation fault happens earlier (right after loading the CNN model). I have the feeling that this is related to an out-of-memory issue, since shortly before the fault appears the RAM usage is over 90%. Normally, however, Torch starts to use the swap file when running out of memory, and that does not happen here...
I also tried to train on the CPU (in case GPU memory cannot be swapped), but that does not help either.
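One way to test the out-of-memory hypothesis is to watch memory while train.lua loads the caffemodel and, if no swap is actually active, add a swap file manually. This is only a rough sketch for a TX1 running the stock L4T/Ubuntu image; the swap file path and the 8 GB size are assumptions, not recommendations from the project:

# Watch memory on the TX1 while train.lua is loading the model
# (tegrastats ships with JetPack; its install location varies by release)
sudo tegrastats        # prints RAM usage roughly once per second
free -h                # shows whether any swap is configured at all

# If no swap is listed, create a swap file (path and size are assumptions)
sudo fallocate -l 8G /mnt/swapfile
sudo chmod 600 /mnt/swapfile
sudo mkswap /mnt/swapfile
sudo swapon /mnt/swapfile
free -h                # the Swap line should now show 8.0G

If the segfault disappears once swap is available, that would point fairly clearly at memory exhaustion rather than a bug in the loader.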
Has anyone already tried to train this on Jetson TX1?
I wouldn't recommend training on edge devices like the TX1. Generally speaking, train in the cloud and run on the edge (if possible)...
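If the goal is just captioning on the TX1, running inference with a pretrained checkpoint via eval.lua should need far less memory than training the VGG-16 pipeline. A minimal sketch, following the flags documented in the neuraltalk2 README (the paths are placeholders):

# Run inference with a pretrained checkpoint instead of training on the TX1
# (flags as documented in the neuraltalk2 README; paths are placeholders)
th eval.lua -model /path/to/model_checkpoint.t7 \
            -image_folder /path/to/images \
            -num_images 10
# If the GPU checkpoint still does not fit, -gpuid -1 should run it on the
# CPU, using the CPU checkpoint the README links to.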