Segmentation fault on Jetson TX1 during training
Hi,
since I could not get the pretrained models running, I tried to train my own model on a Jetson TX1. Unfortunately the training stops with a segmentation fault:
th train.lua -input_h5 coco/cocotalk.h5 -input_json coco/cocotalk.json -cnn_proto cnn_model/VGG_ILSVRC_16_layers_deploy.prototxt -cnn_model cnn_model/VGG_ILSVRC_16_layers.caffemodel -max_iters 1 -batch_size 1 -language_eval 1
DataLoader loading json file: coco/cocotalk.json
vocab size is 9567
DataLoader loading h5 file: coco/cocotalk.h5
read 123287 images of size 3x256x256
max sequence length in data is 16
assigned 113287 images to split train
assigned 5000 images to split val
assigned 5000 images to split test
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message. If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553432081
Successfully loaded cnn_model/VGG_ILSVRC_16_layers.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8: 1 1 4096 1000
converting first layer conv filters from BGR to RGB...
Segmentation fault
Sometimes the segmentation fault happens earlier (right after loading the CNN model). I have the feeling that this is related to an out-of-memory issue, since shortly before the fault appears the RAM usage is over 90%. Normally, however, Torch starts to use the swap file when running out of memory, and that does not happen here...
I also tried to train on the CPU (in case GPU memory cannot be swapped), but that does not help either.
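One way to test the out-of-memory hypothesis is to watch memory while train.lua loads the caffemodel and, if no swap is actually active, add a swap file manually. This is only a rough sketch for a TX1 running the stock L4T/Ubuntu image; the swap file path and the 8 GB size are assumptions, not recommendations from the project:

# Watch memory on the TX1 while train.lua is loading the model
# (tegrastats ships with JetPack; its install location varies by release)
sudo tegrastats        # prints RAM usage roughly once per second
free -h                # shows whether any swap is configured at all

# If no swap is listed, create a swap file (path and size are assumptions)
sudo fallocate -l 8G /mnt/swapfile
sudo chmod 600 /mnt/swapfile
sudo mkswap /mnt/swapfile
sudo swapon /mnt/swapfile
free -h                # the Swap line should now show 8.0G

If the segfault disappears once swap is available, that would point fairly clearly at memory exhaustion rather than a bug in the loader.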
Has anyone already tried to train this on Jetson TX1?
I wouldn't recommend training on edge devices like the TX1. Generally speaking, train in the cloud and run on the edge (if possible)...
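If the goal is just captioning on the TX1, running inference with a pretrained checkpoint via eval.lua should need far less memory than training the VGG-16 pipeline. A minimal sketch, following the flags documented in the neuraltalk2 README (the paths are placeholders):

# Run inference with a pretrained checkpoint instead of training on the TX1
# (flags as documented in the neuraltalk2 README; paths are placeholders)
th eval.lua -model /path/to/model_checkpoint.t7 \
            -image_folder /path/to/images \
            -num_images 10
# If the GPU checkpoint still does not fit, -gpuid -1 should run it on the
# CPU, using the CPU checkpoint the README links to.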