Training process killed

Open yanshengjia opened this issue 7 years ago • 3 comments

I tried to train the transformer model on my own parallel corpus (about 250 MB).

But after the graph is constructed, the process is killed before the session starts.

Graph loaded
WARNING:tensorflow:From train.py:171: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-11-27 12:32:22.021904: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-27 12:32:22.279206: I tensorflow/compiler/xla/service/service.cc:149] XLA service 0x5607d324dc90 executing computations on platform CUDA. Devices:
2018-11-27 12:32:22.279319: I tensorflow/compiler/xla/service/service.cc:157]   StreamExecutor device (0): Tesla P100-PCIE-12GB, Compute Capability 6.0
2018-11-27 12:32:22.286826: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:04:00.0
totalMemory: 11.91GiB freeMemory: 10.98GiB
2018-11-27 12:32:22.286958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2018-11-27 12:32:22.288905: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-27 12:32:22.288978: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2018-11-27 12:32:22.289007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2018-11-27 12:32:22.289527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10682 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:04:00.0, compute capability: 6.0)
Killed

Any ideas?

yanshengjia avatar Nov 27 '18 06:11 yanshengjia

Hi @yanshengjia,
have you solved this problem?

ccnankai avatar Jan 16 '19 12:01 ccnankai

Same problem here. Has anyone solved it?

@yanshengjia @ccnankai @kimdwkimdw @maximedb @Kyubyong

angyee avatar Jul 25 '19 12:07 angyee

Did you try reducing the model size?

maximedb avatar Jul 25 '19 12:07 maximedb
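
A note on narrowing this down: the log above ends with a bare `Killed` and no Python traceback, right after the GPU device is created. On Linux that usually means the process received SIGKILL, and a common cause is the kernel running out of host RAM (not GPU memory). That is an assumption here, since the thread does not confirm it. A minimal sketch for checking it, using only the Python standard library on a Unix system, is to log the process's peak memory at a few points in the training script. The helper names `peak_rss_mib`, `available_ram_mib`, and `log_memory` are hypothetical and not part of this repository.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of this process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    if sys.platform == "darwin":
        return peak / (1024 * 1024)
    return peak / 1024


def available_ram_mib(meminfo_path="/proc/meminfo"):
    """MemAvailable from /proc/meminfo in MiB, or None if it cannot be read."""
    try:
        with open(meminfo_path) as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1]) / 1024  # value is in kB
    except OSError:
        return None
    return None


def log_memory(tag):
    """Print peak RSS and available host RAM with a label."""
    avail = available_ram_mib()
    avail_txt = "unknown" if avail is None else f"{avail:.0f} MiB"
    print(f"[{tag}] peak RSS: {peak_rss_mib():.0f} MiB, available RAM: {avail_txt}")


if __name__ == "__main__":
    log_memory("start")
    # Stand-in for loading the corpus; replace with the real data loading step.
    data = [list(range(1000)) for _ in range(1000)]
    log_memory("after load")
```

Calling `log_memory` after the data is loaded and again after the graph is built would show whether peak RSS approaches the available RAM before the process dies. If it does, loading less of the corpus at once or reducing the model size, as suggested above, are the things to try.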