petastorm
Training stuck in garbage collection after first epoch (TensorFlow)
After running the first epoch, my training gets stuck in what looks like an endless GC loop. I kept it running for 18 hours and it is still going, while the whole training should finish in under 4 hours.
I don't understand what is happening and I cannot find any resource online. It started happening when I began using a Petastorm distributed dataset for TensorFlow.
I really do not know what else to try. Any suggestions, please?
Thank you
10003/10003 [==============================] - ETA: 0s - factorized_top_k/top_1_categorical_accuracy: 0.0012 - factorized_top_k/top_5_categorical_accuracy: 0.0059 - factorized_top_k/top_10_categorical_accuracy: 0.0099 - factorized_top_k/top_50_categorical_accuracy: 0.0320 - factorized_top_k/top_100_categorical_accuracy: 0.0531 - loss: 4949.9486 - regularization_loss: 0.0000e+00 - total_loss: 4949.9486WARNING:tensorflow:Using a while_loop for converting BoostedTreesBucketize
2021-03-02T17:47:29.117+0000: [GC (System.gc()) [PSYoungGen: 13969618K->27949K(28663296K)] 14255296K->313635K(86098944K), 0.0210020 secs] [Times: user=0.09 sys=0.00, real=0.02 secs]
2021-03-02T17:47:29.138+0000: [Full GC (System.gc()) [PSYoungGen: 27949K->0K(28663296K)] [ParOldGen: 285685K->285689K(57435648K)] 313635K->285689K(86098944K), [Metaspace: 233785K->233785K(251904K)], 0.3277457 secs] [Times: user=1.18 sys=0.00, real=0.33 secs]
2021-03-02T18:17:29.117+0000: [GC (System.gc()) [PSYoungGen: 14582618K->28842K(28669440K)] 14868307K->314539K(86105088K), 0.0199243 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
2021-03-02T18:17:29.137+0000: [Full GC (System.gc()) [PSYoungGen: 28842K->0K(28669440K)] [ParOldGen: 285697K->285696K(57435648K)] 314539K->285696K(86105088K), [Metaspace: 233808K->233808K(251904K)], 0.3398052 secs] [Times: user=1.23 sys=0.00, real=0.34 secs]
2021-03-02T18:47:29.117+0000: [GC (System.gc()) [PSYoungGen: 14119819K->27684K(28667392K)] 14405515K->313380K(86103040K), 0.0248371 secs] [Times: user=0.06 sys=0.00, real=0.03 secs]
2021-03-02T18:47:29.142+0000: [Full GC (System.gc()) [PSYoungGen: 27684K->0K(28667392K)] [ParOldGen: 285696K->285671K(57435648K)] 313380K->285671K(86103040K), [Metaspace: 233812K->233812K(251904K)], 0.3002188 secs] [Times: user=0.71 sys=0.00, real=0.30 secs]
2021-03-02T19:17:29.117+0000: [GC (System.gc()) [PSYoungGen: 14235740K->28839K(28672512K)] 14521411K->314518K(86108160K), 0.0200179 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
2021-03-02T19:17:29.137+0000: [Full GC (System.gc()) [PSYoungGen: 28839K->0K(28672512K)] [ParOldGen: 285679K->285716K(57435648K)] 314518K->285716K(86108160K), [Metaspace: 233840K->233840K(251904K)], 0.2681088 secs] [Times: user=0.70 sys=0.00, real=0.27 secs]
2021-03-02T19:47:29.117+0000: [GC (System.gc()) [PSYoungGen: 14162318K->27884K(28670976K)] 14448035K->313608K(86106624K), 0.0222306 secs] [Times: user=0.09 sys=0.00, real=0.03 secs]
2021-03-02T19:47:29.139+0000: [Full GC (System.gc()) [PSYoungGen: 27884K->0K(28670976K)] [ParOldGen: 285724K->285709K(57435648K)] 313608K->285709K(86106624K), [Metaspace: 233849K->233849K(251904K)], 0.4094871 secs] [Times: user=1.43 sys=0.00, real=0.41 secs]
2021-03-02T20:17:29.117+0000: [GC (System.gc()) [PSYoungGen: 14255118K->28741K(28675072K)] 14540828K->314459K(86110720K), 0.0215092 secs] [Times: user=0.10 sys=0.00, real=0.03 secs]
2021-03-02T20:17:29.138+0000: [Full GC (System.gc()) [PSYoungGen: 28741K->0K(28675072K)] [ParOldGen: 285717K->277045K(57435648K)] 314459K->277045K(86110720K), [Metaspace: 233853K->233540K(251904K)], 0.5166519 secs] [Times: user=1.83 sys=0.00, real=0.51 secs]
...
More information is needed. Perhaps you can provide a small reproducible example with a dummy dataset? Which function call results in this infinite loop? The only component that relies on the Java GC is the HDFS driver (if you are using the Java-based HDFS driver). Otherwise, I am not sure which GC is emitting these log messages.
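For reference, a minimal sketch of the kind of reproducible example that would help, assuming the data is fed to Keras through petastorm's Spark converter API (the dummy dataset, cache path, batch size, and model below are placeholders, not the reporter's actual code):

```python
# Hypothetical minimal repro sketch: petastorm Spark converter feeding a Keras model.
# Cache path, schema, batch size and the model are placeholders for illustration only.
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter
import tensorflow as tf

spark = SparkSession.builder.master("local[2]").getOrCreate()
# Cache directory the converter uses to materialize intermediate Parquet files.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, "file:///tmp/petastorm_cache")

# Dummy dataset: one float feature and a binary label.
df = spark.range(10000).selectExpr("rand() as x", "cast(id % 2 as float) as y")
converter = make_spark_converter(df)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")

# make_tf_dataset yields namedtuples of columns; map them to (features, label) pairs.
with converter.make_tf_dataset(batch_size=256, num_epochs=1) as dataset:
    dataset = dataset.map(lambda batch: (tf.reshape(batch.x, (-1, 1)), batch.y))
    model.fit(dataset, epochs=1, steps_per_epoch=10)
```

A trimmed-down script in this shape, plus the exact call that hangs after the first epoch, would make it much easier to tell whether the GC activity comes from Spark, an HDFS driver, or something else on the JVM side.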
It can be because of the tf.keras.layers.experimental.preprocessing.Discretization layer. Replace it with sklearn.preprocessing.KBinsDiscretizer, applied outside of the model, and the training will run much quicker.
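For illustration, a minimal sketch of moving the bucketization out of the model with scikit-learn (the feature values and bin count are made-up placeholders):

```python
# Hypothetical sketch: discretize a numeric feature with scikit-learn before training,
# instead of using an in-model Discretization layer. Column values and n_bins are placeholders.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
ages = rng.uniform(18, 90, size=(10000, 1))  # example numeric feature

# Fit the bin edges once on the training data, outside the TensorFlow graph.
discretizer = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
age_buckets = discretizer.fit_transform(ages)  # shape (10000, 1), bucket indices 0..9

# Feed the precomputed bucket indices to the model as an ordinary integer feature;
# the model itself then no longer needs a Discretization layer.
print(age_buckets[:5].ravel())
```

Precomputing the buckets this way keeps the bucketization out of the TensorFlow graph, which avoids the while_loop fallback warned about in the training log above.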