petastorm
Training stuck in garbage collection after first epoch (TensorFlow)
After running the first epoch, my training gets stuck in what looks like an endless GC loop. I kept it running for 18 hours and it is still going, while the whole training should finish in under 4 hours.
I don't understand what is happening and I cannot find any resource online. It started happening when I began using a Petastorm distributed dataset for TensorFlow.
I really do not know what else to try. Any suggestions, please?
Thank you
10003/10003 [==============================] - ETA: 0s - factorized_top_k/top_1_categorical_accuracy: 0.0012 - factorized_top_k/top_5_categorical_accuracy: 0.0059 - factorized_top_k/top_10_categorical_accuracy: 0.0099 - factorized_top_k/top_50_categorical_accuracy: 0.0320 - factorized_top_k/top_100_categorical_accuracy: 0.0531 - loss: 4949.9486 - regularization_loss: 0.0000e+00 - total_loss: 4949.9486WARNING:tensorflow:Using a while_loop for converting BoostedTreesBucketize
2021-03-02T17:47:29.117+0000: [GC (System.gc()) [PSYoungGen: 13969618K->27949K(28663296K)] 14255296K->313635K(86098944K), 0.0210020 secs] [Times: user=0.09 sys=0.00, real=0.02 secs]
2021-03-02T17:47:29.138+0000: [Full GC (System.gc()) [PSYoungGen: 27949K->0K(28663296K)] [ParOldGen: 285685K->285689K(57435648K)] 313635K->285689K(86098944K), [Metaspace: 233785K->233785K(251904K)], 0.3277457 secs] [Times: user=1.18 sys=0.00, real=0.33 secs]
2021-03-02T18:17:29.117+0000: [GC (System.gc()) [PSYoungGen: 14582618K->28842K(28669440K)] 14868307K->314539K(86105088K), 0.0199243 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
2021-03-02T18:17:29.137+0000: [Full GC (System.gc()) [PSYoungGen: 28842K->0K(28669440K)] [ParOldGen: 285697K->285696K(57435648K)] 314539K->285696K(86105088K), [Metaspace: 233808K->233808K(251904K)], 0.3398052 secs] [Times: user=1.23 sys=0.00, real=0.34 secs]
2021-03-02T18:47:29.117+0000: [GC (System.gc()) [PSYoungGen: 14119819K->27684K(28667392K)] 14405515K->313380K(86103040K), 0.0248371 secs] [Times: user=0.06 sys=0.00, real=0.03 secs]
2021-03-02T18:47:29.142+0000: [Full GC (System.gc()) [PSYoungGen: 27684K->0K(28667392K)] [ParOldGen: 285696K->285671K(57435648K)] 313380K->285671K(86103040K), [Metaspace: 233812K->233812K(251904K)], 0.3002188 secs] [Times: user=0.71 sys=0.00, real=0.30 secs]
2021-03-02T19:17:29.117+0000: [GC (System.gc()) [PSYoungGen: 14235740K->28839K(28672512K)] 14521411K->314518K(86108160K), 0.0200179 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
2021-03-02T19:17:29.137+0000: [Full GC (System.gc()) [PSYoungGen: 28839K->0K(28672512K)] [ParOldGen: 285679K->285716K(57435648K)] 314518K->285716K(86108160K), [Metaspace: 233840K->233840K(251904K)], 0.2681088 secs] [Times: user=0.70 sys=0.00, real=0.27 secs]
2021-03-02T19:47:29.117+0000: [GC (System.gc()) [PSYoungGen: 14162318K->27884K(28670976K)] 14448035K->313608K(86106624K), 0.0222306 secs] [Times: user=0.09 sys=0.00, real=0.03 secs]
2021-03-02T19:47:29.139+0000: [Full GC (System.gc()) [PSYoungGen: 27884K->0K(28670976K)] [ParOldGen: 285724K->285709K(57435648K)] 313608K->285709K(86106624K), [Metaspace: 233849K->233849K(251904K)], 0.4094871 secs] [Times: user=1.43 sys=0.00, real=0.41 secs]
2021-03-02T20:17:29.117+0000: [GC (System.gc()) [PSYoungGen: 14255118K->28741K(28675072K)] 14540828K->314459K(86110720K), 0.0215092 secs] [Times: user=0.10 sys=0.00, real=0.03 secs]
2021-03-02T20:17:29.138+0000: [Full GC (System.gc()) [PSYoungGen: 28741K->0K(28675072K)] [ParOldGen: 285717K->277045K(57435648K)] 314459K->277045K(86110720K), [Metaspace: 233853K->233540K(251904K)], 0.5166519 secs] [Times: user=1.83 sys=0.00, real=0.51 secs]
...
More information is needed. Perhaps you can provide a small reproducible example with a dummy dataset? Which function call results in this infinite loop? The only component that relies on the Java GC is the HDFS driver (if you are using the Java-based HDFS driver). Otherwise, I am not sure which GC is emitting these log messages.
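For reference, a minimal sketch of the kind of reproducible example that would help, assuming the data is fed to Keras through petastorm's Spark converter API (the dummy dataset, cache path, batch size, and model below are placeholders, not the reporter's actual code):

```python
# Hypothetical minimal repro sketch: petastorm Spark converter feeding a Keras model.
# Cache path, schema, batch size and the model are placeholders for illustration only.
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter
import tensorflow as tf

spark = SparkSession.builder.master("local[2]").getOrCreate()
# Cache directory the converter uses to materialize intermediate Parquet files.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, "file:///tmp/petastorm_cache")

# Dummy dataset: one float feature and a binary label.
df = spark.range(10000).selectExpr("rand() as x", "cast(id % 2 as float) as y")
converter = make_spark_converter(df)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")

# make_tf_dataset yields namedtuples of columns; map them to (features, label) pairs.
with converter.make_tf_dataset(batch_size=256, num_epochs=1) as dataset:
    dataset = dataset.map(lambda batch: (tf.reshape(batch.x, (-1, 1)), batch.y))
    model.fit(dataset, epochs=1, steps_per_epoch=10)
```

A trimmed-down script in this shape, plus the exact call that hangs after the first epoch, would make it much easier to tell whether the GC activity comes from Spark, an HDFS driver, or something else on the JVM side.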
It can be because of the tf.keras.layers.experimental.preprocessing.Discretization layer. Replace it with sklearn.preprocessing.KBinsDiscretizer, applied outside of the model, and the training will run much quicker.
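For illustration, a minimal sketch of moving the bucketization out of the model with scikit-learn (the feature values and bin count are made-up placeholders):

```python
# Hypothetical sketch: discretize a numeric feature with scikit-learn before training,
# instead of using an in-model Discretization layer. Column values and n_bins are placeholders.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
ages = rng.uniform(18, 90, size=(10000, 1))  # example numeric feature

# Fit the bin edges once on the training data, outside the TensorFlow graph.
discretizer = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
age_buckets = discretizer.fit_transform(ages)  # shape (10000, 1), bucket indices 0..9

# Feed the precomputed bucket indices to the model as an ordinary integer feature;
# the model itself then no longer needs a Discretization layer.
print(age_buckets[:5].ravel())
```

Precomputing the buckets this way keeps the bucketization out of the TensorFlow graph, which avoids the while_loop fallback warned about in the training log above.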