sagemaker-debugger
inconsistent behavior of script tf_simple.py when hook used with MonitoredSession
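If I run the script tf_simple.py with tf.train.MonitoredSession(hooks=[hook]), I see inconsistent behavior: the global step jumps back to an earlier value and the loss increases again. Link to script - https://gist.github.com/Vikas-kum/a726aa05f70cbc22da55aac6f9f122d2
Repro - The command to run and reproduce is given in the header of the script (link above). Output with the hook attached: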
Step=0, Loss=90.67911529541016 gstep:2
Step=1, Loss=97.25459289550781 gstep:3
Step=2, Loss=72.63609313964844 gstep:4
Step=3, Loss=49.64006423950195 gstep:5
Step=4, Loss=30.262378692626953 gstep:6
Step=5, Loss=26.098041534423828 gstep:7
Step=6, Loss=23.234188079833984 gstep:8
Step=7, Loss=14.143218994140625 gstep:9
Step=8, Loss=6.640719413757324 gstep:10
Step=9, Loss=2.9191393852233887 gstep:11
Step=10, Loss=1.1926181316375732 gstep:3 ====> global step 3 again
Step=11, Loss=67.19161224365234 gstep:4 =====> loss increased
Step=12, Loss=62.6436767578125 gstep:5
Step=13, Loss=44.932037353515625 gstep:6
Step=14, Loss=54.5485954284668 gstep:7
Step=15, Loss=28.61581039428711 gstep:8
Step=16, Loss=25.332355499267578 gstep:9
Step=17, Loss=18.563230514526367 gstep:10
Step=18, Loss=8.643794059753418 gstep:11
Step=19, Loss=5.633042335510254 gstep:12
Step=20, Loss=1.1502041816711426 gstep:2
Step=21, Loss=95.97285461425781 gstep:3
Step=22, Loss=63.6973991394043 gstep:4
Step=23, Loss=45.747554779052734 gstep:5
Step=24, Loss=25.462902069091797 gstep:6
Step=25, Loss=25.730255126953125 gstep:7
But if I comment out line 1 and use line 2, as shown below:
#sess = tf.train.MonitoredSession(hooks=[hook])
sess = tf.train.MonitoredSession()
I get correct behavior. Example output:
Step=0, Loss=67.61869812011719 gstep:2
Step=1, Loss=109.72452545166016 gstep:3
Step=2, Loss=89.4232177734375 gstep:4
Step=3, Loss=40.550193786621094 gstep:5
Step=4, Loss=46.2119026184082 gstep:6
Step=5, Loss=38.09912109375 gstep:7
Step=6, Loss=21.49539566040039 gstep:8
Step=7, Loss=16.05667495727539 gstep:9
Step=8, Loss=7.0712432861328125 gstep:10
Step=9, Loss=2.7082438468933105 gstep:11
Step=10, Loss=1.6834074258804321 gstep:12
Step=11, Loss=0.2472914457321167 gstep:13
Step=12, Loss=0.0006980320904403925 gstep:14
Step=13, Loss=0.19466720521450043 gstep:15
Step=14, Loss=0.8360849618911743 gstep:16
Step=15, Loss=2.3243532180786133 gstep:17
Step=16, Loss=3.5155558586120605 gstep:18
Step=17, Loss=3.3111186027526855 gstep:19
Step=18, Loss=4.183402061462402 gstep:20
Step=19, Loss=5.629175186157227 gstep:21
Step=20, Loss=6.101352214813232 gstep:22
Step=21, Loss=5.324296951293945 gstep:23
Step=22, Loss=5.301041603088379 gstep:24
Step=23, Loss=4.981998443603516 gstep:25
Step=24, Loss=5.992074489593506 gstep:26
Step=25, Loss=7.53415584564209 gstep:27
Step=26, Loss=4.8035888671875 gstep:28
Step=27, Loss=2.3003716468811035 gstep:29
Step=28, Loss=3.3655598163604736 gstep:30
Step=29, Loss=1.9064804315567017 gstep:31
Step=30, Loss=1.332509160041809 gstep:32
Step=31, Loss=1.2492618560791016 gstep:33
Step=32, Loss=0.3721589744091034 gstep:34
Step=33, Loss=0.20127233862876892 gstep:35
Step=34, Loss=0.039012569934129715 gstep:36
Step=35, Loss=2.4094073523883708e-05 gstep:37
Step=36, Loss=0.03809528425335884 gstep:38
Step=37, Loss=0.10105834901332855 gstep:39
Step=38, Loss=0.35051339864730835 gstep:40
Step=39, Loss=0.33885806798934937 gstep:41
Step=40, Loss=0.5717775821685791 gstep:42
Step=41, Loss=0.5270355343818665 gstep:43
I tried with TensorFlow 1.15.0 and TensorFlow 1.13.1.
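The symptom in the run with the hook (gstep dropping back to an earlier value while Step keeps counting up) can be checked mechanically when comparing runs. Below is a minimal sketch in plain Python, assuming the log entries have the `Step=..., Loss=... gstep:...` shape shown in the outputs above. The helper name `find_gstep_resets` is hypothetical and is not part of sagemaker-debugger or TensorFlow; the two sample strings are short excerpts copied from the logs in this report.

```python
import re

# Matches entries such as "Step=10, Loss=1.1926181316375732 gstep:3".
# Anything between entries (for example the "====> ..." notes) is skipped.
ENTRY = re.compile(r"Step=(\d+), Loss=(\S+) gstep:(\d+)")


def find_gstep_resets(log_text):
    """Hypothetical helper: return (step, previous_gstep, gstep) for every
    entry whose global step is not greater than the one before it."""
    resets = []
    previous = None
    for match in ENTRY.finditer(log_text):
        step, gstep = int(match.group(1)), int(match.group(3))
        if previous is not None and gstep <= previous:
            resets.append((step, previous, gstep))
        previous = gstep
    return resets


# Excerpt from the run with hooks=[hook].
with_hook = (
    "Step=9, Loss=2.9191393852233887 gstep:11 "
    "Step=10, Loss=1.1926181316375732 gstep:3 "
    "Step=11, Loss=67.19161224365234 gstep:4"
)
# Excerpt from the run without the hook.
without_hook = (
    "Step=9, Loss=2.7082438468933105 gstep:11 "
    "Step=10, Loss=1.6834074258804321 gstep:12 "
    "Step=11, Loss=0.2472914457321167 gstep:13"
)

print(find_gstep_resets(with_hook))     # [(10, 11, 3)]
print(find_gstep_resets(without_hook))  # []
```

On the excerpts above, the run with the hook reports a reset at Step=10 (gstep 11 to 3), while the run without the hook reports none. This only detects the symptom in the printed output; it says nothing about why the global step is reset.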