
Zero loss when fine-tuning already fine-tuned TAPAS on custom data (both PyTorch and TensorFlow)

NielsRogge opened this issue 3 years ago · 8 comments

TL;DR: it seems I can properly fine-tune TAPAS on custom data when the classification heads are randomly initialized, but not when I'm further fine-tuning tapas_wtq_wikisql_sqa_inter_masklm_base_reset. I am experiencing this both with the official TensorFlow implementation from this repository and with my PyTorch implementation. Below I explain this in more detail, and I would really appreciate your help.

For TensorFlow, see the output of the last cell in this notebook (note that you cannot run this notebook yourself due to Google Drive dependencies). For PyTorch, you can reproduce it in this notebook.

This originated from fine-tuning my PyTorch implementation on 8 examples from the WTQ test set, to see whether it's able to overfit them (which indicates whether everything works correctly, see this). I fine-tune with a batch size of 1 (i.e. feeding the model one example at a time in each forward pass).
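
As a reference for what "able to overfit" means here, the sanity check can be reduced to its essence with a toy model (plain Python, no TAPAS or PyTorch involved): train repeatedly on a handful of fixed examples with batch size 1 and verify the loss collapses toward zero. This is a generic sketch, not the actual TAPAS training loop:

```python
# Sanity check: a training setup that works should be able to drive the loss
# to (near) zero on a handful of examples it sees over and over.
# Toy stand-in: fit y = w * x with per-example gradient descent (batch size 1).

def overfit_check(examples, lr=0.01, epochs=200):
    w = 0.0
    first_loss, last_loss = None, None
    for _ in range(epochs):
        epoch_loss = 0.0
        for x, y in examples:  # one example per forward/backward pass
            pred = w * x
            loss = (pred - y) ** 2
            grad = 2 * (pred - y) * x
            w -= lr * grad
            epoch_loss += loss
        if first_loss is None:
            first_loss = epoch_loss
        last_loss = epoch_loss
    return first_loss, last_loss

first, last = overfit_check([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
# the final epoch loss is many orders of magnitude below the first one
```

If the loss does not collapse in such a setup, something in the forward pass, the loss computation, or the backward pass is broken.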

When testing with randomly initialized aggregation and cell selection heads (i.e. loading the model with the weights of tapas_inter_masklm_base_reset), this seems to work fine (you can clearly see the loss going down towards zero). The loss is oftentimes 0.0 because the answer_loss exceeds answer_loss_cutoff, which masks it out:

Example: 0
Loss: 4.5615763664245605
Example: 1
Loss: 0.0
Example: 2
Loss: 3.652130603790283
Example: 3
Loss: 0.0
Example: 4
Loss: 0.0
Example: 5
Loss: 0.0
Example: 6
Loss: 0.0
Example: 7
Loss: 0.0
Example: 0
Loss: 3.242696762084961
Example: 1
Loss: 0.0
Example: 2
Loss: 2.7367563247680664
Example: 3
Loss: 0.0
Example: 4
Loss: 0.0
Example: 5
Loss: 0.0
Example: 6
Loss: 1.6930313110351562
Example: 7
Loss: 1.2307015657424927
Example: 0
Loss: 2.7896220684051514
Example: 1
Loss: 0.0
Example: 2
Loss: 2.4945666790008545
Example: 3
Loss: 0.0
Example: 4
Loss: 0.0
Example: 5
Loss: 0.0
Example: 6
Loss: 1.1861155033111572
Example: 7
Loss: 0.0
Example: 0
Loss: 2.6183369159698486
Example: 1
Loss: 0.0
Example: 2
Loss: 2.445152997970581
Example: 3
Loss: 0.0
Example: 4
Loss: 0.0
Example: 5
Loss: 0.0
Example: 6
Loss: 0.7637995481491089
Example: 7
Loss: 0.0
Example: 0
Loss: 2.4819650650024414
Example: 1
Loss: 0.0
Example: 2
Loss: 2.193347692489624
Example: 3
Loss: 0.0
Example: 4
Loss: 0.0
Example: 5
Loss: 0.0
Example: 6
Loss: 0.5648488998413086
Example: 7
Loss: 0.0
Example: 0
Loss: 2.345994234085083
Example: 1
Loss: 0.0
Example: 2
Loss: 2.023991823196411
Example: 3
Loss: 0.0
Example: 4
Loss: 0.0
Example: 5
Loss: 0.0
Example: 6
Loss: 0.34213101863861084
Example: 7
Loss: 0.0
Example: 0
Loss: 2.2398712635040283
Example: 1
Loss: 0.0
Example: 2
Loss: 1.8987535238265991
Example: 3
Loss: 0.0
Example: 4
Loss: 0.0
Example: 5
Loss: 0.0
Example: 6
Loss: 0.08033487945795059
Example: 7
Loss: 0.0
Example: 0
Loss: 2.157059669494629
Example: 1
Loss: 0.0
Example: 2
Loss: 1.7860664129257202
Example: 3
Loss: 0.0
Example: 4
Loss: 0.0
Example: 5
Loss: 0.0
Example: 6
Loss: 0.22179166972637177
Example: 7
Loss: 0.0
Example: 0
Loss: 2.0470402240753174
Example: 1
Loss: 0.0
Example: 2
Loss: 1.657584547996521
Example: 3
Loss: 0.0
Example: 4
Loss: 0.0
Example: 5
Loss: 0.0
Example: 6
Loss: 0.08463436365127563
Example: 7
Loss: 0.0
Example: 0
Loss: 1.8794533014297485
Example: 1
Loss: 0.0
Example: 2
Loss: 1.5687501430511475
Example: 3
Loss: 0.0
Example: 4
Loss: 0.0
Example: 5
Loss: 0.0
Example: 6
Loss: 0.0428503081202507
Example: 7
Loss: 0.0
Example: 0
Loss: 1.7972699403762817
Example: 1
Loss: 0.0
Example: 2
Loss: 1.5882041454315186
Example: 3
Loss: 0.0
Example: 4
Loss: 0.0
Example: 5
Loss: 0.0
Example: 6
Loss: 0.25950801372528076
Example: 7
Loss: 0.0
Example: 0
Loss: 2.403681993484497
Example: 1
Loss: 0.0
Example: 2
Loss: 1.8372855186462402
Example: 3
Loss: 0.0
Example: 4
Loss: 0.0
Example: 5
Loss: 0.0
Example: 6
Loss: 0.19665803015232086
Example: 7
Loss: 0.0
Example: 0
Loss: 6.208915710449219
Example: 1
Loss: 0.0
Example: 2
Loss: 1.7333526611328125
Example: 3
Loss: 0.0
Example: 4
Loss: 0.0
Example: 5
Loss: 0.0
Example: 6
Loss: 0.20829634368419647
Example: 7
Loss: 0.0
Example: 0
Loss: 2.4905076026916504
Example: 1
Loss: 0.0
Example: 2
Loss: 1.5237752199172974
Example: 3
Loss: 0.0
Example: 4
Loss: 0.0
Example: 5
Loss: 2.8046045303344727
Example: 6
Loss: 0.19047576189041138
Example: 7
Loss: 0.0
Example: 0
Loss: 2.147810220718384
Example: 1
Loss: 3.3594565391540527
Example: 2
Loss: 1.2260363101959229
Example: 3
Loss: 2.4119818210601807
Example: 4
Loss: 3.0491719245910645
Example: 5
Loss: 2.6936562061309814
Example: 6
Loss: 0.25881990790367126
Example: 7
Loss: 2.8956243991851807
Example: 0
Loss: 1.8915693759918213
Example: 1
Loss: 2.4552807807922363
Example: 2
Loss: 1.2063438892364502
Example: 3
Loss: 1.9171514511108398
Example: 4
Loss: 2.8656418323516846
Example: 5
Loss: 2.5741631984710693
Example: 6
Loss: 0.04491106793284416
Example: 7
Loss: 2.08845853805542
Example: 0
Loss: 1.6714025735855103
Example: 1
Loss: 2.204368829727173
Example: 2
Loss: 1.0228660106658936
Example: 3
Loss: 1.4084817171096802
Example: 4
Loss: 2.560169219970703
Example: 5
Loss: 1.9598442316055298
Example: 6
Loss: 0.16072653234004974
Example: 7
Loss: 1.8947786092758179
Example: 0
Loss: 1.438791036605835
Example: 1
Loss: 1.651200771331787
Example: 2
Loss: 0.7419790029525757
Example: 3
Loss: 0.9808005690574646
Example: 4
Loss: 2.432058334350586
Example: 5
Loss: 1.319388508796692
Example: 6
Loss: 0.14465902745723724
Example: 7
Loss: 1.6689748764038086
Example: 0
Loss: 1.1740055084228516
Example: 1
Loss: 1.189043641090393
Example: 2
Loss: 0.5398354530334473
Example: 3
Loss: 0.607149064540863
Example: 4
Loss: 2.23201060295105
Example: 5
Loss: 2.123021125793457
Example: 6
Loss: 0.06416704505681992
Example: 7
Loss: 1.378097414970398
Example: 0
Loss: 0.8687064051628113
Example: 1
Loss: 1.2080097198486328
Example: 2
Loss: 0.3965621292591095
Example: 3
Loss: 0.39049506187438965
Example: 4
Loss: 2.1020426750183105
Example: 5
Loss: 0.9800363183021545
Example: 6
Loss: 0.18748190999031067
Example: 7
Loss: 1.104522705078125
Example: 0
Loss: 0.6241241097450256
Example: 1
Loss: 0.5567145943641663
Example: 2
Loss: 0.29804250597953796
Example: 3
Loss: 0.26488810777664185
Example: 4
Loss: 1.8813974857330322
Example: 5
Loss: 0.7125007510185242
Example: 6
Loss: 0.10249417275190353
Example: 7
Loss: 0.7685993313789368
Example: 0
Loss: 0.36037611961364746
Example: 1
Loss: 0.34420275688171387
Example: 2
Loss: 0.1963767260313034
Example: 3
Loss: 0.19287444651126862
Example: 4
Loss: 1.7086113691329956
Example: 5
Loss: 0.410658597946167
Example: 6
Loss: 0.05394325032830238
Example: 7
Loss: 0.6112110018730164
Example: 0
Loss: 0.23545387387275696
Example: 1
Loss: 0.2360788881778717
Example: 2
Loss: 0.28410977125167847
Example: 3
Loss: 0.13694611191749573
Example: 4
Loss: 1.570082187652588
Example: 5
Loss: 0.26734864711761475
Example: 6
Loss: 0.014293872751295567
Example: 7
Loss: 0.34028884768486023
Example: 0
Loss: 0.17607349157333374
Example: 1
Loss: 0.17184586822986603
Example: 2
Loss: 0.17672620713710785
Example: 3
Loss: 0.10316784679889679
Example: 4
Loss: 1.4245171546936035
Example: 5
Loss: 0.21199432015419006
Example: 6
Loss: 0.07465721666812897
Example: 7
Loss: 0.6918496489524841
Example: 0
Loss: 0.14367932081222534
Example: 1
Loss: 0.14142344892024994
Example: 2
Loss: 0.10016531497240067
Example: 3
Loss: 0.08295570313930511
Example: 4
Loss: 1.2957371473312378
Example: 5
Loss: 0.17519520223140717
Example: 6
Loss: 0.017194071784615517
Example: 7
Loss: 0.547325611114502
Example: 0
Loss: 0.11085864156484604
Example: 1
Loss: 0.11308177560567856
Example: 2
Loss: 0.07969305664300919
Example: 3
Loss: 0.06749742478132248
Example: 4
Loss: 1.172611951828003
Example: 5
Loss: 0.1417296975851059
Example: 6
Loss: 0.1589016169309616
Example: 7
Loss: 0.630765438079834
Example: 0
Loss: 0.09442556649446487
Example: 1
Loss: 0.10788251459598541
Example: 2
Loss: 0.0669771134853363
Example: 3
Loss: 0.057267241179943085
Example: 4
Loss: 0.5829395055770874
Example: 5
Loss: 0.1244523748755455
Example: 6
Loss: 0.02626189962029457
Example: 7
Loss: 0.0
Example: 0
Loss: 0.09442073106765747
Example: 1
Loss: 0.09067925065755844
Example: 2
Loss: 0.060158804059028625
Example: 3
Loss: 0.04978378117084503
Example: 4
Loss: 0.0
Example: 5
Loss: 0.11823725700378418
Example: 6
Loss: 0.03969506174325943
Example: 7
Loss: 0.5438536405563354
Example: 0
Loss: 0.08133308589458466
Example: 1
Loss: 0.07996537536382675
Example: 2
Loss: 0.05779488757252693
Example: 3
Loss: 0.04926508665084839
Example: 4
Loss: 0.0
Example: 5
Loss: 0.10455486923456192
Example: 6
Loss: 0.044737640768289566
Example: 7
Loss: 0.03327849879860878
Example: 0
Loss: 0.08570516854524612
Example: 1
Loss: 0.07188469171524048
Example: 2
Loss: 0.0565500482916832
Example: 3
Loss: 0.045313283801078796
Example: 4
Loss: 0.0
Example: 5
Loss: 0.09494125843048096
Example: 6
Loss: 0.1099197268486023
Example: 7
Loss: 0.0
Example: 0
Loss: 0.06661611795425415
Example: 1
Loss: 0.062460556626319885
Example: 2
Loss: 0.049616992473602295
Example: 3
Loss: 0.03886038437485695
Example: 4
Loss: 0.0
Example: 5
Loss: 0.08741049468517303
Example: 6
Loss: 0.09175319224596024
Example: 7
Loss: 0.0
Example: 0
Loss: 0.059624578803777695
Example: 1
Loss: 0.05716964229941368
Example: 2
Loss: 0.044572461396455765
Example: 3
Loss: 0.035585496574640274
Example: 4
Loss: 0.0
Example: 5
Loss: 0.07967197895050049
Example: 6
Loss: 0.05738102272152901
Example: 7
Loss: 0.0
Example: 0
Loss: 0.05318525433540344
(and so on)

However, when I try to fine-tune the model with already fine-tuned classification heads (i.e. tapas_wtq_wikisql_sqa_inter_masklm_base_reset) on the same 8 examples, I get the following:

Epoch: 0
Example: 0
Loss: 47.869815826416016
Example: 1
Loss: 0.09098512679338455
Example: 2
Loss: 0.0
Example: 3
Loss: 0.0
Example: 4
Loss: 0.0
Example: 5
Loss: 0.0
Example: 6
Loss: 0.0
Example: 7
Loss: 0.0
Epoch: 1
Example: 0
Loss: 0.0
Example: 1
Loss: 0.0
Example: 2
Loss: 0.0
Example: 3
Loss: 0.0
Example: 4
Loss: 0.0
Example: 5
Loss: 0.0
Example: 6
Loss: 0.0
Example: 7
Loss: 0.0
Epoch: 2
Example: 0
Loss: 0.0
Example: 1
Loss: 0.0
Example: 2
Loss: 0.0
Example: 3
Loss: 0.0
Example: 4
Loss: 0.0
Example: 5
Loss: 0.0
Example: 6
Loss: 0.0
Example: 7
Loss: 4.081819497514516e-06
Epoch: 3
Example: 0
Loss: 0.0
Example: 1
Loss: 0.0
Example: 2
Loss: 0.0
Example: 3
(and so on)

In other words, the loss is very high for the first example and then stays (near) zero for all other examples. So I investigated a bit by printing some intermediate values:

Epoch: 0
Example: 0
Aggregation_ops_total_mass: tensor([1.], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([0.], device='cuda:0')
Selection loss per example: tensor([5.2395], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([42.6303], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([1.], device='cuda:0')
Per example additional loss with answer loss: tensor([42.6303], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 47.869815826416016
Example: 1
Aggregation_ops_total_mass: tensor([1.], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([1.], device='cuda:0')
Selection loss per example: tensor([4.5586], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([0.0910], device='cuda:0', dtype=torch.float64,
       grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([0.0910], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([1.], device='cuda:0')
Per example additional loss with answer loss: tensor([0.0910], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 0.09098512679338455
Example: 2
Aggregation_ops_total_mass: tensor([3.9329e-10], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([0.], device='cuda:0')
Selection loss per example: tensor([-0.], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([1.], device='cuda:0')
Per example additional loss with answer loss: tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 0.0
Example: 3
Aggregation_ops_total_mass: tensor([1.], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([1.], device='cuda:0')
Selection loss per example: tensor([13.9304], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([1.], device='cuda:0')
Per example additional loss with answer loss: tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 0.0
Example: 4
Aggregation_ops_total_mass: tensor([1.0000], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([1.], device='cuda:0')
Selection loss per example: tensor([5.9618], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([5.9605e-08], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([8.2338], device='cuda:0', dtype=torch.float64,
       grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([8.2338], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([0.], device='cuda:0')
Per example additional loss with answer loss: tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 0.0
Example: 5
Aggregation_ops_total_mass: tensor([1.0000], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([1.], device='cuda:0')
Selection loss per example: tensor([11.4647], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([5.9605e-08], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([51.4631], device='cuda:0', dtype=torch.float64,
       grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([51.4631], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([0.], device='cuda:0')
Per example additional loss with answer loss: tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 0.0
Example: 6
Aggregation_ops_total_mass: tensor([1.], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([1.], device='cuda:0')
Selection loss per example: tensor([18.2330], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([1.], device='cuda:0')
Per example additional loss with answer loss: tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
(and so on)
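
Based on the intermediate values printed above, answer_loss_cutoff seems to act as a hard mask: whenever the scaled answer loss exceeds the cutoff, the large-answer-loss mask becomes 0 and the whole answer-loss term is dropped for that example. A minimal sketch of that masking logic (my reading of the logs, not the actual TAPAS code; the WTQ default cutoff is 0.664 if I read the hyperparameters correctly):

```python
def masked_answer_loss(answer_loss, answer_loss_cutoff):
    """Zero out the answer loss for examples whose loss exceeds the cutoff."""
    if answer_loss_cutoff is None:
        large_answer_loss_mask = 1.0  # no cutoff: every example contributes
    else:
        # examples with a large answer loss are ignored entirely
        large_answer_loss_mask = 0.0 if answer_loss > answer_loss_cutoff else 1.0
    return answer_loss * large_answer_loss_mask
```

With the values from example 5 above (a Huber loss of 51.4631), the term is zeroed out under such a cutoff but kept when the cutoff is None, which matches the two runs shown.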

In other words, it looks like the answer_loss_cutoff hyperparameter is preventing the answer loss from being incorporated. So I tried to fix this by setting answer_loss_cutoff to None, but then this is what I get:

Epoch: 0
Example: 0
Aggregation_ops_total_mass: tensor([1.], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([0.], device='cuda:0')
Selection loss per example: tensor([5.2395], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([42.6303], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([1.], device='cuda:0')
Per example additional loss with answer loss: tensor([42.6303], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 47.869815826416016
Example: 1
Aggregation_ops_total_mass: tensor([1.], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([1.], device='cuda:0')
Selection loss per example: tensor([4.5586], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([0.0910], device='cuda:0', dtype=torch.float64,
       grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([0.0910], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([1.], device='cuda:0')
Per example additional loss with answer loss: tensor([0.0910], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 0.09098512679338455
Example: 2
Aggregation_ops_total_mass: tensor([3.9329e-10], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([0.], device='cuda:0')
Selection loss per example: tensor([-0.], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([1.], device='cuda:0')
Per example additional loss with answer loss: tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 0.0
Example: 3
Aggregation_ops_total_mass: tensor([1.], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([1.], device='cuda:0')
Selection loss per example: tensor([13.9304], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([1.], device='cuda:0')
Per example additional loss with answer loss: tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 0.0
Example: 4
Aggregation_ops_total_mass: tensor([1.0000], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([1.], device='cuda:0')
Selection loss per example: tensor([5.9618], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([5.9605e-08], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([8.2338], device='cuda:0', dtype=torch.float64,
       grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([8.2338], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([1.], device='cuda:0')
Per example additional loss with answer loss: tensor([8.2338], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 8.233847618103027
Example: 5
Aggregation_ops_total_mass: tensor([1.0000], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([1.], device='cuda:0')
Selection loss per example: tensor([11.4648], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([5.9605e-08], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([51.4631], device='cuda:0', dtype=torch.float64,
       grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([51.4631], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([1.], device='cuda:0')
Per example additional loss with answer loss: tensor([51.4631], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 51.463111877441406
Example: 6
Aggregation_ops_total_mass: tensor([1.], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([1.], device='cuda:0')
Selection loss per example: tensor([18.8919], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([1.], device='cuda:0')
Per example additional loss with answer loss: tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 0.0
Example: 7
Aggregation_ops_total_mass: tensor([1.], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([1.], device='cuda:0')
Selection loss per example: tensor([45.4518], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([1.], device='cuda:0')
Per example additional loss with answer loss: tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 0.0
Epoch: 1
Example: 0
Aggregation_ops_total_mass: tensor([3.4419e-10], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([0.], device='cuda:0')
Selection loss per example: tensor([-0.], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([1.], device='cuda:0')
Per example additional loss with answer loss: tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 0.0
Example: 1
Aggregation_ops_total_mass: tensor([1.], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([1.], device='cuda:0')
Selection loss per example: tensor([7.3020], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([1.], device='cuda:0')
Per example additional loss with answer loss: tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 0.0
Example: 2
Aggregation_ops_total_mass: tensor([4.2981e-10], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([0.], device='cuda:0')
Selection loss per example: tensor([-0.], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([1.], device='cuda:0')
Per example additional loss with answer loss: tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 0.0
Example: 3
Aggregation_ops_total_mass: tensor([1.], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([1.], device='cuda:0')
Selection loss per example: tensor([15.2798], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([1.], device='cuda:0')
Per example additional loss with answer loss: tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 0.0
Example: 4
Aggregation_ops_total_mass: tensor([1.], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([1.], device='cuda:0')
Selection loss per example: tensor([6.3805], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([8.1124], device='cuda:0', dtype=torch.float64,
       grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([8.1124], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([1.], device='cuda:0')
Per example additional loss with answer loss: tensor([8.1124], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 8.112383842468262
Example: 5
Aggregation_ops_total_mass: tensor([1.], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([1.], device='cuda:0')
Selection loss per example: tensor([2.2074], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([53.1968], device='cuda:0', dtype=torch.float64,
       grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([53.1968], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([1.], device='cuda:0')
Per example additional loss with answer loss: tensor([53.1968], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 53.19682312011719
Example: 6
Aggregation_ops_total_mass: tensor([1.], device='cuda:0', grad_fn=<SumBackward1>)
Aggregate mask: tensor([1.], device='cuda:0')
Selection loss per example: tensor([19.0292], device='cuda:0', grad_fn=<AddBackward0>)
Per example additional loss (only for cell selection examples): tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Per example answer loss (Huber loss): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<SWhereBackward>)
Per example answer loss (scaled): tensor([0.], device='cuda:0', dtype=torch.float64, grad_fn=<MulBackward0>)
Large answer loss mask tensor([1.], device='cuda:0')
Per example additional loss with answer loss: tensor([0.], device='cuda:0', grad_fn=<MulBackward0>)
Loss: 0.0
(and so on)

=> I always get the same losses for the same examples, and this never changes; the same thing happens with the TensorFlow implementation. The only possible explanation for a loss of zero would be that the model is perfectly able to predict the answer coordinates and aggregation operator, but this doesn't seem to be the case. As the backward pass of my PyTorch implementation seems to work fine with randomly initialized classification heads, I suspect that the hyperparameters must be set very carefully in order to further fine-tune the already fine-tuned WTQ model on custom data. Do you have any advice to fix this, or do you suspect a mistake in the backward pass? @thomasmueller-google @eisenjulian

NielsRogge commented on Nov 26 '20

Sorry for the late reply!

It doesn't seem there is anything wrong with your implementation.

We did observe in the past that the WTQ models are sensitive to the hyperparameters, whereas the SQA and TabFact models seem to be much more robust.

This is probably caused by the L2 loss, which is of course somewhat brittle. For example, it depends on the numeric magnitude of the answers (1 vs. 10,000), which doesn't really make sense, but we weren't able to improve on this with a normalized version of the loss.
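
The scale dependence is easy to see with the Huber loss used for the answer term: the same relative error yields a vastly different loss depending on the answer's magnitude. A small illustration (a generic Huber loss with delta = 1.0, chosen for simplicity; not the exact TAPAS configuration):

```python
def huber(error, delta=1.0):
    # quadratic near zero, linear in the tails
    a = abs(error)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

# a 10% relative error on an answer of 1 vs. an answer of 10,000:
small = huber(0.1)     # predicting 1.1 for target 1.0      -> 0.005
large = huber(1000.0)  # predicting 11,000 for target 10,000 -> 999.5
```

The same relative mistake is penalized roughly 200,000 times more heavily at the larger magnitude, which is the brittleness described above.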

ghost commented on Nov 30 '20

Ok, so it's better to fine-tune with randomly initialized classification heads?

NielsRogge commented on Dec 01 '20

Hi Niels, I think it could make sense to increase the answer_loss_cutoff, or remove it entirely as you have done, when fine-tuning. Especially in a single-batch experiment like the one you are running, having many examples fall outside the boundary is of course very detrimental to training. The point of the flag was to allow the model to disregard noisy examples while focusing on the easier ones, but the specific value was tuned for WTQ and could be brittle for other tasks. If in doubt, make it larger.

Regarding your logs, I don't quite understand why the loss would be zero when you disabled answer_loss_cutoff. It certainly happens that the so-called answer loss is zero when, for instance, we have a string answer and don't try to build an aggregation, but are you sure that what you are logging as the loss is actually the full loss? In that case I would expect a loss from the cell selection component, but not from the rest. If it is the full loss, then it sounds like there's a bug somewhere and we should investigate. Alternatively, you can log the gradient norm in the optimizer code to make sure the weights are being updated.
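
A framework-agnostic sketch of the suggested gradient-norm check (with a real model you would collect the per-parameter gradients from the framework, e.g. param.grad in PyTorch; here they are plain lists of floats):

```python
import math

def global_grad_norm(per_param_grads):
    """L2 norm over all gradient entries of all parameters combined."""
    total = sum(g * g for grads in per_param_grads for g in grads)
    return math.sqrt(total)

# A norm of (near) zero on a step whose loss was nonzero would point at a
# broken backward pass rather than at the hyperparameters.
norm = global_grad_norm([[3.0], [0.0, 4.0]])  # -> 5.0
```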

eisenjulian commented on Dec 08 '20

Hi @eisenjulian, I've run this experiment again with the official TF implementation. I created a separate branch called train_on_wtq_batch, which I use to train on 8 examples from the WTQ test set for 100 extra steps (batch size = 1), with the answer_loss_cutoff hyperparameter set to 0.0, printing the loss in each forward pass. I'm experiencing the same behaviour: the total loss stays zero.

The notebook to reproduce this is here; note, however, that I'm hosting the WTQ data in my personal Drive. The output of the last cell looks like this:

(...)
(interleaved INFO:tensorflow global_step/sec and examples/sec log lines omitted)
total_loss[0]
total_loss[0]
total_loss[17.1046143]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[1.3309676e-34]
total_loss[0]
total_loss[4.83032228e-29]
total_loss[6.05394291e-29]
total_loss[0]
total_loss[0]
total_loss[3.2682568e-31]
total_loss[4.57202e-30]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[4.06070204e-32]
total_loss[1.75509168e-27]
total_loss[0]
total_loss[0]
total_loss[5.12656332e-35]
total_loss[0]
total_loss[0]
total_loss[16.2514954]
total_loss[0]
total_loss[0]
total_loss[17.151844]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[41.2822304]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[1.19209282e-07]
total_loss[0]
total_loss[6.1054352e-27]
total_loss[4.52759e-37]
total_loss[26.8413811]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[7.27312565]
total_loss[9.02826666e-29]
total_loss[2.74782831e-31]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[0]
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 50100...
I1208 15:49:59.813356 140706150143872 basic_session_run_hooks.py:614] Calling checkpoint listeners before saving checkpoint 50100...
INFO:tensorflow:Saving checkpoints for 50100 into /content/results/wtq/model/model.ckpt.
I1208 15:49:59.813558 140706150143872 basic_session_run_hooks.py:618] Saving checkpoints for 50100 into /content/results/wtq/model/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 50100...
I1208 15:50:05.740594 140706150143872 basic_session_run_hooks.py:626] Calling checkpoint listeners after saving checkpoint 50100...
INFO:tensorflow:Loss for final step: 0.0.
I1208 15:50:06.008661 140706150143872 estimator.py:350] Loss for final step: 0.0.
INFO:tensorflow:training_loop marked as finished

NielsRogge avatar Dec 08 '20 16:12 NielsRogge

You definitely don't want to set it to zero: in the code it's only checked against None, so a cutoff of zero would effectively prohibit training altogether. I would set it to None or to something very high. Do the examples that you picked all have numeric answers, or are some of them text only?

eisenjulian avatar Dec 09 '20 21:12 eisenjulian

@eisenjulian the examples are a mix of everything: SUM, COUNT and NONE examples (10 examples from the WTQ test set). Setting answer_loss_cutoff to None gives me the following output:

(...)
(interleaved INFO:tensorflow global_step/sec and examples/sec log lines omitted)
total_loss[0.00326376595]
total_loss[8.23384857]
total_loss[0.113839947]
total_loss[23.1116047]
total_loss[1.54702704e-08]
total_loss[7.98242539e-09]
total_loss[5.08280182]
total_loss[44.6988525]
total_loss[0.984788179]
total_loss[1.56427035e-07]
total_loss[2.10645501e-09]
total_loss[6.59025723e-10]
total_loss[0.356237769]
total_loss[0]
total_loss[9.50814401e-29]
total_loss[0.235044]
total_loss[0]
total_loss[0]
total_loss[0]
total_loss[44.1072731]
total_loss[18.3245564]
total_loss[15.4472618]
total_loss[23.5648594]
total_loss[0]
total_loss[2.31592907e-07]
total_loss[0.235044]
total_loss[0]
total_loss[5.37112244e-10]
total_loss[2.38710356e-07]
total_loss[16.120903]
total_loss[1.19209901e-07]
total_loss[8.23384857]
total_loss[16.692667]
total_loss[1.44925e-27]
total_loss[52.5905876]
total_loss[8.76564528e-08]
total_loss[0]
total_loss[8.9416389e-08]
total_loss[43.8649673]
total_loss[17.2327785]
total_loss[0.235044]
total_loss[43.3284378]
total_loss[1.57954191e-10]
total_loss[51.99049]
total_loss[15.2884579]
total_loss[0]
total_loss[8.77211832e-08]
total_loss[3.61636721e-09]
total_loss[1.44098795e-08]
total_loss[0.235044]
total_loss[50.0172043]
total_loss[0]
total_loss[0]
total_loss[52.7779198]
total_loss[0]
total_loss[0]
total_loss[0.235044]
total_loss[52.6546516]
total_loss[34.1765747]
I1210 08:44:49.280539 139656620246912 tpu_estimator.py:2351] examples/sec: 4.65532
total_loss[7.70200639e-11]
INFO:tensorflow:global_step/sec: 4.62421
I1210 08:44:49.496577 139656620246912 tpu_estimator.py:2350] global_step/sec: 4.62421
INFO:tensorflow:examples/sec: 4.62421
I1210 08:44:49.497071 139656620246912 tpu_estimator.py:2351] examples/sec: 4.62421
total_loss[28.7156277]
INFO:tensorflow:global_step/sec: 4.45626
I1210 08:44:49.720988 139656620246912 tpu_estimator.py:2350] global_step/sec: 4.45626
INFO:tensorflow:examples/sec: 4.45626
I1210 08:44:49.721240 139656620246912 tpu_estimator.py:2351] examples/sec: 4.45626
total_loss[0.233282626]
INFO:tensorflow:global_step/sec: 4.58855
I1210 08:44:49.938894 139656620246912 tpu_estimator.py:2350] global_step/sec: 4.58855
INFO:tensorflow:examples/sec: 4.58855
I1210 08:44:49.939105 139656620246912 tpu_estimator.py:2351] examples/sec: 4.58855
total_loss[17.4976215]
INFO:tensorflow:global_step/sec: 4.66488
I1210 08:44:50.153287 139656620246912 tpu_estimator.py:2350] global_step/sec: 4.66488
INFO:tensorflow:examples/sec: 4.66488
I1210 08:44:50.154114 139656620246912 tpu_estimator.py:2351] examples/sec: 4.66488
total_loss[19.4025936]
INFO:tensorflow:global_step/sec: 4.61064
I1210 08:44:50.370204 139656620246912 tpu_estimator.py:2350] global_step/sec: 4.61064
INFO:tensorflow:examples/sec: 4.61064
I1210 08:44:50.370769 139656620246912 tpu_estimator.py:2351] examples/sec: 4.61064
total_loss[1.69543199e-13]
INFO:tensorflow:global_step/sec: 4.59321
I1210 08:44:50.587853 139656620246912 tpu_estimator.py:2350] global_step/sec: 4.59321
INFO:tensorflow:examples/sec: 4.59321
I1210 08:44:50.588041 139656620246912 tpu_estimator.py:2351] examples/sec: 4.59321
total_loss[0]
INFO:tensorflow:global_step/sec: 4.72215
I1210 08:44:50.799639 139656620246912 tpu_estimator.py:2350] global_step/sec: 4.72215
INFO:tensorflow:examples/sec: 4.72215
I1210 08:44:50.799831 139656620246912 tpu_estimator.py:2351] examples/sec: 4.72215
total_loss[7.01430974e-31]
INFO:tensorflow:global_step/sec: 4.68856
I1210 08:44:51.012896 139656620246912 tpu_estimator.py:2350] global_step/sec: 4.68856
INFO:tensorflow:examples/sec: 4.68856
I1210 08:44:51.013087 139656620246912 tpu_estimator.py:2351] examples/sec: 4.68856
total_loss[0]
INFO:tensorflow:global_step/sec: 4.65264
I1210 08:44:51.228096 139656620246912 tpu_estimator.py:2350] global_step/sec: 4.65264
INFO:tensorflow:examples/sec: 4.65264
I1210 08:44:51.228484 139656620246912 tpu_estimator.py:2351] examples/sec: 4.65264
total_loss[0.233451784]
INFO:tensorflow:global_step/sec: 4.61969
I1210 08:44:51.444320 139656620246912 tpu_estimator.py:2350] global_step/sec: 4.61969
INFO:tensorflow:examples/sec: 4.61969
I1210 08:44:51.444508 139656620246912 tpu_estimator.py:2351] examples/sec: 4.61969
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 50100...
I1210 08:44:51.445188 139656620246912 basic_session_run_hooks.py:614] Calling checkpoint listeners before saving checkpoint 50100...
INFO:tensorflow:Saving checkpoints for 50100 into /content/results/wtq/model/model.ckpt.
I1210 08:44:51.445340 139656620246912 basic_session_run_hooks.py:618] Saving checkpoints for 50100 into /content/results/wtq/model/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 50100...
I1210 08:44:57.045134 139656620246912 basic_session_run_hooks.py:626] Calling checkpoint listeners after saving checkpoint 50100...
INFO:tensorflow:Loss for final step: 0.23345178.
I1210 08:44:57.266234 139656620246912 estimator.py:350] Loss for final step: 0.23345178.
INFO:tensorflow:training_loop marked as finished
I1210 08:44:57.267120 139656620246912 error_handling.py:115] training_loop marked as finished

Unfortunately, my PyTorch implementation seems to have near-zero gradients when training on this WTQ batch (even when setting answer_loss_cutoff to None). Could you perhaps take a look? Does the output seem reasonable to you, or do you suspect a bug? https://colab.research.google.com/drive/1ZB680iVUGNKXFwmVBBpZuW9AShEZAeU9?usp=sharing
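For reference, this is roughly how I check for the near-zero-gradient symptom after a backward pass. The helper below is a framework-free sketch (the name `near_zero_grad_report` is my own); in the notebook you would feed it `(name, param.grad.view(-1).tolist())` pairs from `model.named_parameters()` after calling `loss.backward()`:

```python
import math

def near_zero_grad_report(named_grads, eps=1e-8):
    """Given (name, flat_gradient_list) pairs, return the names of
    parameters whose L2 gradient norm falls below `eps` - a quick way
    to spot parameters that receive (nearly) no learning signal."""
    dead = []
    for name, grad in named_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        if norm < eps:
            dead.append(name)
    return dead

# Example: the first (hypothetical) head gets essentially no gradient.
print(near_zero_grad_report([
    ("aggregation_head.weight", [0.0, 1e-12]),
    ("bert.encoder.weight", [0.1, 0.2]),
]))
```

If most parameter names show up in the report, the loss is effectively masked out before it reaches the backbone.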

It's strange, because when I test again on the same data but with randomly initialized classification heads, the model learns properly, with non-zero gradients (as shown above in this thread) - although that run uses an answer_loss_cutoff of 0.66.

Also, is it possible that the answer_loss_cutoff_mask is 0 (rather than 1) for examples whose answer loss is greater than the cutoff value? As defined here.
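To make the question concrete, here is a minimal, framework-free sketch of the semantics I would expect (the function name `apply_answer_loss_cutoff` and the return convention are my own, not from the repository): the mask is 0 - and the example's answer loss is dropped - whenever the per-example answer loss exceeds the cutoff.

```python
def apply_answer_loss_cutoff(answer_loss, cutoff):
    """Hypothetical per-example cutoff logic.

    Returns (masked_answer_loss, mask). When `cutoff` is None the loss
    passes through untouched; when the answer loss exceeds the cutoff,
    the mask is 0 and the example contributes no answer loss.
    """
    if cutoff is None:
        return answer_loss, 1.0
    if answer_loss > cutoff:
        return 0.0, 0.0  # mask = 0: example dropped from the answer loss
    return answer_loss, 1.0
```

With a cutoff of 0.66, an example with answer loss 5.0 would yield mask 0 under this reading, which matches the total_loss[0] entries in the log above.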

NielsRogge avatar Dec 10 '20 08:12 NielsRogge

@NielsRogge Hello, have you found a solution to this problem? I also ran into problems when further fine-tuning the already fine-tuned TAPAS on a WTQ-like dataset.

arielsho avatar Mar 08 '21 03:03 arielsho

Was anyone able to solve this issue? I am having the exact same problem.

AhmedMasryKU avatar Sep 22 '21 20:09 AhmedMasryKU