
Scheduled Model Regression Test Performance Drops

Open rasabot opened this issue 2 years ago • 0 comments

This issue was automatically created by the Scheduled Model Regression Test workflow. Check out the GitHub Actions run here.
---
Description of Problem:
Some test performance scores decreased. See the following tables for details.
Dataset: Carbon Bot, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 1m52s, train: 4m1s, total: 5m53s | 0.7942 (0.00) | 0.7529 (0.00) | 0.5382 (0.00) |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 2m56s, train: 5m17s, total: 8m13s | 0.8078 (0.00) | 0.7787 (0.00) | 0.5298 (0.00) |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 2m9s, train: 6m26s, total: 8m35s | 0.7903 (0.01) | 0.7529 (0.00) | 0.5629 (0.00) |
| Sparse + BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 3m16s, train: 6m13s, total: 9m29s | 0.7806 (0.00) | 0.7880 (0.00) | 0.5430 (-0.02) |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 49s, train: 3m18s, total: 4m6s | 0.7437 (0.00) | 0.7529 (0.00) | 0.5166 (-0.06) |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 1m51s, train: 5m13s, total: 7m4s | 0.7398 (0.00) | 0.7022 (0.00) | 0.5166 (-0.01) |

Dataset: Hermit, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 3m28s, train: 22m7s, total: 25m35s | 0.8987 (0.00) | 0.7504 (0.00) | no data |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 4m7s, train: 30m58s, total: 35m5s | 0.8717 (0.00) | 0.7504 (0.00) | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 1m16s, train: 23m18s, total: 24m33s | 0.8299 (-0.00) | 0.7504 (0.00) | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 2m2s, train: 19m4s, total: 21m6s | 0.8346 (0.00) | 0.7562 (-0.00) | no data |

Dataset: Private 1, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 2m45s, train: 4m50s, total: 7m36s | 0.9096 (0.00) | 0.9612 (0.00) | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 3m38s, train: 4m45s, total: 8m23s | 0.9148 (0.00) | 0.9717 (0.00) | no data |
| Spacy + DIET(bow) + ResponseSelector(bow)<br>test: 45s, train: 3m51s, total: 4m36s | 0.8420 (0.00) | 0.9574 (0.00) | no data |
| Spacy + DIET(seq) + ResponseSelector(t2t)<br>test: 1m32s, train: 4m26s, total: 5m58s | 0.8524 (0.00) | 0.9445 (0.00) | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 40s, train: 4m51s, total: 5m31s | 0.8940 (-0.01) | 0.9612 (0.00) | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 1m18s, train: 4m25s, total: 5m42s | 0.9064 (0.00) | 0.9689 (-0.00) | no data |
| Sparse + Spacy + DIET(bow) + ResponseSelector(bow)<br>test: 53s, train: 5m40s, total: 6m33s | 0.8898 (-0.01) | 0.9574 (0.00) | no data |
| Sparse + Spacy + DIET(seq) + ResponseSelector(t2t)<br>test: 1m41s, train: 5m28s, total: 7m8s | 0.9002 (0.01) | 0.9699 (0.00) | no data |

Dataset: Private 2, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 2m43s, train: 11m55s, total: 14m37s | 0.8745 (0.00) | no data | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 2m54s, train: 6m43s, total: 9m36s | 0.8830 (0.00) | no data | no data |
| Spacy + DIET(bow) + ResponseSelector(bow)<br>test: 45s, train: 6m36s, total: 7m21s | 0.7253 (0.00) | no data | no data |
| Spacy + DIET(seq) + ResponseSelector(t2t)<br>test: 55s, train: 6m53s, total: 7m47s | 0.7833 (-0.00) | no data | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 40s, train: 5m50s, total: 6m30s | 0.8509 (0.00) | no data | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 50s, train: 5m44s, total: 6m34s | 0.8530 (0.00) | no data | no data |
| Sparse + Spacy + DIET(bow) + ResponseSelector(bow)<br>test: 58s, train: 9m17s, total: 10m14s | 0.8562 (-0.01) | no data | no data |
| Sparse + Spacy + DIET(seq) + ResponseSelector(t2t)<br>test: 1m3s, train: 7m42s, total: 8m45s | 0.8519 (0.00) | no data | no data |

Dataset: Private 3, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 1m11s, train: 1m16s, total: 2m26s | 0.9177 (0.00) | no data | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 1m15s, train: 55s, total: 2m10s | 0.8436 (0.00) | no data | no data |
| Spacy + DIET(bow) + ResponseSelector(bow)<br>test: 42s, train: 1m5s, total: 1m47s | 0.6173 (0.00) | no data | no data |
| Spacy + DIET(seq) + ResponseSelector(t2t)<br>test: 50s, train: 54s, total: 1m43s | 0.6255 (0.00) | no data | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 39s, train: 1m21s, total: 2m0s | 0.8683 (0.00) | no data | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 41s, train: 52s, total: 1m32s | 0.8642 (0.00) | no data | no data |
| Sparse + Spacy + DIET(bow) + ResponseSelector(bow)<br>test: 44s, train: 1m32s, total: 2m15s | 0.8477 (0.00) | no data | no data |
| Sparse + Spacy + DIET(seq) + ResponseSelector(t2t)<br>test: 53s, train: 1m9s, total: 2m1s | 0.8601 (0.00) | no data | no data |

Dataset: Sara, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 6m12s, train: 7m19s, total: 13m30s | 0.7169 (-0.00) | 0.7949 (0.00) | 0.7944 (-0.01) |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 7m0s, train: 6m12s, total: 13m12s | 0.7136 (0.00) | 0.7925 (0.00) | 0.7783 (0.00) |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 6m59s, train: 11m40s, total: 18m39s | 0.6933 (-0.00) | 0.7949 (0.00) | 0.8047 (0.01) |
| Sparse + BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 7m51s, train: 9m6s, total: 16m56s | 0.7025 (0.00) | 0.7831 (-0.01) | 0.7860 (0.00) |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 1m59s, train: 8m30s, total: 10m28s | 0.6692 (-0.00) | 0.7949 (0.00) | 0.7907 (-0.01) |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 2m46s, train: 6m31s, total: 9m17s | 0.6803 (0.00) | 0.7692 (0.00) | 0.7922 (0.02) |

| Dialog Policy Configuration | Action Level Micro Avg. F1 | Conversation Level Accuracy | Run Time Train | Run Time Test |
| --- | --- | --- | --- | --- |
| Rules | 0.1266 (0.00) | 0.0000 (0.00) | 5m52s | 1m22s |
| Rules + AugMemo | 0.9211 (0.00) | 0.6644 (0.00) | 6m16s | 1m47s |
| Rules + AugMemo + TED | 0.9710 (0.00) | 0.7534 (0.01) | 50m40s | 3m27s |
| Rules + Memo | 0.3860 (0.00) | 0.1438 (0.00) | 6m3s | 1m31s |
| Rules + Memo + TED | 0.9496 (0.00) | 0.6301 (-0.00) | 54m14s | 3m37s |
| Rules + TED | 0.9483 (0.00) | 0.6370 (0.01) | 52m35s | 3m28s |

Dataset: financial-demo, Dataset repository branch: fix-model-regression-tests (external repository), commit: 64b1086da6f52db86324ccdd9a768fea07be7186, Configuration repository branch: main

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 32s, train: 1m8s, total: 1m39s | 1.0000 (0.00) | 0.8333 (0.00) | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 1m1s, train: 1m18s, total: 2m18s | 1.0000 (0.00) | 0.8333 (0.00) | no data |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 35s, train: 1m16s, total: 1m50s | 1.0000 (0.00) | 0.8333 (0.00) | no data |
| Sparse + BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 1m3s, train: 1m28s, total: 2m31s | 1.0000 (0.00) | 0.8800 (0.00) | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 23s, train: 56s, total: 1m18s | 0.9643 (0.00) | 0.8333 (0.00) | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 52s, train: 1m15s, total: 2m6s | 0.9643 (0.00) | 0.8800 (0.00) | no data |

| Dialog Policy Configuration | Action Level Micro Avg. F1 | Conversation Level Accuracy | Run Time Train | Run Time Test |
| --- | --- | --- | --- | --- |
| Rules | 0.7218 (0.00) | 0.5417 (0.00) | 29s | 12s |
| Rules + AugMemo | 1.0000 (0.00) | 1.0000 (0.00) | 33s | 12s |
| Rules + AugMemo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 8m1s | 40s |
| Rules + Memo | 0.9807 (0.00) | 0.9167 (0.00) | 33s | 13s |
| Rules + Memo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 7m22s | 37s |
| Rules + TED | 0.8856 (0.00) | 0.6875 (0.00) | 7m27s | 36s |

Dataset: helpdesk-assistant, Dataset repository branch: fix-model-regression-tests (external repository), commit: d83f20e76447faf3c6eb31f6a1bc0576f28408e1, Configuration repository branch: main

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 26s, train: 57s, total: 1m22s | 1.0000 (0.00) | no data | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 55s, train: 1m10s, total: 2m4s | 1.0000 (0.00) | no data | no data |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 29s, train: 1m5s, total: 1m33s | 1.0000 (0.00) | no data | no data |
| Sparse + BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 58s, train: 1m17s, total: 2m14s | 1.0000 (0.00) | no data | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 18s, train: 49s, total: 1m8s | 1.0000 (0.00) | no data | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 47s, train: 1m5s, total: 1m51s | 1.0000 (0.00) | no data | no data |

| Dialog Policy Configuration | Action Level Micro Avg. F1 | Conversation Level Accuracy | Run Time Train | Run Time Test |
| --- | --- | --- | --- | --- |
| Rules | 0.5714 (0.00) | 0.2500 (0.00) | 13s | 9s |
| Rules + AugMemo | 1.0000 (0.00) | 1.0000 (0.00) | 15s | 10s |
| Rules + AugMemo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 6m15s | 29s |
| Rules + Memo | 0.9796 (0.00) | 0.9167 (0.00) | 14s | 10s |
| Rules + Memo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 6m56s | 31s |
| Rules + TED | 1.0000 (0.00) | 1.0000 (0.00) | 6m27s | 29s |

Dataset: insurance-demo, Dataset repository branch: fix-model-regression-tests (external repository), commit: 71b6ac6f9854f63e1c30b9c1ec744c9d9abf5a1a, Configuration repository branch: main

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 25s, train: 49s, total: 1m14s | 1.0000 (0.00) | 1.0000 (0.00) | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 50s, train: 1m3s, total: 1m52s | 1.0000 (0.00) | no data | no data |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 25s, train: 50s, total: 1m15s | 1.0000 (0.00) | 1.0000 (0.00) | no data |
| Sparse + BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 55s, train: 1m11s, total: 2m6s | 1.0000 (0.00) | 1.0000 (0.00) | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 21s, train: 43s, total: 1m3s | 1.0000 (0.00) | 1.0000 (0.00) | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 49s, train: 1m1s, total: 1m49s | 1.0000 (0.00) | 1.0000 (0.00) | no data |

| Dialog Policy Configuration | Action Level Micro Avg. F1 | Conversation Level Accuracy | Run Time Train | Run Time Test |
| --- | --- | --- | --- | --- |
| Rules | 0.5909 (0.00) | 0.0000 (0.00) | 17s | 9s |
| Rules + AugMemo | 1.0000 (0.00) | 1.0000 (0.00) | 20s | 10s |
| Rules + AugMemo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 12m57s | 30s |
| Rules + Memo | 0.7600 (0.00) | 0.5000 (0.00) | 18s | 10s |
| Rules + Memo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 13m17s | 31s |
| Rules + TED | 1.0000 (0.00) | 1.0000 (0.00) | 13m3s | 30s |

Dataset: retail-demo, Dataset repository branch: fix-model-regression-tests (external repository), commit: 8226b51b4312aa4d3723098cf6d4028feea040b4, Configuration repository branch: main

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 32s, train: 51s, total: 1m22s | 0.8387 (0.00) | 0.2857 (0.00) | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 58s, train: 1m6s, total: 2m4s | 0.8750 (0.00) | 0.2857 (0.00) | no data |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 34s, train: 1m3s, total: 1m36s | 0.9375 (0.00) | 0.2857 (0.00) | no data |
| Sparse + BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 1m5s, train: 1m29s, total: 2m34s | 0.8125 (0.00) | 0.2857 (0.00) | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 25s, train: 48s, total: 1m12s | 1.0000 (0.00) | 0.2857 (0.00) | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 56s, train: 1m9s, total: 2m5s | 0.9375 (0.00) | 0.2857 (0.00) | no data |

| Dialog Policy Configuration | Action Level Micro Avg. F1 | Conversation Level Accuracy | Run Time Train | Run Time Test |
| --- | --- | --- | --- | --- |
| Rules | 0.9531 (0.00) | 0.7778 (0.00) | 9s | 10s |
| Rules + AugMemo | 0.9692 (0.00) | 0.8889 (0.00) | 10s | 10s |
| Rules + AugMemo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 4m22s | 29s |
| Rules + Memo | 0.9692 (0.00) | 0.8889 (0.00) | 11s | 10s |
| Rules + Memo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 4m36s | 30s |
| Rules + TED | 1.0000 (0.00) | 1.0000 (0.00) | 4m22s | 28s |
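
To pick out the regressions from a report like this one, the score cells can be scanned for negative values in parentheses. The sketch below assumes the parenthesised number is the change relative to an earlier reference run; the report does not state this explicitly. The names `CELL`, `find_drops` and `rows` are hypothetical and not part of the workflow, and the sample rows are copied from the Carbon Bot and Hermit tables.

```python
import re

# Matches a score cell such as "0.5166 (-0.06)": a score followed by a
# parenthesised value (assumed here to be the change from a reference run).
CELL = re.compile(r"(?P<score>\d+\.\d+) \((?P<delta>-?\d+\.\d+)\)")


def find_drops(rows, threshold=0.0):
    """Return (configuration, column index, score, delta) for each cell
    whose delta is strictly below `threshold`. Cells that do not match
    the pattern, such as "no data", are skipped."""
    drops = []
    for config, cells in rows:
        for index, cell in enumerate(cells):
            match = CELL.fullmatch(cell.strip())
            if match is None:
                continue
            delta = float(match.group("delta"))
            if delta < threshold:
                drops.append((config, index, float(match.group("score")), delta))
    return drops


# Sample rows copied from the tables in this report.
rows = [
    ("Sparse + BERT + DIET(seq) + ResponseSelector(t2t)",
     ["0.7806 (0.00)", "0.7880 (0.00)", "0.5430 (-0.02)"]),
    ("Sparse + DIET(bow) + ResponseSelector(bow)",
     ["0.7437 (0.00)", "0.7529 (0.00)", "0.5166 (-0.06)"]),
    ("Sparse + DIET(bow) + ResponseSelector(bow)",
     ["0.8299 (-0.00)", "0.7504 (0.00)", "no data"]),
]

for config, index, score, delta in find_drops(rows):
    print(f"{config}: column {index} dropped by {abs(delta):.2f} to {score:.4f}")
```

A cell shown as `(-0.00)` parses to negative zero, which is not below the default threshold, so it is not reported; pass a small positive `threshold` if such rounded-away decreases should also be listed.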

rasabot, May 23 '22 04:05