rasa Scheduled Model Regression Test Performance Drops

Open rasabot opened this issue 2 years ago • 0 comments

This issue was automatically created by the Scheduled Model Regression Test workflow. Check out the GitHub Actions run here.
---
Description of Problem:
Some test performance scores decreased. Please see the tables below for details.
Dataset: Carbon Bot, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

Configuration	Intent Classification Micro F1	Entity Recognition Micro F1	Response Selection Micro F1
`BERT + DIET(bow) + ResponseSelector(bow)` test: `1m52s`, train: `4m1s`, total: `5m53s`	0.7942 (0.00)	0.7529 (0.00)	0.5382 (0.00)
`BERT + DIET(seq) + ResponseSelector(t2t)` test: `2m56s`, train: `5m17s`, total: `8m13s`	0.8078 (0.00)	0.7787 (0.00)	0.5298 (0.00)
`Sparse + BERT + DIET(bow) + ResponseSelector(bow)` test: `2m9s`, train: `6m26s`, total: `8m35s`	0.7903 (0.01)	0.7529 (0.00)	0.5629 (0.00)
`Sparse + BERT + DIET(seq) + ResponseSelector(t2t)` test: `3m16s`, train: `6m13s`, total: `9m29s`	0.7806 (0.00)	0.7880 (0.00)	0.5430 (-0.02)
`Sparse + DIET(bow) + ResponseSelector(bow)` test: `49s`, train: `3m18s`, total: `4m6s`	0.7437 (0.00)	0.7529 (0.00)	0.5166 (-0.06)
`Sparse + DIET(seq) + ResponseSelector(t2t)` test: `1m51s`, train: `5m13s`, total: `7m4s`	0.7398 (0.00)	0.7022 (0.00)	0.5166 (-0.01)

Dataset: Hermit, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

Configuration	Intent Classification Micro F1	Entity Recognition Micro F1	Response Selection Micro F1
`BERT + DIET(bow) + ResponseSelector(bow)` test: `3m28s`, train: `22m7s`, total: `25m35s`	0.8987 (0.00)	0.7504 (0.00)	`no data`
`Sparse + BERT + DIET(bow) + ResponseSelector(bow)` test: `4m7s`, train: `30m58s`, total: `35m5s`	0.8717 (0.00)	0.7504 (0.00)	`no data`
`Sparse + DIET(bow) + ResponseSelector(bow)` test: `1m16s`, train: `23m18s`, total: `24m33s`	0.8299 (-0.00)	0.7504 (0.00)	`no data`
`Sparse + DIET(seq) + ResponseSelector(t2t)` test: `2m2s`, train: `19m4s`, total: `21m6s`	0.8346 (0.00)	0.7562 (-0.00)	`no data`

Dataset: Private 1, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

Configuration	Intent Classification Micro F1	Entity Recognition Micro F1	Response Selection Micro F1
`BERT + DIET(bow) + ResponseSelector(bow)` test: `2m45s`, train: `4m50s`, total: `7m36s`	0.9096 (0.00)	0.9612 (0.00)	`no data`
`BERT + DIET(seq) + ResponseSelector(t2t)` test: `3m38s`, train: `4m45s`, total: `8m23s`	0.9148 (0.00)	0.9717 (0.00)	`no data`
`Spacy + DIET(bow) + ResponseSelector(bow)` test: `45s`, train: `3m51s`, total: `4m36s`	0.8420 (0.00)	0.9574 (0.00)	`no data`
`Spacy + DIET(seq) + ResponseSelector(t2t)` test: `1m32s`, train: `4m26s`, total: `5m58s`	0.8524 (0.00)	0.9445 (0.00)	`no data`
`Sparse + DIET(bow) + ResponseSelector(bow)` test: `40s`, train: `4m51s`, total: `5m31s`	0.8940 (-0.01)	0.9612 (0.00)	`no data`
`Sparse + DIET(seq) + ResponseSelector(t2t)` test: `1m18s`, train: `4m25s`, total: `5m42s`	0.9064 (0.00)	0.9689 (-0.00)	`no data`
`Sparse + Spacy + DIET(bow) + ResponseSelector(bow)` test: `53s`, train: `5m40s`, total: `6m33s`	0.8898 (-0.01)	0.9574 (0.00)	`no data`
`Sparse + Spacy + DIET(seq) + ResponseSelector(t2t)` test: `1m41s`, train: `5m28s`, total: `7m8s`	0.9002 (0.01)	0.9699 (0.00)	`no data`

Dataset: Private 2, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

Configuration	Intent Classification Micro F1	Entity Recognition Micro F1	Response Selection Micro F1
`BERT + DIET(bow) + ResponseSelector(bow)` test: `2m43s`, train: `11m55s`, total: `14m37s`	0.8745 (0.00)	`no data`	`no data`
`BERT + DIET(seq) + ResponseSelector(t2t)` test: `2m54s`, train: `6m43s`, total: `9m36s`	0.8830 (0.00)	`no data`	`no data`
`Spacy + DIET(bow) + ResponseSelector(bow)` test: `45s`, train: `6m36s`, total: `7m21s`	0.7253 (0.00)	`no data`	`no data`
`Spacy + DIET(seq) + ResponseSelector(t2t)` test: `55s`, train: `6m53s`, total: `7m47s`	0.7833 (-0.00)	`no data`	`no data`
`Sparse + DIET(bow) + ResponseSelector(bow)` test: `40s`, train: `5m50s`, total: `6m30s`	0.8509 (0.00)	`no data`	`no data`
`Sparse + DIET(seq) + ResponseSelector(t2t)` test: `50s`, train: `5m44s`, total: `6m34s`	0.8530 (0.00)	`no data`	`no data`
`Sparse + Spacy + DIET(bow) + ResponseSelector(bow)` test: `58s`, train: `9m17s`, total: `10m14s`	0.8562 (-0.01)	`no data`	`no data`
`Sparse + Spacy + DIET(seq) + ResponseSelector(t2t)` test: `1m3s`, train: `7m42s`, total: `8m45s`	0.8519 (0.00)	`no data`	`no data`

Dataset: Private 3, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

Configuration	Intent Classification Micro F1	Entity Recognition Micro F1	Response Selection Micro F1
`BERT + DIET(bow) + ResponseSelector(bow)` test: `1m11s`, train: `1m16s`, total: `2m26s`	0.9177 (0.00)	`no data`	`no data`
`BERT + DIET(seq) + ResponseSelector(t2t)` test: `1m15s`, train: `55s`, total: `2m10s`	0.8436 (0.00)	`no data`	`no data`
`Spacy + DIET(bow) + ResponseSelector(bow)` test: `42s`, train: `1m5s`, total: `1m47s`	0.6173 (0.00)	`no data`	`no data`
`Spacy + DIET(seq) + ResponseSelector(t2t)` test: `50s`, train: `54s`, total: `1m43s`	0.6255 (0.00)	`no data`	`no data`
`Sparse + DIET(bow) + ResponseSelector(bow)` test: `39s`, train: `1m21s`, total: `2m0s`	0.8683 (0.00)	`no data`	`no data`
`Sparse + DIET(seq) + ResponseSelector(t2t)` test: `41s`, train: `52s`, total: `1m32s`	0.8642 (0.00)	`no data`	`no data`
`Sparse + Spacy + DIET(bow) + ResponseSelector(bow)` test: `44s`, train: `1m32s`, total: `2m15s`	0.8477 (0.00)	`no data`	`no data`
`Sparse + Spacy + DIET(seq) + ResponseSelector(t2t)` test: `53s`, train: `1m9s`, total: `2m1s`	0.8601 (0.00)	`no data`	`no data`

Dataset: Sara, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

Configuration	Intent Classification Micro F1	Entity Recognition Micro F1	Response Selection Micro F1
`BERT + DIET(bow) + ResponseSelector(bow)` test: `6m12s`, train: `7m19s`, total: `13m30s`	0.7169 (-0.00)	0.7949 (0.00)	0.7944 (-0.01)
`BERT + DIET(seq) + ResponseSelector(t2t)` test: `7m0s`, train: `6m12s`, total: `13m12s`	0.7136 (0.00)	0.7925 (0.00)	0.7783 (0.00)
`Sparse + BERT + DIET(bow) + ResponseSelector(bow)` test: `6m59s`, train: `11m40s`, total: `18m39s`	0.6933 (-0.00)	0.7949 (0.00)	0.8047 (0.01)
`Sparse + BERT + DIET(seq) + ResponseSelector(t2t)` test: `7m51s`, train: `9m6s`, total: `16m56s`	0.7025 (0.00)	0.7831 (-0.01)	0.7860 (0.00)
`Sparse + DIET(bow) + ResponseSelector(bow)` test: `1m59s`, train: `8m30s`, total: `10m28s`	0.6692 (-0.00)	0.7949 (0.00)	0.7907 (-0.01)
`Sparse + DIET(seq) + ResponseSelector(t2t)` test: `2m46s`, train: `6m31s`, total: `9m17s`	0.6803 (0.00)	0.7692 (0.00)	0.7922 (0.02)

Dialog Policy Configuration	Action Level Micro Avg. F1	Conversation Level Accuracy	Run Time Train	Run Time Test
`Rules`	0.1266 (0.00)	0.0000 (0.00)	`5m52s`	`1m22s`
`Rules + AugMemo`	0.9211 (0.00)	0.6644 (0.00)	`6m16s`	`1m47s`
`Rules + AugMemo + TED`	0.9710 (0.00)	0.7534 (0.01)	`50m40s`	`3m27s`
`Rules + Memo`	0.3860 (0.00)	0.1438 (0.00)	`6m3s`	`1m31s`
`Rules + Memo + TED`	0.9496 (0.00)	0.6301 (-0.00)	`54m14s`	`3m37s`
`Rules + TED`	0.9483 (0.00)	0.6370 (0.01)	`52m35s`	`3m28s`

Dataset: financial-demo, Dataset repository branch: fix-model-regression-tests (external repository), commit: 64b1086da6f52db86324ccdd9a768fea07be7186, Configuration repository branch: main

Configuration	Intent Classification Micro F1	Entity Recognition Micro F1	Response Selection Micro F1
`BERT + DIET(bow) + ResponseSelector(bow)` test: `32s`, train: `1m8s`, total: `1m39s`	1.0000 (0.00)	0.8333 (0.00)	`no data`
`BERT + DIET(seq) + ResponseSelector(t2t)` test: `1m1s`, train: `1m18s`, total: `2m18s`	1.0000 (0.00)	0.8333 (0.00)	`no data`
`Sparse + BERT + DIET(bow) + ResponseSelector(bow)` test: `35s`, train: `1m16s`, total: `1m50s`	1.0000 (0.00)	0.8333 (0.00)	`no data`
`Sparse + BERT + DIET(seq) + ResponseSelector(t2t)` test: `1m3s`, train: `1m28s`, total: `2m31s`	1.0000 (0.00)	0.8800 (0.00)	`no data`
`Sparse + DIET(bow) + ResponseSelector(bow)` test: `23s`, train: `56s`, total: `1m18s`	0.9643 (0.00)	0.8333 (0.00)	`no data`
`Sparse + DIET(seq) + ResponseSelector(t2t)` test: `52s`, train: `1m15s`, total: `2m6s`	0.9643 (0.00)	0.8800 (0.00)	`no data`

Dialog Policy Configuration	Action Level Micro Avg. F1	Conversation Level Accuracy	Run Time Train	Run Time Test
`Rules`	0.7218 (0.00)	0.5417 (0.00)	`29s`	`12s`
`Rules + AugMemo`	1.0000 (0.00)	1.0000 (0.00)	`33s`	`12s`
`Rules + AugMemo + TED`	1.0000 (0.00)	1.0000 (0.00)	`8m1s`	`40s`
`Rules + Memo`	0.9807 (0.00)	0.9167 (0.00)	`33s`	`13s`
`Rules + Memo + TED`	1.0000 (0.00)	1.0000 (0.00)	`7m22s`	`37s`
`Rules + TED`	0.8856 (0.00)	0.6875 (0.00)	`7m27s`	`36s`

Dataset: helpdesk-assistant, Dataset repository branch: fix-model-regression-tests (external repository), commit: d83f20e76447faf3c6eb31f6a1bc0576f28408e1, Configuration repository branch: main

Configuration	Intent Classification Micro F1	Entity Recognition Micro F1	Response Selection Micro F1
`BERT + DIET(bow) + ResponseSelector(bow)` test: `26s`, train: `57s`, total: `1m22s`	1.0000 (0.00)	`no data`	`no data`
`BERT + DIET(seq) + ResponseSelector(t2t)` test: `55s`, train: `1m10s`, total: `2m4s`	1.0000 (0.00)	`no data`	`no data`
`Sparse + BERT + DIET(bow) + ResponseSelector(bow)` test: `29s`, train: `1m5s`, total: `1m33s`	1.0000 (0.00)	`no data`	`no data`
`Sparse + BERT + DIET(seq) + ResponseSelector(t2t)` test: `58s`, train: `1m17s`, total: `2m14s`	1.0000 (0.00)	`no data`	`no data`
`Sparse + DIET(bow) + ResponseSelector(bow)` test: `18s`, train: `49s`, total: `1m8s`	1.0000 (0.00)	`no data`	`no data`
`Sparse + DIET(seq) + ResponseSelector(t2t)` test: `47s`, train: `1m5s`, total: `1m51s`	1.0000 (0.00)	`no data`	`no data`

Dialog Policy Configuration	Action Level Micro Avg. F1	Conversation Level Accuracy	Run Time Train	Run Time Test
`Rules`	0.5714 (0.00)	0.2500 (0.00)	`13s`	`9s`
`Rules + AugMemo`	1.0000 (0.00)	1.0000 (0.00)	`15s`	`10s`
`Rules + AugMemo + TED`	1.0000 (0.00)	1.0000 (0.00)	`6m15s`	`29s`
`Rules + Memo`	0.9796 (0.00)	0.9167 (0.00)	`14s`	`10s`
`Rules + Memo + TED`	1.0000 (0.00)	1.0000 (0.00)	`6m56s`	`31s`
`Rules + TED`	1.0000 (0.00)	1.0000 (0.00)	`6m27s`	`29s`

Dataset: insurance-demo, Dataset repository branch: fix-model-regression-tests (external repository), commit: 71b6ac6f9854f63e1c30b9c1ec744c9d9abf5a1a, Configuration repository branch: main

Configuration	Intent Classification Micro F1	Entity Recognition Micro F1	Response Selection Micro F1
`BERT + DIET(bow) + ResponseSelector(bow)` test: `25s`, train: `49s`, total: `1m14s`	1.0000 (0.00)	1.0000 (0.00)	`no data`
`BERT + DIET(seq) + ResponseSelector(t2t)` test: `50s`, train: `1m3s`, total: `1m52s`	1.0000 (0.00)	`no data`	`no data`
`Sparse + BERT + DIET(bow) + ResponseSelector(bow)` test: `25s`, train: `50s`, total: `1m15s`	1.0000 (0.00)	1.0000 (0.00)	`no data`
`Sparse + BERT + DIET(seq) + ResponseSelector(t2t)` test: `55s`, train: `1m11s`, total: `2m6s`	1.0000 (0.00)	1.0000 (0.00)	`no data`
`Sparse + DIET(bow) + ResponseSelector(bow)` test: `21s`, train: `43s`, total: `1m3s`	1.0000 (0.00)	1.0000 (0.00)	`no data`
`Sparse + DIET(seq) + ResponseSelector(t2t)` test: `49s`, train: `1m1s`, total: `1m49s`	1.0000 (0.00)	1.0000 (0.00)	`no data`

Dialog Policy Configuration	Action Level Micro Avg. F1	Conversation Level Accuracy	Run Time Train	Run Time Test
`Rules`	0.5909 (0.00)	0.0000 (0.00)	`17s`	`9s`
`Rules + AugMemo`	1.0000 (0.00)	1.0000 (0.00)	`20s`	`10s`
`Rules + AugMemo + TED`	1.0000 (0.00)	1.0000 (0.00)	`12m57s`	`30s`
`Rules + Memo`	0.7600 (0.00)	0.5000 (0.00)	`18s`	`10s`
`Rules + Memo + TED`	1.0000 (0.00)	1.0000 (0.00)	`13m17s`	`31s`
`Rules + TED`	1.0000 (0.00)	1.0000 (0.00)	`13m3s`	`30s`

Dataset: retail-demo, Dataset repository branch: fix-model-regression-tests (external repository), commit: 8226b51b4312aa4d3723098cf6d4028feea040b4, Configuration repository branch: main

Configuration	Intent Classification Micro F1	Entity Recognition Micro F1	Response Selection Micro F1
`BERT + DIET(bow) + ResponseSelector(bow)` test: `32s`, train: `51s`, total: `1m22s`	0.8387 (0.00)	0.2857 (0.00)	`no data`
`BERT + DIET(seq) + ResponseSelector(t2t)` test: `58s`, train: `1m6s`, total: `2m4s`	0.8750 (0.00)	0.2857 (0.00)	`no data`
`Sparse + BERT + DIET(bow) + ResponseSelector(bow)` test: `34s`, train: `1m3s`, total: `1m36s`	0.9375 (0.00)	0.2857 (0.00)	`no data`
`Sparse + BERT + DIET(seq) + ResponseSelector(t2t)` test: `1m5s`, train: `1m29s`, total: `2m34s`	0.8125 (0.00)	0.2857 (0.00)	`no data`
`Sparse + DIET(bow) + ResponseSelector(bow)` test: `25s`, train: `48s`, total: `1m12s`	1.0000 (0.00)	0.2857 (0.00)	`no data`
`Sparse + DIET(seq) + ResponseSelector(t2t)` test: `56s`, train: `1m9s`, total: `2m5s`	0.9375 (0.00)	0.2857 (0.00)	`no data`

Dialog Policy Configuration	Action Level Micro Avg. F1	Conversation Level Accuracy	Run Time Train	Run Time Test
`Rules`	0.9531 (0.00)	0.7778 (0.00)	`9s`	`10s`
`Rules + AugMemo`	0.9692 (0.00)	0.8889 (0.00)	`10s`	`10s`
`Rules + AugMemo + TED`	1.0000 (0.00)	1.0000 (0.00)	`4m22s`	`29s`
`Rules + Memo`	0.9692 (0.00)	0.8889 (0.00)	`11s`	`10s`
`Rules + Memo + TED`	1.0000 (0.00)	1.0000 (0.00)	`4m36s`	`30s`
`Rules + TED`	1.0000 (0.00)	1.0000 (0.00)	`4m22s`	`28s`
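
For triaging a report like this one, a small script can pick out the cells that regressed. The following is a minimal sketch, not part of the workflow itself. It assumes each metric cell has the form `score (delta)`, that the parenthesized value is the change relative to the previous run, and that columns are tab-separated as in the tables above. The names `parse_cell` and `find_drops` are hypothetical.

```python
import re

# Assumed cell format: "<score> (<delta>)", e.g. "0.5166 (-0.06)".
CELL = re.compile(r"^\s*(\d+\.\d+)\s*\((-?\d+\.\d+)\)\s*$")


def parse_cell(cell):
    """Return (score, delta) as floats, or None for cells such as `no data`."""
    match = CELL.match(cell)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def find_drops(row):
    """Return (column_index, score, delta) for every cell with a negative delta.

    Column 0 is the configuration name, so metric columns start at index 1.
    A delta printed as "-0.00" parses to -0.0, which is not less than zero,
    so drops that round to zero are not reported.
    """
    drops = []
    fields = row.split("\t")
    for index, cell in enumerate(fields[1:], start=1):
        parsed = parse_cell(cell)
        if parsed is not None and parsed[1] < 0:
            drops.append((index, parsed[0], parsed[1]))
    return drops


if __name__ == "__main__":
    row = (
        "`Sparse + DIET(bow) + ResponseSelector(bow)` test: `49s`, "
        "train: `3m18s`, total: `4m6s`\t0.7437 (0.00)\t0.7529 (0.00)\t0.5166 (-0.06)"
    )
    print(find_drops(row))
```

Applied to the Carbon Bot `Sparse + DIET(bow) + ResponseSelector(bow)` row, this reports only the third metric column (Response Selection Micro F1), whose delta is -0.06.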

May 23 '22 04:05 rasabot
