Pulled #8357 forward to current main
Proposed changes:
- pulling #8357 forward to current main
- solving memory problems with GPU and e2e training
Hey @dakshvar22, I think you could review the current status now. I solved the feature length problem by calculating the IDs ad hoc and passing on some of the labelling config as an instance variable of the TED model.
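To make the idea a bit more concrete, here is a purely hypothetical sketch (class and attribute names are made up, this is not the actual PR code) of keeping the label config on the model and computing label IDs ad hoc per batch instead of materialising features for every label up front:

```python
import numpy as np


class TEDModelSketch:
    """Illustrative stand-in for the TED model, not the real implementation."""

    def __init__(self, label_config: dict):
        # parts of the labelling config are handed over as an instance variable ...
        self.label_config = label_config

    def sample_label_ids(self, label_batch_size: int) -> np.ndarray:
        # ... so candidate label IDs can be computed ad hoc per batch
        return np.random.choice(
            self.label_config["num_labels"], size=label_batch_size, replace=False
        )


model = TEDModelSketch(label_config={"num_labels": 3400})
print(model.sample_label_ids(64))
```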
As I said, the trainings overall look good in the sense that they run through. At the same time, I am thinking there might still be something wrong with training or inference. When training and testing on 6 percent of the data, I am getting the following results from rasa test core -s stories_6p.yml:
2021-06-06 15:15:38 INFO rasa.core.test - Finished collecting predictions.
2021-06-06 15:15:38 INFO rasa.core.test - Evaluation Results on CONVERSATION level:
2021-06-06 15:15:38 INFO rasa.core.test - Correct: 0 / 498
2021-06-06 15:15:38 INFO rasa.core.test - Accuracy: 0.000
2021-06-06 15:15:40 INFO rasa.core.test - Stories report saved to results/story_report.json.
2021-06-06 15:15:40 INFO rasa.nlu.test - Evaluation for entity extractor: TEDPolicy
2021-06-06 15:15:40 WARNING rasa.model_testing - No labels to evaluate. Skip evaluation.
2021-06-06 15:15:40 INFO rasa.nlu.test - Classification report saved to results/TEDPolicy_report.json.
2021-06-06 15:15:40 INFO rasa.nlu.test - Incorrect entity predictions saved to results/TEDPolicy_errors.json.
2021-06-06 15:15:42 INFO rasa.utils.plotting - Confusion matrix, without normalization:
[[ 0 0 0 0 0 0]
[ 0 0 0 0 0 0]
[ 2 1 82 1 2 2]
[ 0 0 0 0 0 0]
[ 0 0 0 0 0 0]
[ 0 0 0 0 0 0]]
2021-06-06 15:15:43 INFO rasa.core.test - Evaluation Results on ACTION level:
2021-06-06 15:15:43 INFO rasa.core.test - Correct: 3426 / 6852
2021-06-06 15:15:43 INFO rasa.core.test - F1-Score: 0.500
2021-06-06 15:15:43 INFO rasa.core.test - Precision: 0.500
2021-06-06 15:15:43 INFO rasa.core.test - Accuracy: 0.500
2021-06-06 15:15:43 INFO rasa.core.test - In-data fraction: 0
2021-06-06 15:15:54 INFO rasa.utils.plotting - Confusion matrix, without normalization:
[[1 0 0 ... 0 0 0]
[1 0 0 ... 0 0 0]
[1 0 0 ... 0 0 0]
...
[1 0 0 ... 0 0 0]
[1 0 0 ... 0 0 0]
[1 0 0 ... 0 0 0]]
Looks like it's just predicting a single action or so? I saw the same when I trained another model on 12 percent of the data and tested it on 6 percent, where the 6 percent was a subset of the 12 percent dataset. If I remember correctly, you had voiced doubts about e2e compatibility with rasa test, so maybe that is why we are seeing this? Also, it seems that creating the confusion matrix image absolutely trashes the machine's resources with thousands of labels in e2e. rasa test doesn't really stop after this log output and is difficult to cancel.
I am wondering what I could do to further verify that the solution is working. Another thing that caught my attention is that the loss during training stays relatively high, while accuracy is already 1.0:
Epochs: 100%|█████████████████████████| 20/20 [07:38<00:00, 22.95s/it, t_loss=32.5, loss=30.6, acc=1, e_loss=0.638, e_f1=0.959]
The final loss didn't seem much different when changing label_batch_size from 64 to 256.
One test I plan to do today is running e2e on main with the small dataset that works (about 6 percent of MultiWOZ) and seeing whether training loss and rasa test behave similarly.
Any further thoughts or ideas? :)
I am thinking it is potentially also a matter of the label batch size. Training with 6 percent of the data works on main, and the loss goes down there. Given that there are 3400 bot utterances in the 6 percent dataset, I might just try a higher label batch size than the 64 and 256 I have used so far.
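For context, a quick back-of-the-envelope calculation (my own, based on the numbers above) of how much of the 3400 bot utterances a single candidate batch covers:

```python
# Rough coverage of the label space per batch of negative candidates.
num_bot_utterances = 3400

for label_batch_size in (64, 256):
    coverage = label_batch_size / num_bot_utterances
    print(f"label_batch_size={label_batch_size}: {coverage:.1%} of labels per batch")

# 64  -> ~1.9% of the labels per batch
# 256 -> ~7.5% of the labels per batch
```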
I have found a bug: there was a mistake in the indentation from copying and pasting your code, so it would only fill a single embedding instead of all sampled embeddings. Most of them were 0 😅
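To illustrate the kind of bug this was (a toy reconstruction, not the actual PR code): with the assignment accidentally dedented out of the loop, only the last sampled embedding gets filled and the remaining rows stay zero.

```python
import numpy as np

all_label_embeddings = np.random.rand(100, 20)  # toy embedding table
sampled_ids = np.array([3, 17, 42, 99])

# Buggy indentation: the copy sits outside the loop, so only the embedding of
# the last sampled id is filled in and all other rows remain zero.
buggy = np.zeros((len(sampled_ids), 20))
for i, label_id in enumerate(sampled_ids):
    pass  # ... other per-label work ...
buggy[i] = all_label_embeddings[label_id]

# Fixed indentation: every sampled embedding is filled.
fixed = np.zeros((len(sampled_ids), 20))
for i, label_id in enumerate(sampled_ids):
    fixed[i] = all_label_embeddings[label_id]

print((buggy == 0).all(axis=1))  # [ True  True  True False]
print((fixed == 0).all(axis=1))  # [False False False False]
```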
Ok, loss is looking good now during training! rasa test results look very similar to main.
One thing that is still at the back of my mind and might be worth addressing: we make sure to sample negative candidates according to the proper distribution of label name and label text. However, once we hand those candidates off to the DotProductLossLayer (DPLL), there is no guarantee that it will keep that ratio. This effect is particularly strong if we have many labels, a large label batch size, and a relatively low num_negative for the loss. On top of that, the DPLL does not sample uniquely, so it might sample the same labels multiple times. So even if label_batch_size == num_negative, it is not a given that the ratio will be kept. I am not sure this is a huge problem, but it seems to run against the effort we are making in the TED model to sample with a certain ratio.
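A tiny simulation of the effect (my own illustration, not code from the PR): once the loss draws its negatives with replacement from the candidate batch, the realised name/text ratio can drift per draw and the same candidate can show up twice.

```python
import random
from collections import Counter

random.seed(0)

# Candidate batch sampled upstream with an intended 70/30 name/text ratio.
label_batch = ["name"] * 70 + ["text"] * 30

# Drawing num_negative candidates with replacement: the name/text counts
# differ from draw to draw and duplicates are possible.
num_negative = 10
for _ in range(3):
    drawn = [random.choice(label_batch) for _ in range(num_negative)]
    print(Counter(drawn))
```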
> On top of that, the DPLL does not sample uniquely, so it might sample the same labels multiple times.

Maybe better to replace the sampling inside the loss with uniform_candidate_sampler as well?
Yeah, I think that would be a good idea in general, if we can get the necessary parameters in there. The sampler needs some plain Python integers, which could cause problems if you want to determine them dynamically from tensors and without eager execution.
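For reference, a minimal standalone example of tf.random.uniform_candidate_sampler (not wired into the DPLL, just the bare call): num_sampled and range_max have to be static Python ints, which is exactly the constraint mentioned above, while unique=True gives sampling without replacement.

```python
import tensorflow as tf

# IDs of the "true" labels for two examples; the sampler expects int64 of
# shape [batch_size, num_true].
true_classes = tf.constant([[3], [17]], dtype=tf.int64)

sampled, true_expected, sampled_expected = tf.random.uniform_candidate_sampler(
    true_classes=true_classes,
    num_true=1,
    num_sampled=5,   # must be a static Python int
    unique=True,     # no duplicates among the sampled candidates
    range_max=100,   # must be a static Python int: size of the label space
)
print(sampled)  # 5 distinct candidate ids from [0, 100)
```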
And then I am wondering whether we should actually drop the label_batch_size parameter. Wouldn't num_negative be enough on its own? I guess the only advantage of having two parameters is that setting label_batch_size to -1 effectively deactivates this part of the algorithm, which wouldn't be possible anymore if we dropped it. But maybe a boolean parameter that controls whether negative labels are sampled in a balanced way could serve the same purpose.
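Just to make the two alternatives concrete (a sketch only; the boolean parameter name is made up):

```python
# Option 1: keep both knobs; -1 on label_batch_size disables balanced
# candidate sampling.
config_with_size = {
    "number_of_negative_examples": 20,
    "label_batch_size": 64,
}

# Option 2: drop label_batch_size and control the behaviour with a boolean.
config_with_flag = {
    "number_of_negative_examples": 20,
    "sample_negative_labels_balanced": True,  # hypothetical name
}
```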
@dakshvar22 what do you think?
This PR has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.