
Pulled #8357 forward to current main

twerkmeister opened this pull request · 8 comments

Proposed changes:

  • pulling #8357 forward to current main
  • solving memory problems with GPU and e2e training

Status (please check what you already did):

  • [ ] added some tests for the functionality
  • [ ] updated the documentation
  • [ ] updated the changelog (please check changelog for instructions)
  • [ ] reformat files using black (please check Readme for instructions)

twerkmeister · May 26 '21 08:05

Hey @dakshvar22 I think you could review the current status now. I solved the feature length problem by calculating the IDs ad hoc and passing on some of the labelling config as an instance variable for the TED model.
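
For illustration, a rough sketch of the "ad hoc IDs" idea with entirely made-up names (LabelIdResolver and label_config are not the actual TED code): keep the relevant labelling config on the model instance and derive IDs only for the labels needed in a batch, instead of precomputing them for every label.

import numpy as np

# Hypothetical sketch only; names do not correspond to the real TED model.
class LabelIdResolver:
    def __init__(self, label_config: dict):
        # the labelling config travels with the model as an instance variable
        self.label_config = label_config

    def ids_for_batch(self, label_names: list) -> np.ndarray:
        # compute the IDs ad hoc, only for the labels used in this batch
        mapping = self.label_config["name_to_id"]
        return np.array([mapping[name] for name in label_names], dtype=np.int64)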

As I said, the trainings overall look good in the sense that they run through. At the same time, I suspect there might still be something wrong with training or inference. When training and testing on 6 percent of the data, I am getting the following results from rasa test core -s stories_6p.yml:

2021-06-06 15:15:38 INFO     rasa.core.test  - Finished collecting predictions.
2021-06-06 15:15:38 INFO     rasa.core.test  - Evaluation Results on CONVERSATION level:
2021-06-06 15:15:38 INFO     rasa.core.test  -  Correct:          0 / 498
2021-06-06 15:15:38 INFO     rasa.core.test  -  Accuracy:         0.000
2021-06-06 15:15:40 INFO     rasa.core.test  - Stories report saved to results/story_report.json.
2021-06-06 15:15:40 INFO     rasa.nlu.test  - Evaluation for entity extractor: TEDPolicy
2021-06-06 15:15:40 WARNING  rasa.model_testing  - No labels to evaluate. Skip evaluation.
2021-06-06 15:15:40 INFO     rasa.nlu.test  - Classification report saved to results/TEDPolicy_report.json.
2021-06-06 15:15:40 INFO     rasa.nlu.test  - Incorrect entity predictions saved to results/TEDPolicy_errors.json.
2021-06-06 15:15:42 INFO     rasa.utils.plotting  - Confusion matrix, without normalization:
[[ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 2  1 82  1  2  2]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]]
2021-06-06 15:15:43 INFO     rasa.core.test  - Evaluation Results on ACTION level:
2021-06-06 15:15:43 INFO     rasa.core.test  -  Correct:          3426 / 6852
2021-06-06 15:15:43 INFO     rasa.core.test  -  F1-Score:         0.500
2021-06-06 15:15:43 INFO     rasa.core.test  -  Precision:        0.500
2021-06-06 15:15:43 INFO     rasa.core.test  -  Accuracy:         0.500
2021-06-06 15:15:43 INFO     rasa.core.test  -  In-data fraction: 0
2021-06-06 15:15:54 INFO     rasa.utils.plotting  - Confusion matrix, without normalization:
[[1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 ...
 [1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]]

Looks like it's just predicting a single action or so? I saw the same when I trained another model on 12 percent of the data and tested it on 6 percent, where the 6 percent were a subset of the 12 percent dataset. If I remember correctly, you had voiced doubts about e2e compatibility with rasa test, so maybe that is why we are seeing this? Also, it seems that creating the confusion matrix image absolutely trashes the machine's resources with thousands of labels in e2e; rasa test doesn't really finish and is difficult to cancel after this log.


I am wondering what else I could do to further verify that the solution is working. Another thing that caught my attention is that the loss during training stays relatively high while accuracy is already 1.0.

Epochs: 100%|█████████████████████████| 20/20 [07:38<00:00, 22.95s/it, t_loss=32.5, loss=30.6, acc=1, e_loss=0.638, e_f1=0.959]

The final loss didn't seem much different when changing label_batch_size from 64 to 256.

One test I plan to do today is running e2e on main for the small dataset that works (about 6 percent of multiwoz) and seeing whether training loss and rasa test behave similarly.

Any further thoughts or ideas? :)

twerkmeister · Jun 07 '21 06:06

I am thinking it could also be a matter of the label batch size. Training with 6 percent of the data works on main, and the loss is going down there. Given that there are 3400 bot utterances in the 6 percent dataset, I might just try a higher label batch size than the 64 and 256 I have tried so far.

twerkmeister · Jun 07 '21 08:06

Found a bug: there was an indentation mistake from copy-pasting your code, so it would only fill a single embedding instead of all sampled embeddings. Most of them were therefore 0 😅
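
For illustration, a toy reconstruction of that class of bug with made-up names (embed_label and the arrays are not the real code); the point is just what happens when the assignment slips out of the loop:

import numpy as np

rng = np.random.default_rng(0)
sampled_label_ids = [3, 7, 11, 42]

def embed_label(label_id):
    # stand-in for the real embedding computation
    return rng.standard_normal(4)

# buggy: the assignment ended up outside the loop, so only the last sampled
# label is ever written and all other rows stay 0
buggy = np.zeros((len(sampled_label_ids), 4))
for i, label_id in enumerate(sampled_label_ids):
    embedding = embed_label(label_id)
buggy[i] = embedding

# fixed: the assignment belongs inside the loop, so every sampled label
# gets its embedding
fixed = np.zeros((len(sampled_label_ids), 4))
for i, label_id in enumerate(sampled_label_ids):
    fixed[i] = embed_label(label_id)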

twerkmeister · Jun 07 '21 13:06

OK, the loss is looking good now during training! rasa test results look very similar to main.

twerkmeister · Jun 07 '21 14:06

One thing that is still at the back of my mind and might be worth addressing: we make sure to sample negative candidates according to the proper distribution of label name and label text. However, once we hand those candidates off to the DotProductLossLayer (DPLL), there is no guarantee that it keeps that ratio as well. The effect is particularly strong if we have many labels, a large label batch size, and a relatively low num_neg for the loss. On top of that, the DPLL does not sample uniquely, so it might sample the same labels multiple times. So even if label_batch_size == num_neg, it is not a given that the ratio will be kept. I am not sure this is a huge problem, but it runs against the effort we are making in the TED model to sample with a certain ratio.
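
A toy sketch of that effect outside the actual loss code (the 25 percent name ratio and the sizes are arbitrary assumptions): the candidate pool is built with a fixed label-name / label-text ratio, but a second with-replacement draw of num_neg candidates inside the loss neither preserves that ratio per batch nor avoids duplicates.

import numpy as np

rng = np.random.default_rng(0)

label_batch_size = 64
name_ratio = 0.25  # assume 1 in 4 candidates in the pool is a label-name candidate
pool_is_name = rng.random(label_batch_size) < name_ratio

num_neg = 10
for trial in range(3):
    # what a with-replacement sampler inside the loss effectively does
    picked = rng.integers(0, label_batch_size, size=num_neg)
    frac_names = pool_is_name[picked].mean()
    print(f"trial {trial}: fraction of name labels among negatives = {frac_names:.2f}")

The printed fractions scatter around 0.25 from batch to batch, and picked can contain the same index twice, so the balanced pool ratio is not guaranteed to survive the second sampling step.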

twerkmeister · Jun 08 '21 12:06

On top of that, the DPLL does not sample uniquely, so it might sample the same labels multiple times.

Maybe better to replace the sampling inside the loss with uniform_candidate_sampler as well?

dakshvar22 · Jun 08 '21 21:06

On top of that, the DPLL does not sample uniquely, so it might sample the same labels multiple times.

Maybe better to replace the sampling inside the loss with uniform_candidate_sampler as well?

Yeah, I think that would be a good idea in general, if we can get the necessary parameters in there. The sampler needs some plain integer arguments, which could cause problems if you want to derive them dynamically from tensors without eager execution.
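
To make the constraint concrete, a minimal sketch of calling tf.random.uniform_candidate_sampler (not wired into the loss; the batch of true IDs and the range of 3400 labels are just placeholder numbers, borrowing the label count of the 6 percent dataset above): num_sampled and range_max are attributes of the sampler op, so they have to be plain Python ints known at graph-construction time.

import tensorflow as tf

true_label_ids = tf.constant([[3], [17], [42]], dtype=tf.int64)  # shape [batch, num_true]

result = tf.random.uniform_candidate_sampler(
    true_classes=true_label_ids,
    num_true=1,
    num_sampled=10,   # would play the role of num_neg; must be a static int
    unique=True,      # no duplicate negatives, unlike the current DPLL sampling
    range_max=3400,   # total number of candidate labels; also a static int
)
# result.sampled_candidates holds 10 distinct candidate label IDs

As far as I know the sampler does not exclude the true classes from the candidates by itself, so collisions with the positive label would still need to be handled.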

And then I am wondering whether we should actually drop the label_batch_size parameter. Wouldn't num_negative be enough on its own? I guess the only advantage of having two parameters is that setting label_batch_size to -1 effectively deactivates this part of the algorithm, which you couldn't do anymore if you dropped it. But maybe a boolean parameter for sampling negative labels in a balanced way would work instead.

@dakshvar22 what do you think?

twerkmeister · Jul 05 '21 05:07

This PR has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] · Apr 16 '22 07:04