
[wip] update for probabilistic label tree reduction (#2766)

Open · mwydmuch opened this pull request 2 years ago • 4 comments

Hello!

This PR aims to add support for probability output for the PLT reduction, and to fix the reduction in version 9+.

So far, I added support for the --probabilities option, which makes the reduction output (label_index, probability) action_score pairs. I also added an additional dataset to the demo that only needs a few seconds to run.
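For context, a minimal sketch of the intended usage (the label count and file names here are hypothetical, not the actual demo files):

```sh
# Train a PLT model over 1000 labels; logistic loss is needed for the
# node probability estimates, and --probabilities makes the reduction
# emit (label_index, probability) action_score pairs.
vw --plt 1000 --loss_function logistic --probabilities -d train.txt -f plt_model.vw

# Write per-example predictions to preds.txt.
vw -i plt_model.vw -t -d test.txt -p preds.txt
```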

Unfortunately, while implementing support for probability output, I noticed that the PLT reduction stopped working correctly between the 8.11.0 and 9.0.0 releases. You can run the new demo I added on v8.11.0 and v9.0.0: on 8.11.0 it performs well, while on 9+ it returns an almost constant prediction. In the past there was a performance test for this reduction in the test dir, but it seems to have been removed before the 9.0.0 release, which is probably why this went unnoticed. So this needs to be fixed. The release notes for 9.0.0 do not mention anything that could explain it. Were there any modifications to single_learner (learner<char, example>) that could change its behavior? Any pointers?

mwydmuch avatar Sep 02 '22 10:09 mwydmuch

Thanks so much for this!

I am not aware of any PLT tests being removed; the testing infrastructure changed, though. I see a number of PLT tests in the main test file (https://github.com/VowpalWabbit/vowpal_wabbit/blob/802ce7ba36e363ffbacbc90a5f1c95f359c91f11/test/core.vwtest.json#L2802). Is that what you are referring to, or do you mean different tests?

I am not aware of any changes that came in that time frame that would break things. I will spend some time today comparing 8.11.0 and 9.0.0 with the new demo you added and let you know what I find.

EDIT: This bug, fixed in 9.3.0, definitely broke PLT for a few versions. Does 9.3.0 behave more as expected? Unfortunately, the tests that were in place did not catch it.

jackgerrits avatar Sep 02 '22 13:09 jackgerrits

@jackgerrits My apologies, I tried to search for that test but somehow didn't find it initially; thank you for linking it. The test trains the model with the --sgd option (probably added for speed), while the demos run without --sgd, since the default adaptive optimizer usually gives better predictive performance. My quick test shows that with --sgd it works in both v8 and v9, but with adaptive updates the results are good in v8 and not in v9 (I'm using the current master), so something probably changed in the behavior of the adaptive updates between versions. I will try to investigate further later.
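Concretely, the two configurations I compared look roughly like this (dataset and label count are hypothetical; only the optimizer differs):

```sh
# Plain SGD: works in both v8.11.0 and v9.x in my quick test.
vw --plt 1000 --loss_function logistic --sgd -d train.txt

# Default adaptive updates (no --sgd): good results in v8,
# near-constant predictions in v9.
vw --plt 1000 --loss_function logistic -d train.txt
```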

mwydmuch avatar Sep 02 '22 15:09 mwydmuch

The main difference between 8 and 9 is save_resume becoming the default. To go back to the pre-9 behavior, --predict_only_model can be used when saving and loading models. Perhaps that is the difference?
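A minimal sketch of that workaround (model and data names hypothetical):

```sh
# Saving with --predict_only_model omits the save_resume state that
# became the default in 9.0, restoring the pre-9 save/load behavior.
vw --plt 1000 --loss_function logistic -d train.txt --predict_only_model -f model.vw

# Load the predict-only model as usual.
vw -i model.vw -t -d test.txt -p preds.txt
```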

jackgerrits avatar Sep 02 '22 15:09 jackgerrits

This seems to be it! With --predict_only_model, it works correctly. Ok, thank you, @jackgerrits, I will try to fix it.

mwydmuch avatar Sep 02 '22 15:09 mwydmuch

New changes to the PLT reduction:

  1. With the --probabilities flag, the plt reduction now outputs probabilities of the predicted labels, similar to the multilabel_oaa reduction, but the output is not sorted by label indices.

Example of multilabel_oaa output with the --probabilities flag for a single example:

0:0.518594 1:0.182738 2:0.00778663 3:0.423571 4:0.272818 5:0.304554 6:0.0889014 7:0.386894 8:0.169909 9:0.36006 10:0.499721 11:0.45407 12:0.234431 13:0.0984756 14:0.114563 15:0.237755 16:0.179683 17:0.468666 18:0.0964561 19:0.139548 20:0.231302 21:0.486725 22:0.121387 23:0.298874 24:0.158763 25:0.145428 26:0.452909 27:0.266938 28:0.400596 29:0.433964 30:0.0428393 31:0.101734 32:0.576381 33:0.182494 34:0.0601204 35:0.43837 36:0.600618 37:0.106431 38:0.0861215 39:0.134564 40:0.148813 41:0.272562 42:0.437119 43:0.126408 44:0.252151 45:0.328746 46:0.262799 47:0.240227 48:0.147751 49:0.278808 50:0.181505 51:0.055708 52:0.193509 53:0.202725 54:0.0526057 55:0.134752 56:0.0495995 57:0.356861 58:0.0845129 59:0.104566 60:0.39996 61:0.6342 62:0.114047 63:0.433779 64:0.125643 65:0.772348 66:0.708764 67:0.375339 68:0.166098 69:0.183479 70:0.185487 71:0.240553 72:0.335068 73:0.156084 74:0.141139 75:0.355316 76:0.143131 77:0.218201 78:0.612958 79:0.720201 80:0.476939 81:0.308423 82:0.0157075 83:0.182227 84:0.0388576 85:0.26275 86:0.107436 87:0.339874 88:0.366529 89:0.218675 90:0.480064 91:0.291455 92:0.275175 93:0.497424 94:0.267704 95:0.316201 96:0.504208 97:0.307344 98:0.470903 99:0.466075 100:0.176963
0,32,65,66,96

Example of plt output with the --probabilities flag for a single example:

65:0.630946 66:0.754826 67:0.625329
65,66,67 

Is this OK? It also seems inconsistent with other binary and multiclass reductions, which output a single line per example.

  2. Modified loss reporting. MULTILABEL::output_example calculates Hamming loss (and expects the output labels to be sorted by indices). I override this for PLT (mostly by copying and modifying the code from multilabel.cc): for plt, a sum of binary losses is reported during training and 0 during prediction (sketched below). I'm not sure if this is OK, but it seemed like the most logical solution for this reduction.
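To make the contrast concrete (a sketch with my own notation, not the exact code): MULTILABEL::output_example compares the true label set $Y$ against the predicted set $\hat{Y}$, while the PLT override sums binary logistic losses over the tree nodes visited for an example:

$$\ell_{\text{Hamming}}(Y,\hat{Y}) = \lvert Y \,\triangle\, \hat{Y} \rvert, \qquad \ell_{\text{PLT}}(x,Y) = \sum_{v \in V(x,Y)} \ell_{\log}\bigl(y_v, \hat{p}_v\bigr),$$

where $V(x,Y)$ is the set of nodes visited while training on $x$, $y_v \in \{0,1\}$ indicates whether node $v$ lies on a path to a positive label, and $\hat{p}_v$ is that node classifier's predicted probability.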

Questions:

  1. I don't know why so many CI/CD jobs fail. Some of the errors do not seem to be a result of my changes.

  2. It seems that there might be something wrong with the multilabel_oaa reduction when using the --probabilities flag: it predicts (writes to output) different labels than without the flag.

  3. It seems that MULTILABEL::output_example does not use all.sd->ldict; is that intended?

Could you comment on my questions and review my changes now @zwd-ms @jackgerrits?

mwydmuch avatar Nov 26 '22 11:11 mwydmuch

I also have some problems passing the code-formatting tests; it looks like ./utl/clang-format.sh docker fix formats the code in a way that does not pass the lint.c++.clang-tidy job.

mwydmuch avatar Nov 26 '22 11:11 mwydmuch

Thanks for making the changes @mwydmuch

New changes to the PLT reduction:

Is this OK? It also seems inconsistent with other binary and multiclass reductions, which output a single line per example.

From the example given, the output appears to be sorted by indices. Do you mean that other reductions report probabilities for all indices? @jackgerrits, is there any requirement on the multiclass output format?

  2. Modified loss reporting. MULTILABEL::output_example calculates Hamming loss (and expects the output labels to be sorted by indices). I override this for PLT (mostly by copying and modifying the code from multilabel.cc): for plt, a sum of binary losses is reported during training and 0 during prediction. I'm not sure if this is OK, but it seemed like the most logical solution for this reduction.

Otherwise, the updated output looks reasonable to me. It would be better to document these somewhere in the wiki later.

Questions:

  1. I don't know why so many CI/CD jobs fail. Some of the errors do not seem to be a result of my changes.

One issue is that the --probabilities flag needs to be included in the help output file train-sets/ref/help.stdout.

  2. It seems that there might be something wrong with the multilabel_oaa reduction when using the --probabilities flag: it predicts (writes to output) different labels than without the flag.

Would you mind filing a separate bug for this? Thanks!

  3. It seems that MULTILABEL::output_example does not use all.sd->ldict; is that intended?

This needs further investigation. Will try to get back to you.

zwd-ms avatar Nov 28 '22 17:11 zwd-ms

@mwydmuch I merged the latest master and fixed the tests.

zwd-ms avatar Dec 13 '22 22:12 zwd-ms

Sorry about the delay in getting back to you, @mwydmuch. I'm working on this now and hope to get this merged tomorrow.

Is this OK? It also seems inconsistent with other binary and multiclass reductions, which output a single line per example.

We should stick with a single line per example. Outputting either probabilities or just labels based on probabilities is the standard way to do this. Generally speaking, the output of --predictions matches the output prediction type of the reduction, and since the flag changes the output type, it's all consistent here.

65:0.630946 66:0.754826 67:0.625329

This still looks sorted by label index. Looks fine to me.

Modified loss reporting. MULTILABEL::output_example calculates Hamming loss (and expects the output labels to be sorted by indices). I override this for PLT (mostly by copying and modifying the code from multilabel.cc): for plt, a sum of binary losses is reported during training and 0 during prediction. I'm not sure if this is OK, but it seemed like the most logical solution for this reduction.

Otherwise, the updated output looks reasonable to me. It would be better to document these somewhere in the wiki later.

I agree with Wangda; totally reasonable. Reductions get to decide how they calculate loss, so it's fine. Documenting this in the wiki would be fantastic.

  3. It seems that MULTILABEL::output_example does not use all.sd->ldict; is that intended?

multilabel-based reductions do not currently make use of it, so it should be ignored during output for now.

jackgerrits avatar Dec 14 '22 22:12 jackgerrits

@mwydmuch CI should be happy now. I noticed that there aren't any tests for the new --probabilities functionality. Would you be able to add a test or two for that? As soon as that is done we can go ahead and merge this.

jackgerrits avatar Dec 14 '22 22:12 jackgerrits

@jackgerrits, @zwd-ms thank you for your comments and improvements, and sorry again for the long response time.

I noticed that there aren't any tests for the new --probabilities functionality.

I've added tests for the --probabilities option.

I agree with Wangda; totally reasonable. Reductions get to decide how they calculate loss, so it's fine. Documenting this in the wiki would be fantastic.

Cool, I will be happy to add a wiki page describing the PLT reduction; I have time to do it before the end of the year.

(...)

This still looks sorted by label index. Looks fine to me.

Sorry, I gave a bad example here. The PLT reduction outputs labels in the order they are found during the tree search. In the case of top-k prediction, they are naturally ordered by decreasing probability. For prediction with a threshold, the ordering looks even more random (both probabilities and label indices are mixed), for example:

84:0.555875,67:0.86485,51:0.67529,31:0.522847,33:0.864

If ordering by label indices is a hard assumption in VW, I can add some additional sorting so the output is always ordered by label indices.

An additional question about a small quality-of-life improvement: is it possible to set --loss_function to logistic by default for the plt reduction (and if so, what is the intended way of doing it)? Right now it always needs to be passed as an argument. I think there was code doing this in the v8.x implementation, but it was changed in v9.x.

mwydmuch avatar Dec 21 '22 02:12 mwydmuch

Sorry, I gave a bad example here. The PLT reduction outputs labels in the order they are found during the tree search. In the case of top-k prediction, they are naturally ordered by decreasing probability. For prediction with a threshold, the ordering looks even more random (both probabilities and label indices are mixed), for example:

84:0.555875,67:0.86485,51:0.67529,31:0.522847,33:0.864

If ordering by label indices is a hard assumption in VW, I can add some additional sorting so the output is always ordered by label indices.

I think it is okay to leave it as is; all of the information is there, so a user can sort as they wish. If you wanted to, you could sort it, but I'm not sure whether sorting by class or by probability makes more sense. This can easily be fixed later if it becomes an issue for anyone.
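For what it's worth, if a consumer does need index-sorted output, a shell one-liner is enough (assuming the comma-separated label:probability format shown above):

```sh
# Re-sort one prediction line by ascending label index.
echo '84:0.555875,67:0.86485,51:0.67529,31:0.522847,33:0.864' \
  | tr ',' '\n' | sort -t: -k1,1n | paste -sd, -
# -> 31:0.522847,33:0.864,51:0.67529,67:0.86485,84:0.555875
```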

An additional question about a small quality-of-life improvement: is it possible to set --loss_function to logistic by default for the plt reduction (and if so, what is the intended way of doing it)? Right now it always needs to be passed as an argument. I think there was code doing this in the v8.x implementation, but it was changed in v9.x.

It is possible to override the loss_function type (example), but I think that example is not quite correct: we need to override the option value in addition to setting all.loss, and we should print a warning if we are overriding a user-supplied loss function.
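Until that is wired in, the workaround is simply to pass the flag explicitly (dataset name hypothetical):

```sh
# plt currently requires the loss function to be set by hand; logistic
# loss is what makes the probability outputs meaningful.
vw --plt 1000 --loss_function logistic --probabilities -d train.txt
```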

I think both of these things are low priority and could be merged without them (and/or fixed later), so if you'd rather just see this merged, then I'm happy.

jackgerrits avatar Dec 21 '22 14:12 jackgerrits

@jackgerrits Thanks for fixing the warning.

I think both of these things are low priority and could be merged without them (and/or fixed later), so if you'd rather just see this merged, then I'm happy.

Ok, then let's merge it; I'm happy with its current state. I will write a wiki page on the plt reduction soon.

mwydmuch avatar Dec 21 '22 15:12 mwydmuch

Just noticed the changes to save_load; are they necessary? If we make changes to this function, we'll need to make sure we can still load models written by older versions of VW while writing the new version going forward.

If the change is necessary, I can go in and fix it up.
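(A quick smoke test for that, with hypothetical file names: load a model written by an older release in test-only mode and check that it still predicts.)

```sh
# old_model.vw stands in for a model saved by a pre-change build; if
# save_load's read path stays backward compatible, this runs cleanly.
vw -i old_model.vw -t -d test.txt -p preds.txt
```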

jackgerrits avatar Dec 21 '22 15:12 jackgerrits