vowpal_wabbit
Make --plt predict probabilities?
Short description
At the moment VW only outputs probabilities for multiclass problems when using the --oaa or --csoaa_ldf options.
However, these are very slow for classification problems with a large number of classes. Fortunately, --plt solves this speed problem, but it cannot output probabilities: the --probabilities flag is simply ignored.
Are there any plans to extend --plt so that it also outputs probabilities?
How this suggestion will help you/others
--plt is very useful for classification problems with a large number of classes. However, users need to be able to assess how certain a prediction is in order to decide how much to trust it.
Possible solution/implementation details
Example/links if any
#2613 has a very similar request and could also benefit from an enhancement of --plt.
@mwydmuch do you think it is feasible for plt to output probabilities?
Hi @jackgerrits, I'm sorry for the late reply, I was quite busy the last two weeks. Indeed, it would be useful to have an output with probabilities for PLT. It's not feasible to output probabilities for all labels using PLT; it only makes sense to output some of them (the top k, or those above a given threshold). This requires an output type that supports pairs like (index, probability). I see that there is no such prediction type in VW, right? If not, I can prepare a PR that adds a new field to the polyprediction structure and a new prediction type, along with an option to output probabilities for PLT. I think I understand the C++ part related to prediction types, but does this also require updating other parts of the project, e.g. the Python bindings?
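To illustrate why returning only some (index, probability) pairs is natural for PLT, here is a minimal sketch and not VW's implementation. It assumes a complete binary tree stored heap-style and a hypothetical `node_prob` callable that returns each node's conditional probability. A label's probability is the product of the node probabilities on its root-to-leaf path, so any subtree whose running product drops below the threshold can be skipped entirely.

```python
from typing import Callable, List, Tuple


def plt_predict(
    node_prob: Callable[[int], float],  # hypothetical per-node estimator
    num_labels: int,
    threshold: float,
) -> List[Tuple[int, float]]:
    """Return (label index, probability) pairs with probability >= threshold.

    Toy layout (an assumption for this sketch): a complete binary tree stored
    as a heap. Nodes 0..num_labels-2 are internal, the remaining nodes are
    leaves, and leaf n corresponds to label n - (num_labels - 1).
    """
    first_leaf = num_labels - 1
    results: List[Tuple[int, float]] = []
    stack = [(0, node_prob(0))]  # (node, product of probabilities so far)
    while stack:
        node, prob = stack.pop()
        if prob < threshold:
            # The product can only shrink further down, so prune this subtree.
            continue
        if node >= first_leaf:
            results.append((node - first_leaf, prob))
            continue
        for child in (2 * node + 1, 2 * node + 2):
            stack.append((child, prob * node_prob(child)))
    # Most probable labels first.
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


# Made-up node probabilities for a tree with 4 labels (nodes 0..6).
toy_probs = {0: 1.0, 1: 0.9, 2: 0.2, 3: 0.8, 4: 0.1, 5: 0.5, 6: 0.5}
confident = plt_predict(toy_probs.__getitem__, num_labels=4, threshold=0.5)
everything = plt_predict(toy_probs.__getitem__, num_labels=4, threshold=0.0)
```

With a high threshold only a small part of the tree is visited, whereas a threshold of zero forces a visit to every node, which is the expensive case that a top-k or thresholded output avoids.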
Thanks for the response @mwydmuch, not a problem at all!
You're right, there isn't an index,prob pair prediction. Seems reasonable to add. If you do the C++ portion I am happy to finish off the Python bits. Please feel free to reach out to me if you need any assistance.
I wanted to ask what the status is here. It sounded as if you wanted to implement this. As far as I can see that hasn't happened yet though, has it? I would be very happy to hear from you.
Honestly, I completely forgot about this one. I'm really sorry. I will find some time to prepare a PR this week.
Is there any news on this topic?
Hello @mwydmuch, I would also be very interested in this topic. Is there any news?
I saw PR #4138 tackles this feature request. Any news on the progress of the PR? Would be really great to have probabilities for the plt.
@mwydmuch, please let me know if I can help with the in-progress PR in any way.
Hi @jackgerrits, I found some time last week to finalize the PR, but after merging with the current upstream/master I encountered a few new problems:
- Adaptive learning stopped working correctly. The model only reaches correct predictive performance when using --sgd.
- There is some problem/change regarding the reading of MULTILABEL data. This also seems to affect the multilabel_oaa reduction, which started to throw errors like this on correct data:
[error] label 363 is not in {0,3992} This won't work right.
At the moment I don't know what changed that is causing the problems. Since this seems to be a problem with some other change in VW core, I would appreciate your help on that.
Hi @mwydmuch,
- Regarding the use of --sgd, I noticed some previous discussion in the PR and Jack's comment. Did --predict_only_model solve the problem?
- Can you please provide an example leading to the error? That might help reproduce the issue and investigate further.
Thanks
Hi @zwd-ms and @jackgerrits, the problem resolved itself when I synced with upstream again. Now the PR seems to work correctly in all configurations. I updated the demo, and it is now a single file, plt_demo.py, which makes it a bit easier to test the plt and multilabel_oaa reductions on a few datasets and with different parameters.
I still have some questions regarding what should be the best way to handle multilabel data. I posted it in the PR.
This has now been resolved in #2766