elasticsearch-learning-to-rank
Add support for CatBoost models
Hello! First thank you for your work and this useful plugin :)
In fact this is not an issue, but a suggestion / feature request. There is a very powerful machine learning library called CatBoost, developed by Yandex, a big company that specializes in search and competes with Google in tens of countries. The CatBoost library provides a novel and powerful approach to the learning-to-rank optimization problem, based on oblivious trees and categorical variables. This approach is also known as MatrixNet.
A couple of months ago they added a JSON format for exporting their models (see https://github.com/catboost/catboost/issues/23) after some users requested the feature in order to integrate it into your plugin (see https://github.com/catboost/catboost/issues/129). However, the structure of the exported JSON is very different from XGBoost's, and it seems a model cannot be converted from one format to the other because the internal structures differ.
I think it would be very useful to add some kind of support for these models, because CatBoost was originally built for ranking search results. The library has a Maven-based Java package, so maybe you can include this task in your development pipeline.
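To make the structural difference concrete, here is a minimal sketch of how an oblivious tree could be evaluated. It assumes every depth level shares a single (feature, border) split, so the leaf is selected by a bitmask of the comparison results rather than by walking node to node as in an XGBoost-style tree. The names `Split`, `ObliviousTree`, `leaf_index` and `predict` are hypothetical and are not taken from the CatBoost JSON export or from the plugin; the bit ordering is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Split:
    # Hypothetical field names, chosen for illustration only.
    feature_index: int
    border: float


@dataclass
class ObliviousTree:
    splits: List[Split]        # exactly one split per depth level
    leaf_values: List[float]   # 2 ** len(splits) values


def leaf_index(tree: ObliviousTree, features: Sequence[float]) -> int:
    """Build the leaf index as a bitmask: bit `level` is set when the
    feature exceeds that level's border (assumed bit ordering)."""
    index = 0
    for level, split in enumerate(tree.splits):
        if features[split.feature_index] > split.border:
            index |= 1 << level
    return index


def predict(trees: Sequence[ObliviousTree], features: Sequence[float]) -> float:
    """Sum the selected leaf value of every tree in the ensemble."""
    return sum(t.leaf_values[leaf_index(t, features)] for t in trees)


# Toy tree of depth 2 with made-up values.
tree = ObliviousTree(
    splits=[Split(feature_index=0, border=0.5), Split(feature_index=1, border=2.0)],
    leaf_values=[0.1, 0.2, 0.3, 0.4],
)
```

Because all nodes at a given depth share one split, the whole tree is just a short list of splits plus a flat array of leaf values, which is why its serialized form looks so different from a node-by-node tree dump.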
Sounds great! I think we'd be very open to a PR for this. Otherwise, one of the major companies that use this plugin (Yelp, Wikimedia, an OpenSource Connections client) would need to be interested in using CatBoost before existing contributors would be able to invest time into integrating it.
@ser0t0nin @softwaredoug - while it gets added to the plugin, is there a workaround that would let us use CatBoost with this?
@saurzcode sorry, I am not a Java programmer, so I cannot contribute anything useful here.
Any updates on this? It is a great idea. Does anybody want to pick this up?
It would be very welcome!
Dumb CatBoost question - is there a way to serialize a CatBoost model to another format (like XGBoost's format)?
@softwaredoug CatBoost claims to serialize to a bunch of formats. https://catboost.ai/docs/concepts/applying__models.html
The ones that jump out are PMML (older standard emphasizing generality and portability), and ONNX (newer Microsoft+Facebook interop standard focused on pluggable "Execution Providers" for accelerating prediction).
Vespa now supports ONNX as well.
The smart money is on ONNX, as hardware providers have embraced it (https://github.com/microsoft/onnxruntime/blob/master/README.md#supported-accelerators) and it's getting adopted or adapted by cloud ML providers (AzureML, TensorFlow). The focus seems to be traditional Neural ... but tree models are supported ... don't know about acceleration yet.
Given that query-time prediction latency constraints directly impact throughput and rescore window size (and therefore LTR search relevance), there's a strong case to be made for integrating ONNX Runtime into the LTR plugin, with the added bonus of a common serialization format that all new model shops like CatBoost are going to want to support.
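As a rough illustration of the latency argument, the sketch below estimates how many documents fit into a rescore window for a given latency budget. It assumes model scoring cost grows linearly with the number of rescored documents and that everything else is a fixed overhead; the function name `max_rescore_window` and all numbers are hypothetical, not measurements from the plugin or from ONNX Runtime.

```python
def max_rescore_window(budget_us: int, overhead_us: int, per_doc_us: int) -> int:
    """Largest rescore window that fits the latency budget.

    All arguments are integer microseconds (integers avoid float rounding).
    Assumes scoring cost is linear in the number of rescored documents.
    """
    if per_doc_us <= 0:
        raise ValueError("per_doc_us must be positive")
    remaining = budget_us - overhead_us
    if remaining <= 0:
        return 0
    return remaining // per_doc_us


# Made-up numbers: 50 ms budget, 10 ms spent outside the model.
slow = max_rescore_window(budget_us=50_000, overhead_us=10_000, per_doc_us=40)
fast = max_rescore_window(budget_us=50_000, overhead_us=10_000, per_doc_us=20)
print(slow, fast)
```

Under these assumptions, halving the per-document prediction cost doubles the window that can be rescored within the same budget, which is the link between faster inference and relevance.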