elasticsearch-learning-to-rank

Add support for CatBoost models

Open ser0t0nin opened this issue 6 years ago • 6 comments

Hello! First thank you for your work and this useful plugin :)

This is not really an issue but a suggestion / feature request. There is a very powerful machine learning library called CatBoost, developed by Yandex, a big company that specializes in search and competes with Google in tens of countries. The CatBoost library provides a novel and powerful approach to the learning-to-rank optimization problem, based on oblivious trees and categorical variables. It is the successor to Yandex's MatrixNet algorithm.
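For readers unfamiliar with oblivious trees, here is a minimal sketch of how one is evaluated. It assumes the usual definition (every node at a given depth shares the same feature/threshold test, so the leaf index is just a bit pattern of test results); the function names and the toy tree are hypothetical and are not taken from CatBoost's code.

```python
from typing import List, Tuple

# One (feature index, threshold) test per depth level, shared by all nodes
# at that level. This layout is an illustrative assumption.
Split = Tuple[int, float]


def leaf_index(splits: List[Split], features: List[float]) -> int:
    """Return the leaf index: bit d is set when the level-d test passes."""
    index = 0
    for depth, (feature, threshold) in enumerate(splits):
        if features[feature] > threshold:
            index |= 1 << depth
    return index


def predict(splits: List[Split], leaf_values: List[float],
            features: List[float]) -> float:
    """Look up the leaf value for one feature vector."""
    return leaf_values[leaf_index(splits, features)]


# Toy tree of depth 2: four leaves, addressed by a 2-bit index.
splits = [(0, 0.5), (1, 2.0)]
leaf_values = [0.1, 0.2, 0.3, 0.4]
score = predict(splits, leaf_values, [0.7, 3.0])  # both tests pass, index 3
```

Because a tree of depth d needs only d comparisons and one array lookup, evaluation is cheap, which is part of the appeal for query-time ranking.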

A couple of months ago they added a JSON format for exporting models (see https://github.com/catboost/catboost/issues/23) after some users requested the feature in order to integrate CatBoost into your plugin (see https://github.com/catboost/catboost/issues/129). However, the structure of the exported JSON is very different from XGBoost's, and it does not seem possible to convert a model from one format to the other because of the different internal structures.
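To illustrate the structural difference, here is a rough sketch that expands a level-wise oblivious tree (one split per depth plus a flat leaf array) into a nested per-node tree. The dictionary keys (`feature`, `threshold`, `le`, `gt`, `leaf`) are hypothetical and match neither CatBoost's nor XGBoost's actual JSON schema; the sketch only shows that the two layouts describe the same function, not that the result could be loaded by the plugin.

```python
from typing import Any, Dict, List, Tuple

Split = Tuple[int, float]  # (feature index, threshold), one per depth level


def expand_oblivious_tree(splits: List[Split], leaf_values: List[float],
                          depth: int = 0, index: int = 0) -> Dict[str, Any]:
    """Recursively expand a level-wise tree into a nested per-node tree."""
    if depth == len(splits):
        return {"leaf": leaf_values[index]}
    feature, threshold = splits[depth]
    return {
        "feature": feature,
        "threshold": threshold,
        # Test fails (value <= threshold): bit `depth` stays 0.
        "le": expand_oblivious_tree(splits, leaf_values, depth + 1, index),
        # Test passes (value > threshold): bit `depth` is set.
        "gt": expand_oblivious_tree(splits, leaf_values, depth + 1,
                                    index | (1 << depth)),
    }


def predict_nested(node: Dict[str, Any], features: List[float]) -> float:
    """Walk the nested tree from the root down to a leaf."""
    while "leaf" not in node:
        passed = features[node["feature"]] > node["threshold"]
        node = node["gt"] if passed else node["le"]
    return node["leaf"]


nested = expand_oblivious_tree([(0, 0.5), (1, 2.0)], [0.1, 0.2, 0.3, 0.4])
```

The expansion duplicates each level's split across all nodes at that depth, so a depth-d tree grows to 2^d leaves and 2^d - 1 internal nodes. Categorical-feature handling, which the sketch ignores, would need separate treatment.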

I think it would be very useful to add some kind of support for these models, because CatBoost was originally created for ranking search results. The library has a Maven-based Java package, so maybe you can include this task in your development pipeline.

ser0t0nin • Nov 08 '18 12:11

Sounds great! I think we'd be very open to a PR for this. Otherwise, one of the major companies that use this plugin (Yelp, Wikimedia, an OpenSource Connections client) would need to be interested in using CatBoost before existing contributors would be able to invest time in integrating it.

softwaredoug • Nov 11 '18 14:11

@ser0t0nin @softwaredoug - until it gets added to the plugin, is there a workaround for using CatBoost with it?

saurzcode • Dec 17 '18 18:12

@saurzcode sorry, I am not a Java programmer, so I cannot add much here.

ser0t0nin • Dec 26 '18 16:12

Any updates on this? It is a great idea. Does anybody want to pick this up?

damitkwr • Apr 15 '20 15:04

It would be very welcome!

Dumb CatBoost question - is there a way to serialize a CatBoost model to another format (like XGBoost's format)?

softwaredoug • Apr 15 '20 15:04

@softwaredoug CatBoost claims to serialize to a bunch of formats. https://catboost.ai/docs/concepts/applying__models.html

The ones that jump out are PMML (older standard emphasizing generality and portability), and ONNX (newer Microsoft+Facebook interop standard focused on pluggable "Execution Providers" for accelerating prediction).

*Vespa now supports ONNX as well.

The smart money is on ONNX, as hardware providers have embraced it (https://github.com/microsoft/onnxruntime/blob/master/README.md#supported-accelerators) and it's getting adopted or adapted by cloud ML providers (AzureML, TensorFlow). The focus seems to be traditional Neural ... but tree-models are supported ... don't know about accelerated yet.

Given that query-time prediction latency constraints directly impact throughput and rescore window-size (therefore LTR search relevance), there's a strong case to be made for integrating ONNX Runtime into the LTR plugin. (With the added bonus of a common serialization format that all new model shops like CatBoost are going to want to support).
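As a back-of-the-envelope illustration of the latency point, the sketch below assumes a simple linear cost model: a fixed per-query overhead plus a constant scoring cost per rescored document. The function name and all numbers are hypothetical, not measurements of the plugin or of ONNX Runtime.

```python
def max_rescore_window(latency_budget_us: int, fixed_overhead_us: int,
                       per_doc_us: int) -> int:
    """Largest rescore window that fits the budget under a linear cost model.

    All arguments are integer microseconds and are illustrative assumptions.
    """
    if per_doc_us <= 0:
        raise ValueError("per_doc_us must be positive")
    remaining = latency_budget_us - fixed_overhead_us
    return max(0, remaining // per_doc_us)


# Hypothetical: 50 ms budget, 10 ms spent outside model scoring.
baseline = max_rescore_window(50_000, 10_000, per_doc_us=40)
# Hypothetical 4x faster predictor under the same budget.
accelerated = max_rescore_window(50_000, 10_000, per_doc_us=10)
```

Under these assumptions, a 4x faster predictor allows a 4x larger rescore window for the same latency budget, which is where the relevance gain would come from.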

peterdm • Jul 09 '20 16:07