
Factorization machines

Open martincousi opened this issue 6 years ago • 2 comments

Here is a basic factorization machine algorithm that takes into account only the user and item ids. It is equivalent to SVD when using degree=2. I have implemented this algorithm with both the tffm library and the polylearn library for testing purposes. I found tffm to be preferable given the options it offers. To be used with GridSearchCV and RandomizedSearchCV, however, it requires a special value for the session_config argument (see doc).
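For reference, the degree=2 equivalence can be checked numerically. The snippet below is a minimal NumPy sketch, not the code in this PR and not the tffm or polylearn API: `fm_predict` and `one_hot_user_item` are hypothetical helper names, and the parameters are random rather than trained. It assumes each rating is encoded as a one-hot user id concatenated with a one-hot item id, in which case the FM prediction reduces to global mean + user bias + item bias + dot product of the two latent vectors, which is the biased matrix factorization form used by SVD.

```python
import numpy as np


def fm_predict(x, w0, w, V):
    """Degree-2 factorization machine prediction for one dense feature vector x.

    w0: global bias, w: linear weights (n,), V: latent factors (n, k).
    """
    linear = w0 + w @ x
    # Pairwise term computed in O(n * k) using the identity
    #   sum_{i<j} <v_i, v_j> x_i x_j
    #     = 0.5 * sum_f [ (sum_i V[i, f] x_i)^2 - sum_i V[i, f]^2 x_i^2 ]
    xv = x @ V
    pairwise = 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))
    return linear + pairwise


def one_hot_user_item(u, i, n_users, n_items):
    """Encode a (user, item) pair as [one-hot user | one-hot item]."""
    x = np.zeros(n_users + n_items)
    x[u] = 1.0
    x[n_users + i] = 1.0
    return x


# Random (untrained) parameters, for illustration only.
rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 3
w0 = 3.5
w = rng.normal(size=n_users + n_items)
V = rng.normal(size=(n_users + n_items, k))

u, i = 2, 1
fm = fm_predict(one_hot_user_item(u, i, n_users, n_items), w0, w, V)

# Biased matrix factorization with the same parameters:
# global bias + user bias + item bias + <user factors, item factors>
mf = w0 + w[u] + w[n_users + i] + V[u] @ V[n_users + i]

print(np.isclose(fm, mf))
```

With only two active features per row, the pairwise sum has a single term, so the two predictions agree up to floating-point error.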

It's still unclear to me what good default values would be for the algorithm to work in most settings. Currently, both algorithms appear to be slow, although I would have thought that using tensorflow would be fast...

This PR also contains tests for the feature option to Dataset, Trainset, etc.

I am planning to construct more elaborate factorization machine algorithms. The tests for the factorization machine algorithms will follow.

martincousi avatar Apr 24 '18 18:04 martincousi

I have added three new factorization machine algos. There are many more possible, but most of them can be achieved by using the features. Additional ones could also be conceived once the library supports context (user-item pair features such as timestamp, location, etc.).

I would like these algos to be modular such that you can turn on/off implicit information, features, etc. I guess the best way would be to create the sparse lists in FMAlgo and turn on/off the different components in the children. What do you think? Also, should there be many FM objects or only one with multiple options?
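To make the "one object with switches" option concrete, here is a rough sketch of how a single builder could produce the sparse input for one (user, item) pair and toggle the implicit component with a flag. Everything in it is hypothetical: `FMRowBuilder`, `use_implicit` and `rated_items` are illustrative names, not part of this PR or of Surprise, and the 1/|rated items| normalization of the implicit block is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FMRowBuilder:
    """Hypothetical helper: builds one sparse FM input row as (indices, values)."""

    n_users: int
    n_items: int
    use_implicit: bool = False  # switch for the implicit-feedback block

    def build(self, u, i, rated_items=()):
        # Always present: one-hot user id, then one-hot item id.
        indices = [u, self.n_users + i]
        values = [1.0, 1.0]

        # Optional block: items the user has rated, placed after the
        # user and item blocks and normalized to sum to 1 (an assumption).
        if self.use_implicit and rated_items:
            offset = self.n_users + self.n_items
            norm = 1.0 / len(rated_items)
            for j in rated_items:
                indices.append(offset + j)
                values.append(norm)

        return indices, values


basic = FMRowBuilder(n_users=3, n_items=4)
print(basic.build(1, 2))

implicit = FMRowBuilder(n_users=3, n_items=4, use_implicit=True)
print(implicit.build(1, 2, rated_items=[0, 3]))
```

Child classes could then simply fix the flags, while a single class with options would expose them directly; the sparse-list construction is the same either way.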

By the way, the special value for session_config is not needed to do parallelization, at least not on my system.

martincousi avatar Apr 25 '18 20:04 martincousi

Thanks a lot,

Once again I really appreciate the efforts with the docs and the tests.

I'm definitely interested in adding FM into surprise! This is a lot of code for me to digest though ^^ and I don't have tons of free time ATM (should be easier in the following months), so I just wanted to make sure you know that the review process may take a long time.

> should there be many FM objects or only one with multiple options?

I personally like it when there's a single uniform interface to deal with, but it should still be easy to use. Like, if there are lots of incompatible parameters in a single class, maybe it's best to separate them into different classes. I'll leave it to your own appreciation to decide what's best here.

Are you actually using the FM algos you implemented? If so, with what dataset? I'd like to play around with them to get a feel of how to use them, that would make the understanding of all the code (especially the feature part) a lot easier for me.

Thanks!

NicolasHug avatar Apr 27 '18 07:04 NicolasHug