Real-time recommendations with ALS
So I've found a way to efficiently generate real-time recommendations with an ALS model:
- Train the model & serialize the item factors.
- Spawn workers to generate recommendations. In each one, create an empty model and fill it with the deserialized item factors. The user factors can be left empty since they won't be used. With pyarrow's Plasma object store, all workers can share a single in-memory copy of the item factors.
- For each request, populate the user-items matrix with the items of the user requesting recommendations, then call the recommend method with recalculate_user=True (sketched below).
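Roughly, the flow looks like this (a minimal sketch; `interactions` is a placeholder for the training matrix, and the exact recommend signature may differ between implicit versions, so treat it as illustrative):

```python
import numpy as np
from implicit.als import AlternatingLeastSquares

# --- training job: fit the model, then persist only the item factors ---
model = AlternatingLeastSquares(factors=64)
model.fit(interactions)  # sparse interaction matrix; orientation
                         # (item-user vs user-item) depends on the
                         # implicit version you're running
np.save("item_factors.npy", model.item_factors)

# --- each worker: an "empty" model wrapping the shared item factors ---
serving_model = AlternatingLeastSquares(factors=64)
serving_model.item_factors = np.load("item_factors.npy")
# user_factors stays unset: recalculate_user=True recomputes the user's
# factor from their items on every request

def recommend_for(user_items_row, n=10):
    # user_items_row: 1 x n_items sparse CSR row holding the requesting
    # user's interactions; userid 0 indexes into this one-row matrix
    return serving_model.recommend(0, user_items_row, N=n,
                                   recalculate_user=True)
```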
It works pretty well (0.2 seconds per request on my dataset of 100M interactions and 500K unique items), but the main bottleneck is calculating the dot product between the user factors and the item factors; almost all of the time is spent on this operation. Since it parallelizes easily, I managed to get an enormous speedup by using CUDA to calculate the dot product.
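The CUDA part is essentially just this (a rough cupy sketch of what I'm doing, not implicit's code; `item_factors` is the deserialized factor matrix from above):

```python
import cupy as cp

# copy the item factors to the GPU once at worker startup, shape (n_items, factors)
item_factors_gpu = cp.asarray(item_factors)

def score_items(user_factor):
    # one dense matrix-vector product over all 500K items: the hot spot above
    scores = item_factors_gpu.dot(cp.asarray(user_factor))
    # bring the scores back to the host for top-N selection on the CPU
    return cp.asnumpy(scores)
```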
So my questions are:
- Does my approach to real-time recommendations make sense? Perhaps I'm missing something?
- Using CUDA to compute the dot product in the recommend method could improve performance significantly. Would you consider implementing it?
- Should I describe my approach in a blog post? Do you think it would be useful to other users?
Awesome! If you write a blog post I will link to it from the readme.
Your approach makes sense, I've done similar things for serving requests in production before (for news recommendations).
I tried out using CUDA before via cupy, but found it didn't noticeably speed things up. The dot product does take a fair amount of time, but I think the argpartition to get the top N results probably dominates the time. The dot product can easily be ported to CUDA, but getting an efficient selection algorithm is a little tricky (cupy, for instance, just sorts the whole input: https://github.com/cupy/cupy/pull/294 ).
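For context, the selection step I mean is essentially this numpy pattern (simplified, not the library's exact code):

```python
import numpy as np

def top_n(scores, n=10):
    # argpartition finds the N largest scores in O(n_items) without sorting
    # everything; only the N survivors are then sorted for the final ranking
    ids = np.argpartition(scores, -n)[-n:]
    return ids[np.argsort(-scores[ids])]
```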
So I did some benchmarking on my actual data & requests from the access logs: https://gist.github.com/dmitriyshashkin/7a85e6fd9a270d999bc79ebe1e398084 It confirms my earlier conclusion that the dot product takes far more time (30x) than the argpartition. Perhaps it's due to some peculiarities of my data. I guess I should try a similar benchmark on the lastfm dataset.
Ran the same benchmark with GPU acceleration (on Google Colab) https://gist.github.com/dmitriyshashkin/8f5d7eb3a36096bf5a6bb304163e0f36 and you're absolutely right: argpartition in cupy is incredibly slow, and it kills all the performance gains from the fast dot product.
I think your data is probably pretty normal - it makes sense that the dot product on the CPU will take more time. I was just noticing that the cupy code for argpartition didn't seem all that efficient =(.
For speeding up generating the results, I think the best bet right now is to use something like NMSLIB or FAISS to generate approximate results. This will speed up recommendations by several orders of magnitude in your case, with only a slight loss in precision.
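Something along these lines with FAISS, for example (a rough sketch; the index type and the nlist/nprobe values are just illustrative, not tuned recommendations):

```python
import faiss
import numpy as np

factors = item_factors.astype(np.float32)  # faiss wants float32
d = factors.shape[1]

# IVF index over the item factors; inner product matches ALS scoring
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)
index.train(factors)
index.add(factors)
index.nprobe = 16  # search more cells for precision, fewer for speed

def recommend_approx(user_factor, n=10):
    scores, ids = index.search(user_factor.reshape(1, -1).astype(np.float32), n)
    return list(zip(ids[0], scores[0]))
```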