[Question] How to train / continuously train with a large dataset?
The tutorial example illustrates how to recommend movies to users. I have two follow-up questions about how this approach scales up:
- What if the number of users increases to 50M and the number of movies to 100k? (We can't feed everything into memory.)
- What happens when new users and new movies are added? How would this architecture handle continuous training?
Thanks
I would also like to know how you handle models whose memory requirements grow too large for a single machine to train.
As for (2), the usual answer is that you either use a hashing function to leave some "space" for future users to fit into, or move to a sequential model. From what I can tell, the former is only a temporary fix, because the hash space also has an upper limit on the number of users it can hold before collisions pile up. The latter is the more common answer: you move away from user IDs entirely and represent each user as the sequence of their past N interactions. The downside is that the model has no long-term memory for that user.
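For what it's worth, here is a minimal sketch of the hashing idea using Keras layers; the bin count and embedding size are made-up values, not recommendations:

```python
import tensorflow as tf

num_bins = 2_000_000   # illustrative; sized with headroom for future users
embedding_dim = 32     # illustrative

user_model = tf.keras.Sequential([
    # Hashing maps raw IDs straight into [0, num_bins) with no vocabulary
    # lookup, so never-seen user IDs still land in a bucket.
    tf.keras.layers.Hashing(num_bins=num_bins),
    tf.keras.layers.Embedding(num_bins, embedding_dim),
])

# A brand-new user gets an embedding too, at the cost of possible collisions.
print(user_model(tf.constant(["user_42", "brand_new_user"])).shape)
```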
@ThunderSmotch for (2), totally agree...
As for (1), by using the tf.data loader with the TFRecord format (I think CSV also works), I was able to train on a very large dataset. I first tried it with a bunch of CSV files and it was super slow... I don't quite understand how tf.data works behind the scenes, e.g. whether IO time is the bottleneck.
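In case it helps, this is roughly the streaming pipeline I mean; the file pattern and feature spec are hypothetical placeholders for your own schema:

```python
import tensorflow as tf

# Hypothetical shard pattern and schema; adapt to your data.
files = tf.data.Dataset.list_files("data/train-*.tfrecord")

feature_spec = {
    "user_id": tf.io.FixedLenFeature([], tf.string),
    "movie_id": tf.io.FixedLenFeature([], tf.string),
}

dataset = (
    files
    # Read several shards in parallel instead of one file at a time;
    # serial reads are often why a naive CSV pipeline feels so slow.
    .interleave(
        tf.data.TFRecordDataset,
        cycle_length=tf.data.AUTOTUNE,
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    .map(
        lambda x: tf.io.parse_single_example(x, feature_spec),
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    .shuffle(100_000)  # shuffle buffer only; the full dataset never sits in memory
    .batch(4096)
    # Overlap input reading with the training step on the accelerator.
    .prefetch(tf.data.AUTOTUNE)
)
```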