[Question] - Use Precomputed Embeddings as feature
hey @massquantity ! Thanks a lot for maintaining this amazing library that makes experimenting with state-of-the-art RecSys feel like a breeze. This is more a question than an issue, but it could reveal a limitation of the lib. As of now, do you know if there is a handy way of passing precomputed embeddings as features? E.g. an item description or image embedding? The workaround I found so far (not even sure it actually makes sense) was to create one dense feature per embedding dimension, which can lead to 1000+ dense features to pass individually. In the vein of the multi-sparse column, would there be a multi-dense column option that we could use?
Thanks for the question! Currently, the library doesn’t support precomputed embeddings directly—the workaround of splitting them into individual dense features works but isn’t ideal. We may consider adding this feature in the future, but we can't offer a guarantee or timeline for implementation at this stage.
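For reference, the per-dimension splitting workaround can be sketched like this (the column names, embedding dimension, and use of pandas are illustrative assumptions, not a LibRecommender API):

```python
import numpy as np
import pandas as pd

# Assumed setup: each item has a precomputed 4-dim embedding
# (e.g. from an image or text encoder).
items = pd.DataFrame({"item_id": [1, 2, 3]})
emb = np.random.rand(3, 4)  # shape (n_items, emb_dim)

# Split the embedding matrix into one dense column per dimension,
# so each dimension can be passed as an ordinary dense feature.
emb_cols = [f"item_emb_{i}" for i in range(emb.shape[1])]
emb_df = pd.DataFrame(emb, columns=emb_cols)
items = pd.concat([items, emb_df], axis=1)

print(items.columns.tolist())
```

With a 1024-dim embedding this produces 1024 dense columns, which is exactly the awkwardness the question describes.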
@massquantity Sorry to disturb you again. We're experimenting with the solution you suggested but realised that, because the embeddings need to be joined to each user/item interaction, the dataset size explodes (depending on precision and dimension, of course). Our initial plan was to leverage the retraining capability of the lib to mimic batch training, since we could not create a single training Dataset that would fit in memory (we're talking about multiple hundreds of GB of data, whereas we used to work with datasets of around 20GB in RAM during training). However, it seems this solution still requires building the full dataset in the end, which won't work.
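A back-of-envelope estimate shows why the join explodes the dataset (the row count and dimension below are illustrative assumptions):

```python
# Back-of-envelope: joining a per-item embedding onto every interaction row
# duplicates the embedding once per interaction.
n_interactions = 100_000_000   # assumed number of user/item interaction rows
emb_dim = 512                  # assumed embedding dimension
bytes_per_float = 4            # float32 precision

total_gb = n_interactions * emb_dim * bytes_per_float / 1024**3
print(f"{total_gb:.0f} GB")
```

At float32 and 512 dimensions this alone is close to 200 GB, before any other features, which matches the "multiple hundreds of GB" scale described above.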
Instead, is there a way to "stream" the data into training from disk, effectively building one batch of the Dataset at a time to feed the training process?
Sorry, there is no way to stream the data. The library is built on the assumption that the entire dataset can be loaded into memory.
While we recognize the need for handling larger-than-memory datasets and have considered it, it represents a significant amount of work and isn't on our immediate development path.
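Outside the library, a generic disk-streaming pattern can still be sketched with `numpy.memmap` (this is not a LibRecommender API, just one common way to iterate over a larger-than-memory float array saved with `arr.tofile`):

```python
import os
import tempfile

import numpy as np

def iter_batches(path, n_rows, n_cols, batch_size=1024, dtype=np.float32):
    """Yield batches from a raw float array on disk without loading it all.

    Generic sketch: assumes the features were saved with arr.tofile(path),
    so the file is a flat, row-major dump of shape (n_rows, n_cols).
    """
    data = np.memmap(path, dtype=dtype, mode="r", shape=(n_rows, n_cols))
    for start in range(0, n_rows, batch_size):
        # np.asarray copies the slice out of the memmap into RAM.
        yield np.asarray(data[start:start + batch_size])

# Demo: write a small array to disk, then stream it back in batches.
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".bin")
arr = np.arange(10 * 3, dtype=np.float32).reshape(10, 3)
arr.tofile(tmp.name)
tmp.close()

batches = list(iter_batches(tmp.name, 10, 3, batch_size=4))
print([b.shape for b in batches])  # [(4, 3), (4, 3), (2, 3)]
os.remove(tmp.name)
```

The memmap only pages in the slices that are actually read, so peak memory stays at roughly one batch; the missing piece is that the library's training loop would need to accept such an iterator, which is the unsupported part.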