
Computing Top-K accuracy on validation data is disproportionately slow

Open msvensson222 opened this issue 2 years ago • 11 comments

I use TFRS in an e-commerce retail setting with a lot of purchase-history and click-stream data, using a multitask recommender model. The model trains, evaluates and serves fine, with one unfortunate issue: validation during training takes 5x as long as the training itself, on a fraction of the data. To exemplify, I have 10M rows of interactions that I split 70/20/10 into training/validation/test. There are 1M unique users and 100k unique items.

I train using 4 GPUs, and one epoch takes ~4 minutes to run through the 7M training rows, but another ~20 minutes to compute top-K accuracy on the validation data (2M rows). During training, I make sure to specify compute_metrics=not training.

My retrieval task is set up as follows, where I use as large a batch size as my hardware allows to speed things up as much as possible.

self.retrieval_task: tf.keras.layers.Layer = tfrs.tasks.Retrieval(
    loss=...,
    metrics=tfrs.metrics.FactorizedTopK(candidates=items_ds.batch(8192).map(self.item_model))
)
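
For reference, I pass that compute_metrics=not training flag when calling the retrieval task inside compute_loss; a minimal sketch (feature names simplified to user_id and item_id):

def compute_loss(self, features, training=False):
    user_embeddings = self.user_model(features["user_id"])
    item_embeddings = self.item_model(features["item_id"])
    # Skip the expensive factorized top-K metrics on training batches;
    # they are still computed on validation batches (training=False).
    return self.retrieval_task(
        user_embeddings,
        item_embeddings,
        compute_metrics=not training,
    )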

And I train as follows:

model.fit(train.batch(BATCH_SIZE),
          epochs=15,
          validation_data=val.batch(BATCH_SIZE)
)

If I run 15 epochs, I spend 1 hour training and 5 hours validating, even though the validation set is well under a third the size of the training data. Is this immense time difference expected? Is there anything I can do to improve performance during validation?

msvensson222 avatar Oct 12 '21 15:10 msvensson222

For background, evaluation is slow because true top-K evaluation requires computing scores across all the candidates in your corpus. This scales linearly with corpus size: with 100K candidates it will be roughly 10 times slower than with 10K candidates.
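
As a toy illustration of the underlying cost (not TFRS code, just the core operation): every query embedding in a batch is scored against every candidate embedding before the top K are kept, so the work grows with the number of candidates.

import tensorflow as tf

num_candidates, dim = 100_000, 32
queries = tf.random.normal((256, dim))                # one validation batch of query embeddings
candidates = tf.random.normal((num_candidates, dim))  # embeddings for the full candidate corpus

# Score every query against every candidate, then keep the 100 best.
scores = tf.matmul(queries, candidates, transpose_b=True)   # shape (256, 100_000)
top_scores, top_indices = tf.math.top_k(scores, k=100)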

To speed this up you could use BruteForce evaluation:

for epoch in range(epochs):
  model.fit(...)

  model.retrieval_task.factorized_metrics = (
      tfrs.metrics.FactorizedTopK(
          candidates=tfrs.layers.factorized_top_k.BruteForce().index(items_ds.batch(8192).map(self.item_model))
      )
  )
  model.compile()
  
  model.evaluate(val.batch(BATCH_SIZE))

This will use a slightly faster way of computing the evaluation scores than the default. However, you will need to separate your fit and evaluate calls, and make sure to recompile the model before evaluating.

If this is still too slow, you can use ScaNN-based evaluation. This works just as above, but with ScaNN instead of BruteForce. See the tutorial for details.

maciejkula avatar Oct 12 '21 18:10 maciejkula

Thank you for the fast reply. Would you recommend monitoring 'val_total_loss' or 'factorized_top_k/top_100_categorical_accuracy' for early stopping? My reasoning with a "regular" DNN model is to monitor validation loss, since that reflects how confident the model is in its predictions, but maybe this type of recommender model calls for a different approach. I appreciate your insights, thanks.

msvensson222 avatar Oct 13 '21 07:10 msvensson222

I'd always use a real metric that I care about - the top-K accuracy in this case.
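
For example, with model.fit that could look like the sketch below; the metric name assumes the default ks of FactorizedTopK, and Keras prefixes validation metrics with val_:

import tensorflow as tf

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_factorized_top_k/top_100_categorical_accuracy",
    mode="max",                   # higher top-K accuracy is better
    patience=3,
    restore_best_weights=True,
)

model.fit(
    train.batch(BATCH_SIZE),
    epochs=15,
    validation_data=val.batch(BATCH_SIZE),
    callbacks=[early_stopping],
)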

maciejkula avatar Oct 18 '21 23:10 maciejkula

@maciejkula Thanks for your reply regarding this question. I'd like to ask: why is it necessary to separate the .fit() and .evaluate() calls this way? Can't we set the candidates= argument of the retrieval task to the BruteForce or ScaNN layer and not use FactorizedTopK at all? In my case, I have ~1M users, 50k items and ~20M training samples. When I turn the compute_metrics argument on, training is super slow, and according to the profiler 80% of the time is spent on the metrics computation. So for training I really have to turn off the metrics computation; however, it would be nice to compute at least the validation metrics (or the training metrics as well) after a certain number of epochs, because the models are very sensitive to all kinds of parameters. I guess the index has to be rebuilt every time, so what would you recommend: compile and recompile every time after x number of epochs? Could you elaborate on this a bit more extensively, please?

hkristof03 avatar Jun 21 '22 07:06 hkristof03

@hkristof03,

The easiest way to compute metrics only during validation is to pass compute_metrics=not training in your call to the retrieval task.

When the metrics are computed, the candidate embeddings are re-calculated for every validation batch. The code snippet @maciejkula posted saves computation by calculating the candidate embeddings only once per validation epoch.

If you wish to apply this optimisation you would indeed need to periodically stop training, recompile and call evaluate.
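
A rough sketch of that pattern, reusing items_ds, BATCH_SIZE and the item model from the snippets above, and assuming you only evaluate every EVAL_EVERY epochs (NUM_EPOCHS and EVAL_EVERY are placeholders):

EVAL_EVERY = 5   # arbitrary choice: number of training epochs between evaluations

for _ in range(NUM_EPOCHS // EVAL_EVERY):
    model.fit(train.batch(BATCH_SIZE), epochs=EVAL_EVERY)

    # Recompute the candidate embeddings once, then evaluate against them.
    brute_force = tfrs.layers.factorized_top_k.BruteForce()
    brute_force.index_from_dataset(items_ds.batch(8192).map(model.item_model))

    model.retrieval_task.factorized_metrics = tfrs.metrics.FactorizedTopK(
        candidates=brute_force
    )
    model.compile()
    model.evaluate(val.batch(BATCH_SIZE))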

patrickorlando avatar Jun 22 '22 01:06 patrickorlando

@patrickorlando

This optimization works quite well, but it causes a memory explosion, because the candidates are recomputed for each validation run and a new computation graph is created each time (link).

We followed this tutorial to write our custom training & test loops; they are decorated with @tf.function. Do you have any suggestions for removing the previous computation graphs? tf.keras.backend.clear_session() and gc.collect() do not solve the problem; they only partially release some memory, as we checked with tf.config.experimental.get_memory_info('GPU:0'). I did not find any way on the web to manually clear the TensorFlow function cache.

This problem is related to this issue.

hkristof03 avatar Sep 13 '22 18:09 hkristof03

Hi @hkristof03,

I'm certainly not an expert in this area and you may have already implemented it in this way, but I'll share my thoughts. It might help if you could post a code sample of the val_step and where you are computing the new candidates before each validation epoch.

Since you only have 50K items, you can probably get away with the BruteForce Index.

My approach would be to:

  1. Create or load your candidates in a tf.data.Dataset.
  2. Create a BruteForce layer and index it with brute_force.index_from_dataset(candidate_ds.batch(512).map(candidate_model)). These vectors will initially be random.
  3. Define the train_step and test_step outside of your training loop. The val_step will use the index created above. Decorate these with tf.function.
  4. Before each validation epoch, re-index the retrieval layer (don't create a new layer each time), using the same candidate_ds. Ensure this function is not wrapped in a tf.function decorator.

I think this should avoid a memory explosion because the candidates are stored as tf.Variable and re-indexing will update the state of those variables, rather than creating new nodes in the graph. I'm not exactly sure how the ScaNN layer will behave, so I'd try the BruteForce first.
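
A rough sketch of steps 1, 2 and 4; candidate_ds and candidate_model are dummy stand-ins here, and following the reasoning above the same brute_force layer is reused rather than recreated:

import tensorflow as tf
import tensorflow_recommenders as tfrs

# Dummy stand-ins for the real candidate ids and candidate tower.
candidate_ds = tf.data.Dataset.from_tensor_slices(tf.range(50_000))
candidate_model = tf.keras.Sequential([tf.keras.layers.Embedding(50_000, 32)])

# Step 2: one BruteForce layer, created a single time and indexed up front.
brute_force = tfrs.layers.factorized_top_k.BruteForce(k=100)
brute_force.index_from_dataset(candidate_ds.batch(512).map(candidate_model))

# Step 3 would define train_step/test_step as tf.functions; the test_step's
# FactorizedTopK metric refers to this same layer.
metric = tfrs.metrics.FactorizedTopK(candidates=brute_force)

def refresh_candidates():
    # Step 4: a plain Python function (not a tf.function), called before each
    # validation epoch to re-index the existing layer with fresh embeddings.
    brute_force.index_from_dataset(candidate_ds.batch(512).map(candidate_model))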

Let me know if this works 🤞

patrickorlando avatar Sep 17 '22 02:09 patrickorlando

Hi @maciejkula I tried your idea:

model.retrieval_task.factorized_metrics = (
    tfrs.metrics.FactorizedTopK(
        candidates=tfrs.layers.factorized_top_k.BruteForce().index(items_ds.batch(8192).map(self.item_model))
    )
)

with the lines from the retrieval example:

me = movies.batch(8192).map(movie_model)
candidates = tfrs.layers.factorized_top_k.BruteForce().index(me)
task = tfrs.tasks.Retrieval(
    metrics=tfrs.metrics.FactorizedTopK(
        # candidates=movies.batch(128).map(movie_model)  # This was the original line from the tutorial
        candidates=candidates
    )
)

I got the following error:

Traceback (most recent call last):
  File "retrieval.py", line 80, in <module>
    candidates=tfrs.layers.factorized_top_k.BruteForce().index(me)
  File "python3.7/site-packages/tensorflow_recommenders/layers/factorized_top_k.py", line 539, in index
    identifiers = tf.range(candidates.shape[0])
AttributeError: 'MapDataset' object has no attribute 'shape'

Thanks for your help!

houghtonweihu avatar Sep 01 '23 18:09 houghtonweihu

@houghtonweihu the index method expects a tensor of embeddings as input; to index from a dataset you need to use the index_from_dataset method.

Example:

brute_force = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
brute_force.index_from_dataset(
    movies.batch(128).map(lambda title: (title, model.movie_model(title)))
)
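
Once indexed, the layer can be queried directly with raw query features (the user id here is just an example value), returning scores and the matching title identifiers:

# The BruteForce layer applies model.user_model to the raw query before scoring.
scores, titles = brute_force(tf.constant(["42"]))
print(titles[0, :3])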

Make sure you are referencing the tutorial that matches the tfrs version you are using. Current Retrieval Tutorial

I also find reading the function docstring to be helpful when encountering errors like this.

patrickorlando avatar Sep 04 '23 00:09 patrickorlando

Hi @patrickorlando, the following lines are usually used during inference, after the model has been trained:

brute_force = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
brute_force.index_from_dataset(
    movies.batch(128).map(lambda title: (title, model.movie_model(title)))
)

However, the idea of @maciejkula

model.retrieval_task.factorized_metrics = (
    tfrs.metrics.FactorizedTopK(
        candidates=tfrs.layers.factorized_top_k.BruteForce().index(items_ds.batch(8192).map(self.item_model))
    )
)

is to speed up the training, if I understood his intent correctly.

houghtonweihu avatar Sep 05 '23 13:09 houghtonweihu

A lot of the other comments mention this, but I'm writing it here in case it helps. If you don't need the accuracy measure to be 100% exact, you can use ScaNN in this manner:

self.scann = tfrs.layers.factorized_top_k.ScaNN()  # can change num_leaves, num_leaves_to_search to change time it takes and accuracy of result
self.scann.index_from_dataset(item_ds.batch(8192).map(self.item_model))  # This will take some time
self.retrieval_task: tf.keras.layers.Layer = tfrs.tasks.Retrieval(
    loss=...,
    metrics=tfrs.metrics.FactorizedTopK(candidates=self.scann)
)

This should improve your metrics-computation runtime severalfold.

mayanksingh09 avatar Oct 24 '23 15:10 mayanksingh09