
How to use the replay buffer in tf_agents for a contextual bandit that predicts and trains on a daily basis

Open tejavenkatk opened this issue 2 years ago • 2 comments

I am using the TF-Agents library for a contextual bandits use case.

In this use case, predictions are made multiple times a day (20k to 30k predictions per day, one per user), and training happens only on the predictions made 4 days earlier, since the labels for a prediction take 3 days to observe.

The driver seems to collect only batch_size experiences per run (since the max step length is 1 for contextual bandits), and the replay buffer has the same constraint, holding only batch_size experiences.

I want to use a checkpointer to save all the predictions (the experience collected by the driver and stored in the replay buffer) from the past 4 days, and on each given day train only on the oldest of those 4 days.

I am unsure how to do the following (none of the examples or the documentation covers use cases like this), and any help is greatly appreciated.

  1. How to run the driver and save the replay buffer with a checkpointer for an entire day (a day contains, say, 3 prediction runs, and each prediction run covers 30,000 observations [with, say, a batch size of 16]). In this case I need multiple saves per day.
  2. How to save the replay buffers for the past 4 days (12 prediction runs) and, on each given day, retrieve only the 3 prediction runs of the oldest day (replay buffer and driver run) for training.
  3. How to configure the driver, replay buffer, and checkpointer given #1 and #2 above.

tejavenkatk avatar Apr 27 '22 18:04 tejavenkatk

Hi! I'm not a developer of TF-Agents, just another user, so please consider this as just my opinion/suggestion.

Intro

From what I understand, the significance of the driver and replay_buffer lies mostly in full RL problems, e.g. when you want your agent to play a lot of chess:

  1. you can't batch observations/state/context, so you need to do a lot of environment-agent interactions
    • driver is there for this
  2. you want to capture several games (epochs) worth of data and use it to update your agent
    • replay_buffer will help you

However, in contextual bandits (CMAB), especially in the offline case (you have already collected some data), the importance of both of those objects is diminished, since:

  1. you can batch the observations (e.g. push all 30k observations at once)
    • this means far fewer environment-agent interactions
      • in this case, the driver is unlikely to bring a significant performance gain over a simple Python for-loop (see the sketch after this list)
  2. with larger batch sizes, you can even update your agent on every batch, so you don't need a replay_buffer
    • unless you want to do something more complex, such as sampling
    • you can still use one just to collect all the data into a trajectory
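To make that concrete, here is a minimal driver-free sketch. It assumes (these are my assumptions, not from your setup) that `agent` is an already-constructed TF-Agents bandit agent such as `LinearUCBAgent`, `observations` is a float tensor of shape `[num_users, context_dim]` matching the agent's specs, and `rewards` is a `[num_users]` tensor observed 3 days later; the helper names `score_batch`/`train_batch` are made up for illustration:

```python
import tensorflow as tf
from tf_agents.trajectories import time_step as ts
from tf_agents.trajectories import trajectory


def score_batch(agent, observations):
  """Score all users in one call -- this replaces the driver loop for CMAB."""
  time_step = ts.restart(observations, batch_size=observations.shape[0])
  return agent.policy.action(time_step)  # PolicyStep(action, state, info)


def train_batch(agent, observations, policy_step, rewards):
  """Update the bandit on (observation, action, reward) triples once labels arrive."""
  batch_size = observations.shape[0]
  add_time_dim = lambda t: tf.expand_dims(t, axis=1)  # -> [batch, time=1, ...]
  experience = trajectory.Trajectory(
      observation=add_time_dim(observations),
      action=add_time_dim(policy_step.action),
      policy_info=tf.nest.map_structure(add_time_dim, policy_step.info),
      reward=add_time_dim(rewards),
      discount=add_time_dim(tf.ones_like(rewards)),
      step_type=add_time_dim(tf.fill([batch_size], ts.StepType.FIRST)),
      next_step_type=add_time_dim(tf.fill([batch_size], ts.StepType.LAST)))
  return agent.train(experience)
```

With 20k-30k rows per run this easily fits in memory, and you can split it into a few large batches if you prefer smaller training steps.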

Suggestions - back to your questions

From your question, I understand you are doing batch data processing several times a day (no online processing) with a reasonably small dataset (tens of thousands of observations), and for that my suggestions to your questions are:

  1. I would collect the data from your replay_buffer (or directly from the time_step and policy_step) as a tf.data.Dataset and save it to disk, e.g. serialize each scoring run into a separate folder named by its date-time (see the sketch after this list).
  2. You would then just load the data from the appropriate folder(s) when it is time to train.
  3. This should be fine as well, since we avoid these configurations entirely.
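A rough sketch of points 1 and 2 follows. The directory layout, `BASE_DIR`, and the helper names are illustrative assumptions; the tensors can come straight from the policy step and the observed rewards, or from the replay buffer (e.g. via its `gather_all()` / `as_dataset()` methods). `tf.data.Dataset.save`/`load` need a reasonably recent TF 2 release; older ones have `tf.data.experimental.save`/`load` instead:

```python
import datetime
import os
import tensorflow as tf

BASE_DIR = "/data/bandit_runs"  # hypothetical storage location


def save_run(observations, actions, rewards, run_time=None):
  """Write one scoring run to <BASE_DIR>/<YYYY-mm-dd>/<HHMMSS>."""
  run_time = run_time or datetime.datetime.utcnow()
  path = os.path.join(BASE_DIR,
                      run_time.strftime("%Y-%m-%d"),
                      run_time.strftime("%H%M%S"))
  ds = tf.data.Dataset.from_tensor_slices(
      {"observation": observations, "action": actions, "reward": rewards})
  ds.save(path)  # on older TF: tf.data.experimental.save(ds, path)
  return path


def load_day(day):
  """Load and concatenate all scoring runs saved for one calendar day."""
  day_dir = os.path.join(BASE_DIR, day.strftime("%Y-%m-%d"))
  datasets = [tf.data.Dataset.load(os.path.join(day_dir, run))
              for run in sorted(tf.io.gfile.listdir(day_dir))]
  full = datasets[0]
  for ds in datasets[1:]:
    full = full.concatenate(ds)
  return full  # batch() it and feed the batches to agent.train(...)
```

On each training day you would call `load_day` with the date from 4 days ago, batch the resulting dataset, and run your training step over it; the checkpointer is then only needed for the agent's own variables, not for the data.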

Hope you will find this helpful :) Michal

kubistmi avatar Jun 28 '22 15:06 kubistmi

Thank you Michal @kubistmi

tejavenkatk avatar Sep 11 '22 00:09 tejavenkatk