How to use the replay buffer in TF-Agents for a contextual bandit that predicts and trains on a daily basis
I am using the TF-Agents library for a contextual bandits use case.
In this use case, predictions are made multiple times a day (between 20k and 30k predictions per day, one for each user), and training happens only on all the data predicted 4 days ago (since the labels for the predictions take 3 days to observe).
The driver seems to collect only batch_size experiences per run (since max_step length is 1 for contextual bandits), and the replay buffer has the same constraint, holding only batch_size experiences.
I want to use a checkpointer to save all the predictions (the experience from the driver, which is stored in the replay buffer) from the past 4 days, and on each given day train only on the oldest of the 4 saved days.
I am unsure how to do the following (none of the examples or the documentation covers use cases like this), and any help is greatly appreciated.
- How to run the driver and save the replay buffer via checkpoints for an entire day (a day contains, say, 3 prediction runs, and each prediction run covers 30,000 observations [say, at a batch size of 16]). In this case I need multiple saves per day.
- How to save the replay buffers for the past 4 days (12 prediction runs) and retrieve only the first 3 prediction runs (the replay buffer and the driver output) to train on each day.
- How to configure the driver, replay buffer, and checkpointer given #1 and #2 above.
Hi! I'm not a developer of TF-Agents, just another user, so please consider this as just my opinion/suggestion.
Intro
From what I understood, the significance of `driver` and `replay_buffer` lies mostly in full-RL problems, e.g. when you want your `agent` to play a lot of chess:
- you can't batch observations/state/context, so you need to do a lot of environment-agent interactions; the `driver` is there for this (see the sketch after this list)
- you want to capture several games' (epochs') worth of data and use it to update your `agent`; the `replay_buffer` will help you with that
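For illustration, here is a minimal sketch of how the two are typically wired together in a full-RL setup. The `env` (a `TFEnvironment`), the already-built `agent`, and the `max_length`/`num_steps` values are all assumptions on my part, not from your setup:

```python
from tf_agents.drivers import dynamic_step_driver
from tf_agents.replay_buffers import tf_uniform_replay_buffer

# Buffer that stores whatever the driver collects.
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,  # shape/dtype of one transition
    batch_size=env.batch_size,
    max_length=10000)                   # arbitrary capacity

# Driver that steps the environment with the collect policy and
# pushes every transition into the buffer via the observer.
driver = dynamic_step_driver.DynamicStepDriver(
    env,
    agent.collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=100)                      # arbitrary collection length

driver.run()  # interact with the environment and fill the buffer
```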
However, in contextual bandits (CMAB), especially in the offline case (you have already collected some data), the importance of both of those objects is diminished, since:
- you can batch the observations (e.g. push all 30k observations at once), which means much fewer environment-agent interactions
- in this case, the `driver` is unlikely to bring a significant performance increase over a simple Python for-loop (see the sketch after this list)
- with larger batch sizes, you can even update your `agent` on every batch, so you don't need a `replay_buffer` (unless you want to do something more complex, such as sampling); you can still use it just to have it collect all the data into a `trajectory`
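To make the for-loop point concrete, here is a minimal sketch of one batched scoring run without a driver. The 30,000 x 10 observation shape and the pre-built bandit `agent` are assumptions for illustration:

```python
import tensorflow as tf
from tf_agents.trajectories import time_step as ts

# Placeholder contexts: one 10-dim feature vector per user (assumed shape).
observations = tf.random.uniform((30000, 10))

# Each bandit decision is a one-step episode, so a batched `restart`
# TimeStep is all the policy needs.
time_step = ts.restart(observations, batch_size=30000)

# No driver: query the policy directly on the whole batch.
action_step = agent.policy.action(time_step)
actions = action_step.action  # shape (30000,), one chosen arm per user
```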
Suggestions - back to your questions
From your question, I understood you are doing batch data processing several times a day (no online processing) with a sufficiently small dataset (tens of thousands of observations per run), and for that, my suggestions to your questions are:
- I would collect the data from your `replay_buffer` (or the `time_step` and `policy_step` directly) as a `tf.data.Dataset` and save it to the hard drive; e.g. you can serialize each scoring run into a separate folder named by its date-time (see the sketch below).
- You would then just load the data from the appropriate folder.
- This should be OK as well, since the approach above avoids the driver, replay buffer, and checkpointer configurations entirely.
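A minimal sketch of what that could look like; the folder layout, the `element_spec` handling, and the use of `tf.data.experimental.save`/`load` (TF >= 2.4) are my assumptions, not anything prescribed by TF-Agents:

```python
import datetime
import os

import tensorflow as tf

BASE_DIR = "bandit_experience"  # hypothetical root folder


def save_run(trajectory):
    """Persist one scoring run (a batched Trajectory) in its own folder."""
    stamp = datetime.datetime.utcnow().strftime("%Y-%m-%d/%H%M%S")
    path = os.path.join(BASE_DIR, stamp)
    ds = tf.data.Dataset.from_tensors(trajectory)
    # The matching element_spec can be captured once as ds.element_spec.
    tf.data.experimental.save(ds, path)
    return path


def load_day(day, element_spec):
    """Load and concatenate every scoring run saved under one date folder."""
    day_dir = os.path.join(BASE_DIR, day)
    datasets = [
        tf.data.experimental.load(os.path.join(day_dir, run), element_spec)
        for run in sorted(os.listdir(day_dir))
    ]
    ds = datasets[0]
    for extra in datasets[1:]:
        ds = ds.concatenate(extra)
    return ds


# On each training day, replay only the runs from 4 days ago:
day_to_train = (datetime.date.today() - datetime.timedelta(days=4)).isoformat()
for experience in load_day(day_to_train, element_spec):
    loss_info = agent.train(experience)  # `agent` and `element_spec` assumed
```

With this layout, picking only the first 3 prediction runs of a day is just a matter of which subfolders you load.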
Hope you will find this helpful :) Michal
Thank you, Michal @kubistmi!