
PPO with Mini-Batches Tutorial

kochlisGit opened this issue 1 year ago

The PPO documentation describes the training process as follows:

from tf_agents.agents.ppo.ppo_clip_agent import PPOClipAgent
from tf_agents.replay_buffers.tf_uniform_replay_buffer import TFUniformReplayBuffer

# Build PPO agent
ppo_agent = PPOClipAgent(num_epochs=40, ...)

# Build replay buffer
replay_buffer = TFUniformReplayBuffer(
    data_spec=ppo_agent.collect_data_spec, batch_size=env.batch_size, max_length=1000)

# Train the agent on everything collected so far
experiences = replay_buffer.gather_all()
loss = ppo_agent.train(experiences).loss
replay_buffer.clear()

However, that way ppo_agent is trained on one large batch of experiences for 40 epochs. If the number of collected experiences is large (e.g. 1024), you might instead want to train PPO on mini-batches (e.g. 4 mini-batches of 256 experiences, with each of the 40 epochs visiting all 4 mini-batches once), as sketched below.
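
For illustration, the schedule I have in mind looks roughly like this (an index-only sketch in plain NumPy; how to slice the gathered TF-Agents Trajectory by these indices is deliberately left out):

import numpy as np

num_experiences = 1024
mini_batch_size = 256
num_epochs = 40

for epoch in range(num_epochs):
    # Fresh random order each epoch, but every experience is used exactly once.
    permutation = np.random.permutation(num_experiences)
    for start in range(0, num_experiences, mini_batch_size):
        mini_batch_indices = permutation[start:start + mini_batch_size]
        # Slice the gathered experiences with mini_batch_indices and call
        # ppo_agent.train(...) on that slice (slicing omitted here).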

The only way I found to do that is to build a dataset from replay_buffer and fetch experiences by iterating over the dataset. However, this produces randomly sampled batches instead of equally selected mini-batches:

# Use 1 epoch per train() call
ppo_agent = PPOClipAgent(num_epochs=1, ...)

# Build a dataset iterator over the replay buffer
dataset = replay_buffer.as_dataset(sample_batch_size=200, num_steps=2, num_parallel_calls=2).prefetch(2)
dataset_iter = iter(dataset)

# Training part: 40 epochs x 4 mini-batches per epoch
loss = 0.0
for _ in range(40):
    for _ in range(4):
        mini_batch_experiences, _ = next(dataset_iter)
        loss += ppo_agent.train(mini_batch_experiences).loss
replay_buffer.clear()
loss /= (40 * 4)

However, this approach has the following issue: it samples experiences from the buffer uniformly at random, so there is no guarantee that every experience is selected the same number of times per epoch; some experiences may be sampled repeatedly while others are never used (a quick simulation of this is included below). Is there a better method to train PPO on mini-batches? Also, for some reason this approach takes much longer to train than the single-batch approach above and gets worse training results, so am I missing something else here?
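
To make the coverage concern concrete, here is a quick standalone check in plain NumPy (independent of TF-Agents, and assuming the buffer samples uniformly with replacement): with 1024 stored experiences and 1024 uniform draws per epoch, roughly a third of the experiences are expected to be skipped in any given epoch.

import numpy as np

num_experiences = 1024
draws_per_epoch = 1024  # e.g. 4 mini-batches of 256
num_trials = 1000

unused_fractions = []
for _ in range(num_trials):
    # Draw uniformly with replacement, as uniform sampling from the buffer would.
    sampled = np.random.randint(0, num_experiences, size=draws_per_epoch)
    unused = num_experiences - np.unique(sampled).size
    unused_fractions.append(unused / num_experiences)

# Averages to roughly (1 - 1/N)**N ~= 0.37, i.e. about 37% of the stored
# experiences are never visited in a given "epoch" of uniform sampling.
print(np.mean(unused_fractions))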

kochlisGit, Dec 14 '22 10:12