
PPO implementation

Open nanastassacos opened this issue 6 years ago • 12 comments

Hi,

Is this version still up to date? I've run it with no changes, but the agent's scores oscillate around -1000. I set max_frames = 100000, and the agent still doesn't improve beyond a reward of -800 and tends to have large drop-offs in score.

nanastassacos avatar Jan 28 '19 15:01 nanastassacos

You'll see improvements if you increase num_steps and reset the environment after each iteration. I raised num_steps from 20 to 40, decreased the number of PPO updates to 1 per batch, lowered the learning rate to 1e-4, and saw a significant improvement, though it generally required a hefty number of iterations.
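For readers following along, here is a minimal sketch of the settings described above; the variable names (num_steps, ppo_epochs, lr) assume the notebook's hyperparameters and may differ in your copy:

# Hedged sketch of the settings suggested above; names assume the notebook's
# hyperparameter variables and may not match your copy exactly.
num_steps  = 40    # rollout length per update, raised from 20
ppo_epochs = 1     # a single PPO pass over each collected batch
lr         = 1e-4  # learning rate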

nanastassacos avatar Jan 28 '19 17:01 nanastassacos

Having the same trouble. Unsure what's wrong with the implementation, but it isn't working very well for me either.

[plot of training scores attached: myplot]

harry-uglow avatar Jan 29 '19 18:01 harry-uglow

I've found that increasing num_steps to around 40 and increasing mini_batch_size to 10-15 results in decent performance. Try decreasing ppo_epochs as well and slowly increasing it based on results.

Hope that helps!

nanastassacos avatar Jan 30 '19 11:01 nanastassacos

The real "game changer" for me was to run envs_reset() at each iteration

dariocazzani avatar Feb 02 '19 01:02 dariocazzani

Has anyone gotten this to work? My results didn't improve with the methods suggested. Could it be something else? Maybe the setup?

alexdupond avatar Apr 16 '19 07:04 alexdupond

I was looking into this, and I noticed that the results seem mostly dependent on the weight initialization. If you simply rerun the file multiple times, you can get results like those seen above or good results like those on the GitHub repo. I am not sure of the correct fix; the things listed above seem to help a bit, but none of them is a major fix.

bhansconnect avatar May 06 '19 20:05 bhansconnect

Also, commenting out self.apply(init_weights) makes things run slightly better in general.
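For anyone unsure which line is being discussed: the notebook applies a custom initializer to the actor-critic network via self.apply(init_weights). The sketch below shows the general pattern with a simplified PyTorch actor-critic; the exact initializer body and layer layout in the notebook may differ.

# Sketch of the initialization pattern under discussion; the initializer and
# network layout are simplified assumptions, not the notebook verbatim.
import torch.nn as nn

def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=0.1)
        nn.init.constant_(m.bias, 0.1)

class ActorCritic(nn.Module):
    def __init__(self, num_inputs, num_outputs, hidden_size):
        super().__init__()
        self.critic = nn.Sequential(
            nn.Linear(num_inputs, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )
        self.actor = nn.Sequential(
            nn.Linear(num_inputs, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, num_outputs),
        )
        # self.apply(init_weights)  # the line the comments above suggest commenting out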

bhansconnect avatar May 06 '19 20:05 bhansconnect

So, I tested a large number of hyperparameters, and this configuration seems to work a lot more consistently:

hidden_size      = 32
lr               = 1e-3
num_steps        = 128
mini_batch_size  = 256
ppo_epochs       = 30

and make sure to remove or comment out self.apply(init_weights) in the neural network definition.

It still isn't perfect, but it works better overall.

Lastly, I advise updating ppo_iter to this:

import numpy as np

def ppo_iter(mini_batch_size, states, actions, log_probs, returns, advantage):
    batch_size = states.size(0)
    # Shuffle the rollout indices, drop any remainder that does not fill a
    # mini-batch, then yield one mini-batch of transitions at a time.
    ids = np.random.permutation(batch_size)
    ids = np.split(ids[:batch_size // mini_batch_size * mini_batch_size], batch_size // mini_batch_size)
    for i in range(len(ids)):
        yield states[ids[i], :], actions[ids[i], :], log_probs[ids[i], :], returns[ids[i], :], advantage[ids[i], :]
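For context, a hedged sketch of how this generator would be consumed, using dummy rollout tensors just to show the mini-batch shapes it yields (the real tensors come from the rollout and advantage-estimation steps in the notebook, and the dimensions below assume a Pendulum-like setup):

# Illustration only: feed dummy rollout tensors through ppo_iter to see the
# mini-batches it yields; shapes are assumptions, not the notebook's.
import torch

batch = 128 * 16                                  # num_steps * num_envs
states     = torch.randn(batch, 3)
actions    = torch.randn(batch, 1)
log_probs  = torch.randn(batch, 1)
returns    = torch.randn(batch, 1)
advantages = torch.randn(batch, 1)

for s, a, lp, r, adv in ppo_iter(256, states, actions, log_probs, returns, advantages):
    print(s.shape)                                # torch.Size([256, 3]) for each mini-batch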

bhansconnect avatar May 09 '19 04:05 bhansconnect

(Quoting bhansconnect's hyperparameter and ppo_iter suggestion above.)

[training reward plot attached]

My result from what you suggested! It really improves the performance, from -1000 to -171!

jsrimr avatar Jun 25 '19 08:06 jsrimr

Does anyone have better GAIL hyperparameters?

lucifer2859 avatar Sep 26 '20 09:09 lucifer2859

In GAIL, I tested these hyperparameters, and they seem to work a lot more consistently:

a2c_hidden_size     = 32
discrim_hidden_size = 128
lr                  = 1e-3
num_steps           = 128
mini_batch_size     = 256
ppo_epochs          = 30
threshold_reward    = -200

lucifer2859 avatar Sep 26 '20 09:09 lucifer2859

(Quoting bhansconnect's hyperparameter and ppo_iter suggestion above.)

Wouldn't it be better to use ids = np.array_split(ids, batch_size // mini_batch_size) instead of ids = np.split(ids[:batch_size // mini_batch_size * mini_batch_size], batch_size // mini_batch_size), to avoid wasting the remaining part of the batch when batch_size is not an exact multiple of mini_batch_size? The code would be more general, I guess.
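A quick numpy check (tiny numbers chosen purely for illustration) shows the difference between the two calls:

# Illustration only: np.split requires an exact division, np.array_split does not.
import numpy as np

ids = np.random.permutation(10)                  # e.g. batch_size = 10, mini_batch_size = 4
kept = np.array_split(ids, 10 // 4)              # 2 chunks of 5 -> no samples dropped
print([len(c) for c in kept])                    # [5, 5]

trimmed = np.split(ids[:10 // 4 * 4], 10 // 4)   # original approach drops 2 samples
print([len(c) for c in trimmed])                 # [4, 4]

One thing to note: with np.array_split the remainder is spread across the chunks, so some mini-batches can end up slightly larger than mini_batch_size.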

Alessiobrini avatar Apr 24 '21 00:04 Alessiobrini