muzero-general
Added support for multi-dimensional continuous action spaces.
The changes fall into four themes:
- The prediction_policy_network output has size 2 * (action space dimension): one mean and one standard deviation per joint. The log-probability is computed for each joint and then summed.
- The dynamics_encoded_state_network function now takes an action array as input rather than a single action.
- Code that previously assumed a single action now has to work with arrays: np.random.choice, item, and dictionary handling.
- Changes to the TensorBoard logging so that video renders are saved.
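The first theme can be illustrated with a minimal sketch in plain Python. This is not the code from this fork: the function names, the layout of the output vector (means first, then spreads), and the choice to treat the second half as log standard deviations are all assumptions made for the example. It only shows how a policy output of length 2 * action_dim can be split into a per-joint Gaussian and how the per-joint log-probabilities are summed into one value.

```python
import math
import random


def split_policy_output(policy_output, action_dim):
    """Split a flat policy output of length 2 * action_dim into means and stds.

    Assumption (not from the fork): the first half holds the means and the
    second half holds log standard deviations, exponentiated to stay positive.
    """
    if len(policy_output) != 2 * action_dim:
        raise ValueError("expected 2 * action_dim policy outputs")
    means = list(policy_output[:action_dim])
    stds = [math.exp(v) for v in policy_output[action_dim:]]
    return means, stds


def gaussian_log_prob(x, mean, std):
    """Log density of a univariate normal distribution at x."""
    return (
        -0.5 * ((x - mean) / std) ** 2
        - math.log(std)
        - 0.5 * math.log(2 * math.pi)
    )


def sample_action(means, stds, rng):
    """Sample one value per joint and return the action with its summed log-prob.

    Summing per-joint log-probabilities treats the joints as independent,
    which matches a diagonal Gaussian policy.
    """
    action = [rng.gauss(m, s) for m, s in zip(means, stds)]
    log_prob = sum(
        gaussian_log_prob(a, m, s) for a, m, s in zip(action, means, stds)
    )
    return action, log_prob


if __name__ == "__main__":
    # Hypothetical 2-joint example: four policy outputs, two means and two log-stds.
    means, stds = split_policy_output([0.0, 1.0, 0.0, 0.0], action_dim=2)
    action, log_prob = sample_action(means, stds, random.Random(0))
    print(action, log_prob)
```

With an array-valued action like this, anything that previously took a single integer action (such as the dynamics network input) would receive the whole list instead.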
Results: the Sawyer shelf environment I added reached a reward of -43, which is not great, but the agent performs okay. It trained on one GPU for 110,000 training steps and 55,000 self-play games over 10 days.