mlsh
Observation of MovementBandits env
Hello, I have tried to train the MLSH policies in the MovementBandits environment, but the outputs of the master policy seem to be random even after training.
The command I tried is here:
mpirun -np 120 python3 main.py --task MovementBandits-v0 --num_subs 2 --macro_duration 10 --num_rollouts 2000 --warmup_time 9 --train_time 1 --replay False MovementBandits
I suspect the master policy needs an observation of the correct goal in order to select sub-policies, but the current implementation provides no information about it. Do you have any updates on MovementBandits?
I ran into the same problem. I used the run command from the README, but the outputs did not seem to change during training. Could you please let me know whether I am running the code correctly?
@natsuki14 It is the meta-learning objective of the master policy to choose the best sub-policy to reach the current goal. The correct goal remains a latent condition of the MDP.
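To make the control flow concrete, here is a minimal sketch of an MLSH-style rollout loop. All names (`UniformPolicy`, `hierarchical_rollout`, the toy `step` function) are hypothetical stand-ins, not the repo's actual API; the point is that the master re-selects a sub-policy index every `macro_duration` steps and the goal identity never appears in the observation, so the master can only learn from which sub-policy earns reward.

```python
import random

class UniformPolicy:
    """Stand-in for a learned policy: ignores the observation and
    samples an action index uniformly (hypothetical interface)."""
    def __init__(self, n_actions, seed=0):
        self._rng = random.Random(seed)
        self.n_actions = n_actions

    def act(self, obs):
        return self._rng.randrange(self.n_actions)

def hierarchical_rollout(step_fn, master, subs, macro_duration=10, horizon=50):
    """MLSH-style control loop: the master picks a sub-policy index
    every `macro_duration` steps; the chosen sub-policy then acts at
    every step. The goal identity is never part of `obs` -- it stays
    latent, and only the reward signal reveals which choice was right."""
    obs, total, k = 0, 0.0, 0
    for t in range(horizon):
        if t % macro_duration == 0:
            k = master.act(obs)          # master commits for one macro step
        action = subs[k].act(obs)        # chosen sub-policy controls the agent
        obs, reward = step_fn(obs, action)
        total += reward
    return total

# Toy stand-in for the environment: reward only when action 0 is taken.
step = lambda obs, action: (obs, 1.0 if action == 0 else 0.0)
master = UniformPolicy(n_actions=2, seed=1)
subs = [UniformPolicy(2, seed=2), UniformPolicy(2, seed=3)]
ret = hierarchical_rollout(step, master, subs)
```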
This relies on the hope that training produces two distinct, useful sub-policies for the respective goals, which may simply take a long time. Let's say both sub-policies converge to the same goal for some reason: then it won't matter which sub-policy is selected, leaving the master with a uniform policy. Just a possibility!
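The collapse described above can be seen in a few lines: if both sub-policies yield identical returns, a softmax over the master's value estimates is exactly uniform, so there is no gradient pressure to prefer either one. The return values below are made-up numbers for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Both sub-policies reach the same goal -> identical returns (hypothetical
# values) -> the master's selection distribution is exactly uniform.
probs = softmax([1.7, 1.7])

# Distinct sub-policies -> distinct returns -> the master can commit.
probs2 = softmax([1.7, 0.2])
```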
Hi, sorry to ask a question unrelated to this topic. How do you run the code, and how long did training take? I trained the 'Ant Obstacles Gen-v1' task and it is very slow; the other tasks are also very slow. Is there any solution?