Varying action space and hierarchical action
Hi! Thanks for your excellent work!
I'm diving into a complex environment where not all actions are available at each step.
Let's say there are 1000 actions (ranging from "id_0" to "id_999"). At each step, only 10 actions are available. These available actions form a set called 'candi_t' at time step t. For example,
[id_1, id_3, id_6, id_4, id_5,...., id_675] # 'candi_0' at step 0
[id_2, id_77, id_87, id_3, id_0,...., id_25] # 'candi_1' at step 1
Situation 1: The 'candi_t' at step t is known in advance (i.e., the env knows 'candi_t'). Then, as far as I know, there are two ways to implement the agent:
A straightforward solution is to make all actions 'legal' but use a mask to filter out the unavailable ones. Specifically, the actor outputs a probability vector of size 1000, and we assign extremely small values to all the 'illegal' actions. This involves a step in the network that sets the probability outputs of illegal actions to 0 and re-normalizes the rest to sum to 1 again. This method requires adding a 'mask' item to the "Batch".
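For what it's worth, here is a minimal sketch of what such a masked actor head could look like (plain PyTorch, assuming a boolean mask of shape (batch, 1000) that is True for the actions in 'candi_t'; the names are illustrative, not tianshou API):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedActor(nn.Module):
    """Illustrative actor that removes illegal actions via a boolean mask."""

    def __init__(self, state_dim: int, num_actions: int = 1000, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, obs: torch.Tensor, mask: torch.Tensor):
        # mask: (batch, num_actions) bool, True where the action is in candi_t
        logits = self.net(obs)
        # push illegal logits to -inf so softmax assigns them probability 0
        logits = logits.masked_fill(~mask, float("-inf"))
        return F.softmax(logits, dim=-1)  # re-normalized over legal actions only
```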
Another solution is to add 'candi_t' (the 10 available actions) to the "Batch" and customize an actor network that calculates the similarity between the state and each action in 'candi_t', finally outputting a probability vector of size 10 at each step t. For example:
ava_actions = env.available_actions()
action = agent.act(state, ava_actions) # calculate the similarity
state, reward, done, info = env.step(action)
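A rough sketch of such a similarity-based actor (assuming learned action-id embeddings and a dot-product score; class and argument names are illustrative only):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimilarityActor(nn.Module):
    """Illustrative actor that scores each candidate action against the state."""

    def __init__(self, state_dim: int, num_actions: int = 1000, emb_dim: int = 64):
        super().__init__()
        self.state_encoder = nn.Linear(state_dim, emb_dim)
        self.action_emb = nn.Embedding(num_actions, emb_dim)

    def forward(self, obs: torch.Tensor, candi: torch.Tensor):
        # obs: (batch, state_dim); candi: (batch, 10) ids of the available actions
        s = self.state_encoder(obs)                  # (batch, emb_dim)
        a = self.action_emb(candi)                   # (batch, 10, emb_dim)
        scores = torch.einsum("bd,bkd->bk", s, a)    # dot-product similarity
        return F.softmax(scores, dim=-1)             # distribution over the 10 candidates
```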
My first question is: which one is better with respect to code simplicity and learning performance? It seems that I need to customize the "update" part of the agent for solution 2, since the input of the actor network is not just a state.
Situation 2: The 'candi_t' at step t is constructed with the help of the agent. For example:
action_1 = agent.act1(state)
ava_actions = env.available_actions(action_1)
action_2 = agent.act2(state, action_1, ava_actions) # calculate the similarity
state, reward, done, info = env.step(action_2)
This seems to be a hierarchical RL problem; the reward will be used to update both act1 and act2.
Since I am a novice in DRL, I do not know which agent is appropriate for such a situation (the trajectory length will be about 16, the total number of trajectories is about 200, the agent explores 3 trajectories per iteration, and the rewards are 0 except at the final step T). A brute-force way is to try every possible algorithm (e.g., DQN, PPO), but it seems that I would need to customize the code of all the agents. My second question is: is there any way to do this conveniently?
My first question is: which one is better with respect to code simplicity and learning performance? It seems that I need to customize the "update" part of the agent for solution 2, since the input of the actor network is not just a state.
You should choose option 2; here's an easy way:
- write a wrapper that changes obs to {"obs": obs, "ava_actions": actions}, so that the network's input observation will be this batch of data, containing both "obs" and "ava_actions"; no need to change other parts of the code (a sketch of such a wrapper is given after the network example below);
- customize your own network to accept both obs and ava_actions:
import torch.nn as nn

class Net(nn.Module):
    def forward(self, s, **kwargs):
        # the wrapper guarantees that s carries both fields
        obs, ava_actions = s.obs, s.ava_actions
        ....
        return (...), None
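For the first bullet, a sketch of such a wrapper might look like this (written against the old gym API used in the snippets above; available_actions() is the environment method from your pseudocode):

```python
import gym


class AvailableActionsWrapper(gym.Wrapper):
    """Packs the currently available action set into the observation."""

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        return {"obs": obs, "ava_actions": self.env.available_actions()}

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        obs = {"obs": obs, "ava_actions": self.env.available_actions()}
        return obs, reward, done, info
```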
My second question is: is there any way to do this conveniently?
#403
Thank you for your quick reply!!
The solution in #403 seems to be Feudal Reinforcement Learning, which is a variant of hierarchical RL.
However, in my case, the problem is more like a hierarchical action space. To explain the hierarchical action space more clearly, there is an example in the paper Generalising Discrete Action Spaces with Conditional Action Trees. Figure 2 in the paper shows the actions decomposed into an action tree: one first selects a first-level action, then selects a second-level action. The first-level action space has size 3, and the second-level action space depends on the first-level choice. This example is very similar to my case:
action_1 = agent.act1(state) # first level action
ava_actions = env.available_actions(action_1) # get the second level action space
action_2 = agent.act2(state, action_1, ava_actions) # calculate the similarity to choose the second level action
state, reward, done, info = env.step(action_2)
I'll try to give one possible solution:
write a wrapper that changes obs to {"obs": obs, "ava_actions": actions}
here, "ava_actions" contains all of the possible second-level action sets (one per first-level action). Then
action1 = act1(obs)                           # first-level action
action2 = act2(obs, action1)                  # obs = {"obs": ..., "ava_actions": ...}
obs, reward, done, info = env.step(action2)   # the wrapper packs the new obs the same way
where act2 can be defined as:
import torch.nn as nn

class Net(nn.Module):
    def forward(self, s, a_1):
        # second-level candidate set chosen by the first-level action a_1
        obs, second_level_actions = s.obs, s.ava_actions[a_1]
        ....
        return (...), None
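To make the two-level idea a bit more concrete, here is one illustrative (unbatched) way to put both heads into a single module. This is only a sketch under the assumption that ava_actions is a list of candidate-id tensors indexed by the first-level action; it is not tianshou code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoLevelActor(nn.Module):
    """Sketch of a conditional two-level policy: pick a branch, then a leaf."""

    def __init__(self, state_dim: int, num_branches: int, num_actions: int, emb_dim: int = 64):
        super().__init__()
        self.encoder = nn.Linear(state_dim, emb_dim)
        self.branch_head = nn.Linear(emb_dim, num_branches)    # first-level logits
        self.action_emb = nn.Embedding(num_actions, emb_dim)   # embeddings of the action ids

    def forward(self, obs: torch.Tensor, ava_actions: list):
        # obs: (state_dim,); ava_actions[b]: 1-D tensor of ids legal under branch b
        h = self.encoder(obs)
        p1 = F.softmax(self.branch_head(h), dim=-1)
        a_1 = torch.multinomial(p1, 1).item()          # sample the first-level action
        cand = self.action_emb(ava_actions[a_1])       # (k, emb_dim), legal leaves only
        p2 = F.softmax(cand @ h, dim=-1)               # similarity over the conditional branch
        a_2 = torch.multinomial(p2, 1).item()
        return a_1, int(ava_actions[a_1][a_2])         # branch and concrete action id
```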
My question is, is there any better way to support such a conditional hierarchical action space?
This is called a Factorized Action Space. It was first introduced in the Dota 2 and StarCraft II projects from OpenAI and DeepMind, respectively.
I guess it requires changing a lot of code ...
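For context, in its simplest (non-conditional) form the factorized idea is just a set of independent categorical heads whose log-probabilities are summed when evaluating a composite action. A toy sketch (illustrative names, assuming two sub-action dimensions of sizes 3 and 10):

```python
import torch
import torch.nn as nn


class FactorizedHead(nn.Module):
    """Toy factorized action head: one categorical distribution per sub-action."""

    def __init__(self, feat_dim: int, dims=(3, 10)):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(feat_dim, d) for d in dims])

    def forward(self, h: torch.Tensor):
        # h: (batch, feat_dim) -> one Categorical per action dimension
        return [torch.distributions.Categorical(logits=head(h)) for head in self.heads]


# joint log-prob of a composite action (a_0, a_1):
#   log pi(a) = dists[0].log_prob(a_0) + dists[1].log_prob(a_1)
```

The conditional case, where the second dimension depends on the first choice, additionally needs masking or autoregressive conditioning on the first-level action, which is what the snippets above sketch.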