
Multi-agent algorithm

kargarisaac opened this issue 4 years ago · 4 comments

I'm trying to use PPO with an LSTM, and R2D1, in a multi-agent environment. I checked the other related issue, but it seems to be more about DDPG than recurrent models, so I'm asking my question here, if that's okay.

I have one environment with two agents, and in each env.step() only one agent acts, so the turns are sequential. I tried PPO with an LSTM and attempted to separate the data of the two agents from samples.env.observation (and the other fields from the sampler) based on the agent index, but I couldn't get good results. I'm not sure how to handle prev_rnn_state here. For example, I have data of shape [T, B, xx] and want to select the data for agent 1 and agent 2, build [T/2, 2*B, xx] data, and train one network on all of it. My problem is with prev_rnn_state and the bootstrap value. I know I have to modify the bootstrap value, but I'm not sure about prev_rnn_state. I think I have to get these two values from each agent's network, which depends on that agent's own history; for that, I think I have to modify the collector to use two networks, right? And then change the optimize_agent() method and the agent code too.
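For illustration, here is a minimal sketch of the reshaping I mean (assuming the two agents strictly alternate timesteps, agent 1 on even steps and agent 2 on odd steps; the helper name is just for illustration):

    import torch

    def split_alternating(x: torch.Tensor) -> torch.Tensor:
        """x: [T, B, ...] with T even; returns [T//2, 2*B, ...]."""
        agent1 = x[0::2]  # timesteps where agent 1 acted
        agent2 = x[1::2]  # timesteps where agent 2 acted
        return torch.cat([agent1, agent2], dim=1)  # stack the agents along the batch dim

    obs = torch.randn(8, 4, 16)          # e.g. T=8, B=4, feature size 16
    print(split_alternating(obs).shape)  # torch.Size([4, 8, 16])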

I want to have one network, train it on trajectory data from both agents, and use that network in both agents (maybe with delayed weights for one of them) for self-play.

I wanted to see if you have any suggestions. I'm also trying to change the collector code to store the data of the two agents in, for example, a 2-row array, instead of separating the data later.

Update: I take a trajectory of length T and use the first and last samples to calculate `prev_action`, `prev_reward`, `prev_rnn_state`, and the bootstrap value for the two agents. Here is my code:

    def optimize_agent(self, itr, samples):
        recurrent = self.agent.recurrent

        if self.do_separate:
            # separate data for agent 1 and 2
            agent_index = samples.env.env_info.stats.agent_index + 1  # 1 or 2
            agent_index = (1 - samples.env.done.int().float()) * agent_index.float()  # 0 or 1 or 2 -> 0 for done
            agent_index = agent_index - 1  # -1, 0, 1 -> -1 for done
            agent_index = agent_index.type(torch.int)

            t, b, f = samples.env.observation.shape

            # skip the first and last timesteps (reserved for the prev_* inputs and the
            # bootstrap value); each agent fills at most half of the remaining t - 2 steps
            observation0 = torch.zeros([int((t - 2)/2), b, f])
            observation1 = torch.zeros_like(observation0)

            prev_action0 = torch.zeros([int((t - 2)/2), b, 3])  # 3 = action dimension
            prev_action1 = torch.zeros_like(prev_action0)

            prev_reward0 = torch.zeros([int((t - 2)/2), b])
            prev_reward1 = torch.zeros_like(prev_reward0)

            old_dist_info0_mean = torch.zeros_like(prev_action0)
            old_dist_info0_logstd = torch.zeros_like(old_dist_info0_mean)
            old_dist_info1_mean = torch.zeros_like(old_dist_info0_mean)
            old_dist_info1_logstd = torch.zeros_like(old_dist_info0_mean)

            action0 = torch.zeros_like(prev_action0)
            action1 = torch.zeros_like(action0)

            reward0 = torch.zeros_like(prev_reward0)
            reward1 = torch.zeros_like(reward0)

            done0 = torch.ones_like(reward0).bool()
            done1 = torch.ones_like(done0).bool()

            value0 = torch.zeros_like(reward0)
            value1 = torch.zeros_like(value0)

            bv0 = torch.zeros([1, b])
            bv1 = torch.zeros_like(bv0)

            if recurrent:
                # init_rnn_state0 =
                pass

            for c in range(b):  # loop over the batch (B) dimension
                cnt0 = 0
                cnt1 = 0
                for r in range(1, t - 1):  # loop over time, skipping the first and last steps
                    if agent_index[r, c] == 0:
                        prev_action0[cnt0, c, :] = samples.agent.prev_action[r - 1, c, :]
                        prev_reward0[cnt0, c] = samples.env.prev_reward[r - 1, c]
                        observation0[cnt0, c, :] = samples.env.observation[r, c, :]
                        old_dist_info0_mean[cnt0, c, :] = samples.agent.agent_info.dist_info.mean[r, c, :]
                        old_dist_info0_logstd[cnt0, c, :] = samples.agent.agent_info.dist_info.log_std[r, c, :]
                        action0[cnt0, c, :] = samples.agent.action[r, c, :]
                        reward0[cnt0, c] = samples.env.reward[r, c]
                        done0[cnt0, c] = 0
                        value0[cnt0, c] = samples.agent.agent_info.value[r, c]
                        cnt0 += 1

                    elif agent_index[r, c] == 1:
                        prev_action1[cnt1, c, :] = samples.agent.prev_action[r - 1, c, :]
                        prev_reward1[cnt1, c] = samples.env.prev_reward[r - 1, c]
                        observation1[cnt1, c, :] = samples.env.observation[r, c, :]
                        old_dist_info1_mean[cnt1, c, :] = samples.agent.agent_info.dist_info.mean[r, c, :]
                        old_dist_info1_logstd[cnt1, c, :] = samples.agent.agent_info.dist_info.log_std[r, c, :]
                        action1[cnt1, c, :] = samples.agent.action[r, c, :]
                        reward1[cnt1, c] = samples.env.reward[r, c]
                        done1[cnt1, c] = 0
                        value1[cnt1, c] = samples.agent.agent_info.value[r, c]
                        cnt1 += 1

                r2 = t - 1  # walk backward from the last timestep to set the bootstrap values; skip rows marked done (agent_index == -1)
                while True:
                    if agent_index[r2, c] == 0:
                        bv1[0, c] = samples.agent.bootstrap_value[0, c]
                        bv0[0, c] = samples.agent.agent_info.value[r2, c]
                        break
                    elif agent_index[r2, c] == 1:
                        bv0[0, c] = samples.agent.bootstrap_value[0, c]
                        bv1[0, c] = samples.agent.agent_info.value[r2, c]
                        break
                    else: #-1
                        r2 -= 1
                        if r2 < 0: break

            # concatenate the data of the two agents along the batch (B) dimension
            observation = torch.cat([observation0, observation1], dim=1)
            prev_action = torch.cat([prev_action0, prev_action1], dim=1)
            prev_reward = torch.cat([prev_reward0, prev_reward1], dim=1)
            old_dist_info_mean = torch.cat([old_dist_info0_mean, old_dist_info1_mean], dim=1)
            old_dist_info_logstd = torch.cat([old_dist_info0_logstd, old_dist_info1_logstd], dim=1)
            old_dist_info = DistInfoStd(mean=old_dist_info_mean, log_std=old_dist_info_logstd)
            action = torch.cat([action0, action1], dim=1)
            reward = torch.cat([reward0, reward1], dim=1)
            done = torch.cat([done0, done1], dim=1).type(reward.dtype)
            value = torch.cat([value0, value1], dim=1)
            bv = torch.cat([bv0, bv1], dim=1)

        else:
            observation = samples.env.observation
            prev_action = samples.agent.prev_action
            prev_reward = samples.env.prev_reward
            old_dist_info = samples.agent.agent_info.dist_info
            action = samples.agent.action
            reward = samples.env.reward
            done = samples.env.done
            done = done.type(reward.dtype)
            value = samples.agent.agent_info.value
            bv = samples.agent.bootstrap_value


        agent_inputs = AgentInputs(  # Move inputs to device once, index there.
            observation=observation,
            prev_action=prev_action,
            prev_reward=prev_reward,
        )
        agent_inputs = buffer_to(agent_inputs, device=self.agent.device)
        if hasattr(self.agent, "update_obs_rms"):
            self.agent.update_obs_rms(agent_inputs.observation)

        return_, advantage, valid = self.process_returns(reward, done, value, bv)
        
      ...

kargarisaac · May 13 '20

After thinking more about the multi-agent implementation, I think I need two networks for the two agents: use the data from both agents to update one network, and then periodically copy its weights into the other network. I also think that having two separate buffers for the two agents might be easier than one buffer. Because I use an LSTM, I think the networks should be separate so that each agent has its own hidden state.
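Something like this is what I have in mind for the periodic weight copy (a plain-PyTorch sketch, not rlpyt-specific; the network and the sync interval are just placeholders):

    import copy
    import torch.nn as nn

    learner = nn.LSTM(input_size=16, hidden_size=32)  # stands in for the trained policy network
    opponent = copy.deepcopy(learner)                 # delayed copy used by the second agent
    for p in opponent.parameters():
        p.requires_grad_(False)                       # the opponent copy is never trained directly

    SYNC_EVERY = 100  # placeholder sync interval, in optimizer iterations

    def maybe_sync(itr: int) -> None:
        if itr % SYNC_EVERY == 0:
            opponent.load_state_dict(learner.state_dict())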

kargarisaac · May 15 '20

Interesting setup! To have a separate RNN state for each agent and alternate which RNN state gets used/updated in agent.step(), you can actually use the already-existing AlternatingRecurrentAgentMixin in place of the usual RecurrentAgentMixin: https://github.com/astooke/rlpyt/blob/668290d1ca94e9d193388a599d4f719bc3a23fba/rlpyt/agents/base.py#L306

It's quite a coincidence that this would be useful to you :) The original purpose of that mixin is to use it with the alternating sampler, where there are two separate groups of parallel environments, with two separate groups of parallel agent RNN states, call them (A) and (B), separated along the batch dimension. Essentially, every time agent.step() is called, it alternates which half of the RNN state (either A or B) gets used/advanced. Once you're inside the algorithm optimization, the RNN data will be organized such that the A agents are in the first half of the batch indexes, agent_info.prev_rnn_state[:, :half_B], and the B agents are in the second half, agent_info.prev_rnn_state[:, half_B:] (where the leading dimension is time). So you can use that agent with the usual sampler (not the alternating sampler), and you should get the effect you're looking for. :)
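A small sketch of that data layout, in plain torch with example sizes (the tensor here just stands in for agent_info.prev_rnn_state, and half_B is not something rlpyt defines for you):

    import torch

    T, B, H = 10, 8, 32                       # time, batch, hidden size (example values)
    prev_rnn_state = torch.randn(T, B, H)     # stands in for agent_info.prev_rnn_state
    half_B = B // 2

    rnn_state_A = prev_rnn_state[:, :half_B]  # RNN states advanced on the A steps
    rnn_state_B = prev_rnn_state[:, half_B:]  # RNN states advanced on the B steps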

Let us know if that works!

astooke · May 21 '20

@astooke Thank you for your answer. I'll try it and let you know.

kargarisaac · May 21 '20

@astooke Hi, in addition to the RNN state handling, I have a problem with the LSTM network. When I have two agents, I need to put a copy of the model into each of them so that they sample different trajectories with different hidden states. Do you have any suggestions on where I should start making changes to handle this? I'm thinking of separating the trajectories of the two agents after getting the samples from the sampler. In PPO and R2D1, I think I can separate the data for the two agents at the beginning of the optimize_agent() method (in R2D1, before storing in the replay buffer). Because I want separate trajectories, I need different hidden states, which come from two separate networks for the two agents. So I need to use two copies of the trained model, with shared parameters between the agents. Do you think AlternatingRecurrentAgentMixin is enough, or do I need to use the alternating sampler too? Thank you again.

Update: After exploring the code more, it seems that having one agent inherit from AlternatingRecurrentAgentMixin works. In the PPO agent's `step()` method, prev_rnn_state switches at each time step, so the correct rnn_state is used for the corresponding agent. There is then no need to have multiple models for multiple agents, as long as both agents use the same, most recent parameters. If we want different agents to use different versions of the parameters, I think we can change the methods of the PPO agent class to hold several models and handle updating the model parameters however we want.

kargarisaac · May 24 '20