
[Question] What modifications do I need to mask the inputs similar to how MaskablePPO masks the outputs?


❓ Question

Hi sb3-contrib community,

Scenario: I have run into an issue where I am not able to generalize the length of my observations (1-dimensional Box) and actions (Discrete), because my environment changes in size according to the data provided to populate it. For context, I am trying to solve a slotting optimization problem in warehouses, where different warehouses can have varying numbers of bins and items. As you might imagine, the actions are swaps, and the number of actions can differ greatly between warehouses. The length of the inputs also depends on the number of bins and items, since I am providing essential information such as the demand of each item, the distance of each bin from the packing zone, etc.

I have looked pretty much everywhere (to no avail), but ChatGPT suggested using padding. The idea goes like this:


  1. Set the maximum length of observation and action
  2. Populate the environment
  3. Pad the observation and action if necessary
  4. Use masking on the actions and observations so that the padded values do not affect backpropagation
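The padding side of the steps above can be sketched as follows. This is only an illustration: `MAX_BINS`, the per-bin feature count, and the flat swap-action encoding are all assumptions I made up, not anything from sb3-contrib.

```python
import numpy as np

MAX_BINS = 50        # assumed upper bound on bins across all warehouses
OBS_PER_BIN = 3      # e.g. demand, distance from packing zone, occupancy (assumption)
MAX_ACTIONS = MAX_BINS * (MAX_BINS - 1) // 2   # one action per possible bin swap

def pad_observation(obs: np.ndarray) -> np.ndarray:
    """Zero-pad a variable-length observation to the fixed maximum size."""
    padded = np.zeros(MAX_BINS * OBS_PER_BIN, dtype=np.float32)
    padded[: obs.size] = obs
    return padded

def action_mask(n_valid_actions: int) -> np.ndarray:
    """True for real swap actions, False for padded (invalid) ones."""
    mask = np.zeros(MAX_ACTIONS, dtype=bool)
    mask[:n_valid_actions] = True
    return mask

obs = np.random.rand(10 * OBS_PER_BIN).astype(np.float32)  # a 10-bin warehouse
padded = pad_observation(obs)   # shape (150,)
mask = action_mask(45)          # 45 valid swaps, rest masked out
```

The environment would then declare a fixed-size observation space of length `MAX_BINS * OBS_PER_BIN` and a `Discrete(MAX_ACTIONS)` action space, and return the mask from its `action_masks()` method for MaskablePPO.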

I also found the apply_masking function:

```python
from typing import Optional

import numpy as np
import torch as th
from torch.distributions.utils import logits_to_probs


def apply_masking(self, masks: Optional[np.ndarray]) -> None:
    """
    Eliminate ("mask out") chosen categorical outcomes by setting their probability to 0.

    :param masks: An optional boolean ndarray of compatible shape with the distribution.
        If True, the corresponding choice's logit value is preserved. If False, it is set
        to a large negative value, resulting in near 0 probability. If masks is None, any
        previously applied masking is removed, and the original logits are restored.
    """
    if masks is not None:
        device = self.logits.device
        self.masks = th.as_tensor(masks, dtype=th.bool, device=device).reshape(self.logits.shape)
        HUGE_NEG = th.tensor(-1e8, dtype=self.logits.dtype, device=device)

        logits = th.where(self.masks, self._original_logits, HUGE_NEG)
    else:
        self.masks = None
        logits = self._original_logits

    # Reinitialize the distribution with the updated logits
    super().__init__(logits=logits)

    # self.probs may already be cached, so we must force an update
    self.probs = logits_to_probs(self.logits)
```
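To see the masking trick in isolation, here is a standalone NumPy mirror of the `th.where` step above (my own sketch, not library code): replacing masked-out logits with a huge negative value before the softmax drives their probability to effectively zero.

```python
import numpy as np

def masked_softmax(logits: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Replace masked-out logits with a huge negative value, then softmax,
    so invalid choices end up with near-zero probability."""
    HUGE_NEG = -1e8
    masked = np.where(mask, logits, HUGE_NEG)
    z = np.exp(masked - masked.max())  # subtract max for numerical stability
    return z / z.sum()

probs = masked_softmax(np.array([1.0, 2.0, 3.0, 4.0]),
                       np.array([True, False, True, False]))
# the masked entries (indices 1 and 3) receive ~0 probability
```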

Edit: I found this in the sb3_contrib/common/maskable/policies.py file. Can the apply_masking() method be applied to the features produced by the self.extract_features() method?

```python
def forward(
    self,
    obs: th.Tensor,
    deterministic: bool = False,
    action_masks: Optional[np.ndarray] = None,
) -> Tuple[th.Tensor, th.Tensor, th.Tensor]:
    """
    Forward pass in all the networks (actor and critic)

    :param obs: Observation
    :param deterministic: Whether to sample or use deterministic actions
    :param action_masks: Action masks to apply to the action distribution
    :return: action, value and log probability of the action
    """
    # Preprocess the observation if needed
    features = self.extract_features(obs)
    if self.share_features_extractor:
        latent_pi, latent_vf = self.mlp_extractor(features)
    else:
        pi_features, vf_features = features
        latent_pi = self.mlp_extractor.forward_actor(pi_features)
        latent_vf = self.mlp_extractor.forward_critic(vf_features)
    # Evaluate the values for the given observations
    values = self.value_net(latent_vf)
    distribution = self._get_action_dist_from_latent(latent_pi)
    if action_masks is not None:
        distribution.apply_masking(action_masks)
    actions = distribution.get_actions(deterministic=deterministic)
    log_prob = distribution.log_prob(actions)
    return actions, values, log_prob
```

Question/Request: This brought my attention to MaskablePPO, where the masking of actions has already been implemented. As such, I would like to ask for advice on how I would modify the algorithm to mask the inputs as well. Please provide filepaths to the relevant source code that would need modification too! Thank you :)
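On the input side, one possible direction (purely a sketch of mine, not an existing sb3-contrib feature) is to multiply the padded observation by a 0/1 mask before the first linear layer of the features extractor. In SB3 this would live in a `BaseFeaturesExtractor` subclass passed via `policy_kwargs`, with the mask carried alongside the observation (e.g. in a Dict observation space); the plain-PyTorch core of the idea is just:

```python
import torch as th
import torch.nn as nn

class MaskedInputExtractor(nn.Module):
    """Sketch: zero out padded observation slots before the MLP.
    (A plain nn.Module here so the idea stands alone; in SB3 you would
    subclass BaseFeaturesExtractor instead -- names are my own.)"""

    def __init__(self, n_input: int, features_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_input, features_dim), nn.ReLU())

    def forward(self, obs: th.Tensor, obs_mask: th.Tensor) -> th.Tensor:
        # Multiplying by the 0/1 mask zeroes the padded slots, so their
        # values cannot influence the output; since those inputs are zero,
        # the weights connected to them also receive zero gradient.
        return self.net(obs * obs_mask)
```

Note this masks the raw inputs, not the action distribution, so it is a separate mechanism from `apply_masking()` above; whether it helps the policy generalize across warehouse sizes is something you would have to test.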

Remarks: If anyone has encountered this problem and solved it, could you share your insights on how you did it? If anyone has any idea how to generalize the inputs without the need for masking, please do comment away!

Checklist

  • [X] I have checked that there is no similar issue in the repo
  • [X] I have read the documentation
  • [X] If code there is, it is minimal and working
  • [X] If code there is, it is formatted using the markdown code blocks for both code and stack traces.

rllyryan avatar Mar 21 '23 02:03 rllyryan