Adds carflag-v0 env

Open ashok-arora opened this issue 1 year ago • 16 comments

Car Flag tasks a car with driving across a 1D line to the correct flag. The car must first drive to the oracle flag and then to the correct endpoint. The agent's observation is a vector of three floats: its position on the line, its velocity at each timestep, and the goal flag's location, which is revealed only once the car reaches the oracle flag. The agent's actions alter its velocity: it can accelerate left, perform a no-op (maintain its current velocity), or accelerate right.
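
A minimal sketch of what those spaces could look like, assuming a gymnasium-style interface; the exact bounds and action ordering below are illustrative, not the env's actual definition:

    import numpy as np
    import gymnasium as gym

    # Illustrative spaces only; the actual env defines its own bounds.
    observation_space = gym.spaces.Box(
        low=np.array([-1.1, -0.07, -1.0], dtype=np.float32),  # position, velocity, oracle signal
        high=np.array([1.1, 0.07, 1.0], dtype=np.float32),
        dtype=np.float32,
    )
    action_space = gym.spaces.Discrete(3)  # 0: accelerate left, 1: no-op, 2: accelerate right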

ashok-arora avatar Jun 29 '24 10:06 ashok-arora

Thank you for the detailed comments and support. I have added the easy, medium, and hard levels and updated the list of environments accordingly.

While writing the tests, I noticed a bug in the DTQN codebase: the env currently returns only a single done flag instead of distinguishing between truncated and terminated states.

To address this, I looked at the minesweeper env, which combines the timestep with the board dimensions to compute the maximum episode length; that scales naturally across game levels because the dimensions change. Could you suggest a similarly scalable way to compute the truncation condition across all levels of the carflag env?
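
For reference, this is the split I mean, sketched against the gymnasium five-tuple step API (max_episode_steps and the terminal check are placeholder names):

    def step(self, action):
        ...
        # Separate a true terminal state from simply running out of time.
        terminated = self._reached_heaven_or_hell()          # placeholder terminal check
        truncated = self.timestep >= self.max_episode_steps  # time limit only
        return obs, reward, terminated, truncated, info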

ashok-arora avatar Jul 01 '24 07:07 ashok-arora

Yeah this is tricky. The minimum time required would be the time to reach maximum velocity, drive to the oracle, slow down to 0 velocity, accelerate to max velocity in the direction of the goal, then drive to the goal.

Also, we would probably want to be in "heaven" or "hell" for more than a single timestep to help with credit assignment.

Do you know how many timesteps they used in the DTQN paper? If so, maybe we can just scale that. For example, maybe the standard/easy mode takes a minimum of 20 timesteps to reach the oracle and then heaven, and DTQN gives the agent 40 timesteps, leaving 20 timesteps to remain in "heaven".

If the harder mode takes 60 timesteps to reach the oracle and heaven, then we can set the episodes to be 120 timesteps long. These are just made-up numbers.
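
One rough way to turn that into a number, assuming mountain-car-style dynamics (velocity changes by a fixed power per step, capped at a max speed). Every name and constant below is illustrative, and it ignores the deceleration at the oracle, so it slightly underestimates:

    def min_steps_to(target, start=0.0, power=0.0015, max_speed=0.07):
        """Steps for a car that starts at rest and always accelerates toward `target`."""
        direction = 1.0 if target > start else -1.0
        pos, vel, steps = start, 0.0, 0
        while (target - pos) * direction > 0:  # stop once we reach or pass the target
            vel = max(-max_speed, min(max_speed, vel + direction * power))
            pos += vel
            steps += 1
        return steps

    # Example layout, not the real env constants.
    oracle_position, goal_position = 0.5, 1.0
    min_total = min_steps_to(oracle_position) + min_steps_to(goal_position, start=oracle_position)
    max_episode_steps = 2 * min_total  # roughly half the budget left to sit at the goal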

smorad avatar Jul 03 '24 10:07 smorad

The DTQN paper sets the maximum_time_steps to 200 for the easy mode.

https://github.com/kevslinger/DTQN/blob/79ccf8b548a2f6263b770e051b42ced2932137ee/envs/__init__.py#L43-L48

ashok-arora avatar Jul 03 '24 10:07 ashok-arora

Would it be better to set an arbitrary value or one based on the environment conditions?

ashok-arora avatar Jul 03 '24 13:07 ashok-arora

Ideally, I'd say it should be set based on environment conditions. How many timesteps would it take an optimal policy to reach the oracle and then heaven?

smorad avatar Jul 04 '24 10:07 smorad

Hey, I trained the DTQN policy and it took 85 timesteps to reach the oracle and then heaven.

ashok-arora avatar Jul 05 '24 14:07 ashok-arora

Oh if you have a working DTQN implementation it's even easier! How many timesteps does it take on medium and hard?

smorad avatar Jul 06 '24 10:07 smorad

85-137 for easy, 125-175 for medium, and 165-210 for hard.

ashok-arora avatar Jul 06 '24 11:07 ashok-arora

How does easy: 200, medium: 300, hard: 400 timesteps sound?
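
Sketched against the constructor with made-up names (the actual PR can just as well hard-code one budget per Easy/Medium/Hard class):

    # One way to wire in the proposed budgets; the names here are illustrative.
    class CarFlag:
        MAX_EPISODE_STEPS = {"easy": 200, "medium": 300, "hard": 400}

        def __init__(self, difficulty="easy"):
            self.max_episode_steps = self.MAX_EPISODE_STEPS[difficulty]
            self.timestep = 0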

smorad avatar Jul 06 '24 11:07 smorad

Sounds good :+1:

ashok-arora avatar Jul 06 '24 11:07 ashok-arora

Hey @smorad, I am getting this error in the test cases but am unable to figure out what's wrong:

=================================================================== short test summary info ===================================================================
FAILED tests/test_wrappers.py::test_markovian_state_space_full[CarFlagEasy] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([-0.4428458,  0.       ,  0.       ], dtype=float...
FAILED tests/test_wrappers.py::test_markovian_state_space_full[CarFlagMedium] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([-0.3615731,  0.       ,  0.       ], dtype=float...
FAILED tests/test_wrappers.py::test_markovian_state_space_full[CarFlagHard] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([1.020816, 0.      , 0.      ], dtype=float32), 0...
FAILED tests/test_wrappers.py::test_markovian_state_space_partial[CarFlagEasy] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data [ 0.32249273 -0.0015     -1.        ]
FAILED tests/test_wrappers.py::test_markovian_state_space_partial[CarFlagMedium] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data [ 0.73226799 -0.0015      0.        ]
FAILED tests/test_wrappers.py::test_markovian_state_space_partial[CarFlagHard] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data [-0.7669738  0.0015     0.       ]
FAILED tests/test_wrappers.py::test_markovian_state_space_info_dict[CarFlagEasy] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([0.10640713, 0.        , 0.        ]), 0.5, 1.0, ...
FAILED tests/test_wrappers.py::test_markovian_state_space_info_dict[CarFlagMedium] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([-0.65530812, -0.0015    ,  0.        ]), 0.5, -1...
FAILED tests/test_wrappers.py::test_markovian_state_space_info_dict[CarFlagHard] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([0.25203142, 0.        , 0.        ]), 0.5, -1.0,...
FAILED tests/test_wrappers.py::test_state_space_full_and_partial[CarFlagEasy] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([0.50059146, 0.        , 0.        ], dtype=float...
FAILED tests/test_wrappers.py::test_state_space_full_and_partial[CarFlagMedium] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([-1.0373735,  0.       ,  0.       ], dtype=float...
FAILED tests/test_wrappers.py::test_state_space_full_and_partial[CarFlagHard] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([0.86538035, 0.        , 0.        ], dtype=float...
=================================================== 12 failed, 441 passed, 6 skipped, 417 warnings in 5.21s ===================================================

I tried to np.clip the values as well, yet I don't see how the values are out of bounds. Could you please help?

ashok-arora avatar Aug 25 '24 17:08 ashok-arora

Ah, I think this might be related to our Markovian state space requirement. With POMDPs, it is useful to have the full state available for debugging. I think you need to add a state_space member variable to your class (see here for an example) and also implement a get_state method that returns a Markov state (see here).

I think the issue is that state_space is defined as an np.array of shape (3,), while you return a tuple of four elements from get_state.
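
In other words, roughly this shape, with assumed attribute names rather than your actual code:

    import numpy as np
    import gymnasium as gym

    class CarFlag(gym.Env):
        def __init__(self):
            # Must match what get_state() returns: shape (3,), float32, inside these bounds.
            self.state_space = gym.spaces.Box(
                low=np.array([-1.1, -0.07, -1.0], dtype=np.float32),
                high=np.array([1.1, 0.07, 1.0], dtype=np.float32),
                dtype=np.float32,
            )

        def get_state(self):
            # A single (3,) float32 array, not an (obs, reward, done, info) tuple.
            return np.asarray(
                [self.position, self.velocity, self.oracle_signal], dtype=np.float32
            )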

smorad avatar Aug 26 '24 14:08 smorad

Thank you for the advice. I’ll look into the dimensions of get_state.

ashok-arora avatar Aug 27 '24 04:08 ashok-arora

Changing the get_state to return just the observation (of size 3) also gives the same error.

    def get_state(self):
        # Return the position of the car, oracle, and goal
        return self.state

ashok-arora avatar Aug 27 '24 17:08 ashok-arora

Yay, it passes the tests now! @smorad, could you review the code quality? Lastly, should I add new tests, or would the existing ones suffice?

ashok-arora avatar Aug 27 '24 17:08 ashok-arora

Great work! Your code is good. I think I have just one more recommendation:

To be in line with the other popgym environments, I think it would be best if the episodic reward (undiscounted return) were always between -1.0 and 1.0. Maybe you could rescale the reward along these lines:

    # Most steps the agent could possibly spend at the goal: total budget minus travel time.
    max_reward = self.max_steps - (self.heaven_position / self.max_velocity)
    reward = reward / max_reward

As for tests, I would suggest testing that the episode ends after you reach the boundary, and that the episode return is always bounded in [-1, 1]. But it's up to you.
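
For example, the return-bounding check could look roughly like this (the import path and class name are guesses at how the env ends up registered; the boundary test would follow the same loop but assert terminated once the car crosses an edge):

    from popgym.envs.carflag import CarFlagEasy  # hypothetical module path and class name

    def test_carflag_return_bounded():
        env = CarFlagEasy()
        env.reset(seed=0)
        episode_return, done = 0.0, False
        while not done:
            obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
            episode_return += reward
            done = terminated or truncated
        # The undiscounted return should stay inside [-1, 1] after rescaling.
        assert -1.0 <= episode_return <= 1.0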

smorad avatar Aug 29 '24 10:08 smorad