Adds carflag-v0 env
Car Flag tasks a car with driving across a 1D line to the correct flag. The car must first drive to the oracle flag and then to the correct endpoint. The agent's observation is a vector of three floats: its position on the line, its current velocity, and the goal flag's location (revealed once it reaches the oracle flag). The agent's actions alter its velocity: it can accelerate left, perform a no-op (maintain current velocity), or accelerate right.
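For reference, the spaces roughly look like this (a sketch only; the bounds are the ones that appear in the test output further down, and the action indices are illustrative):

```python
import numpy as np
import gymnasium as gym  # or gym, depending on the popgym version

# Observation: [position, velocity, goal flag hint (0 until the oracle is reached)]
observation_space = gym.spaces.Box(
    low=np.array([-1.1, -0.07, -1.0], dtype=np.float32),
    high=np.array([1.1, 0.07, 1.0], dtype=np.float32),
)
# Actions: 0 = accelerate left, 1 = no-op, 2 = accelerate right
action_space = gym.spaces.Discrete(3)
```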
Thank you for the detailed comments and support. I have added the easy, medium, and hard levels and updated the list of environments accordingly.
While writing the tests, I noticed a bug in the DTQN codebase. The env currently returns only a single done flag instead of differentiating between truncated and terminated states.
To address this, I looked at the minesweeper env, which combines the timestep with the board dimensions to calculate the maximum time an episode could take; that scales naturally across game levels since the dimensions change. Could you suggest a similarly scalable way to calculate the truncation condition across all levels of the carflag env?
Yeah this is tricky. The minimum time required would be the time to reach maximum velocity, drive to the oracle, slow down to 0 velocity, accelerate to max velocity in the direction of the goal, then drive to the goal.
Also, we would probably want to be in "heaven" or "hell" for more than a single timestep to help with credit assignment.
Do you know how many timesteps they used in the DTQN paper? If so, maybe we can just scale that. For example, maybe the standard/easy mode would take a minimum of 20 timesteps to reach the oracle and then heaven, and DTQN gives the agent 40 timesteps, so 20 timesteps to remain in "heaven".
If the harder mode takes 60 timesteps to reach oracle and heaven, then we can set the episodes to be 120 timesteps long. These are just made up numbers.
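Something along these lines could give a rough lower bound, assuming the same simple dynamics as the DTQN Car Flag env (position += velocity, velocity += action * power, with power ≈ 0.0015 and max speed 0.07, i.e. the bounds visible in the test output below); the helper name and distances are made up:

```python
def min_steps_to_cover(distance, power=0.0015, max_speed=0.07):
    """Rough lower bound on the steps needed to cover `distance`,
    starting from rest and accelerating every step (hypothetical helper)."""
    pos, vel, steps = 0.0, 0.0, 0
    while pos < distance:
        vel = min(vel + power, max_speed)
        pos += vel
        steps += 1
    return steps

# e.g. oracle 1.0 away and heaven another 1.0 past it (ignoring the turn-around):
min_steps = min_steps_to_cover(1.0) + min_steps_to_cover(1.0)
# The episode limit could then be something like 2 * min_steps, leaving the
# agent roughly as long again to sit in heaven.
```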
The DTQN paper sets maximum_time_steps to 200 for the easy mode.
https://github.com/kevslinger/DTQN/blob/79ccf8b548a2f6263b770e051b42ced2932137ee/envs/init.py#L43-L48
Would it be better to set an arbitrary value or one based on the environment conditions?
Ideally, I'd say it should be set based on environment conditions. How many timesteps would it take an optimal policy to reach the oracle and then heaven?
Hey, I trained the DTQN policy and it took 85 timesteps to reach the oracle and then heaven.
Oh if you have a working DTQN implementation it's even easier! How many timesteps does it take on medium and hard?
85-137 for easy, 125-175 for medium, and 165-210 for hard.
How does easy: 200, medium: 300, hard: 400 timesteps sound?
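(Purely illustrative: these limits could just be keyed off the difficulty in the constructor; the attribute and class names below are made up.)

```python
import gymnasium as gym  # or gym, depending on the popgym version

MAX_EPISODE_STEPS = {"easy": 200, "medium": 300, "hard": 400}

class CarFlag(gym.Env):
    def __init__(self, difficulty: str = "easy"):
        # Episode is truncated once this many steps have elapsed
        self.max_steps = MAX_EPISODE_STEPS[difficulty]
        self.timestep = 0
```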
Sounds good :+1:
Hey @smorad, I am getting this error in the test cases but am unable to figure out what's wrong:
=================================================================== short test summary info ===================================================================
FAILED tests/test_wrappers.py::test_markovian_state_space_full[CarFlagEasy] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data (array([-0.4428458, 0. , 0. ], dtype=float...
FAILED tests/test_wrappers.py::test_markovian_state_space_full[CarFlagMedium] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data (array([-0.3615731, 0. , 0. ], dtype=float...
FAILED tests/test_wrappers.py::test_markovian_state_space_full[CarFlagHard] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data (array([1.020816, 0. , 0. ], dtype=float32), 0...
FAILED tests/test_wrappers.py::test_markovian_state_space_partial[CarFlagEasy] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data [ 0.32249273 -0.0015 -1. ]
FAILED tests/test_wrappers.py::test_markovian_state_space_partial[CarFlagMedium] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data [ 0.73226799 -0.0015 0. ]
FAILED tests/test_wrappers.py::test_markovian_state_space_partial[CarFlagHard] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data [-0.7669738 0.0015 0. ]
FAILED tests/test_wrappers.py::test_markovian_state_space_info_dict[CarFlagEasy] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data (array([0.10640713, 0. , 0. ]), 0.5, 1.0, ...
FAILED tests/test_wrappers.py::test_markovian_state_space_info_dict[CarFlagMedium] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data (array([-0.65530812, -0.0015 , 0. ]), 0.5, -1...
FAILED tests/test_wrappers.py::test_markovian_state_space_info_dict[CarFlagHard] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data (array([0.25203142, 0. , 0. ]), 0.5, -1.0,...
FAILED tests/test_wrappers.py::test_state_space_full_and_partial[CarFlagEasy] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data (array([0.50059146, 0. , 0. ], dtype=float...
FAILED tests/test_wrappers.py::test_state_space_full_and_partial[CarFlagMedium] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data (array([-1.0373735, 0. , 0. ], dtype=float...
FAILED tests/test_wrappers.py::test_state_space_full_and_partial[CarFlagHard] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data (array([0.86538035, 0. , 0. ], dtype=float...
=================================================== 12 failed, 441 passed, 6 skipped, 417 warnings in 5.21s ===================================================
I tried np.clip on the values as well, yet I don't see how they are out of bounds. Could you please help?
Ah I think this might be related to our Markovian state space requirement. So with POMDPs, it is useful to have the state available for debugging. I think you might need to add the state_space member variable to your class. See here for an example. You should also implement a get_state method that returns a Markov state (see here).
I think the issue is that state_space is defined to be an np.array of shape (3,) while you return a tuple of four elements from get_state.
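For example, something like this is what I mean (just a sketch; attribute names like heaven_direction are placeholders for whatever your env already tracks, and note the Box is float32, so get_state should return a float32 array of shape (3,)):

```python
import numpy as np
import gymnasium as gym  # or gym, depending on the popgym version

class CarFlag(gym.Env):
    def __init__(self):
        # Markov state: [position, velocity, heaven direction]
        self.state_space = gym.spaces.Box(
            low=np.array([-1.1, -0.07, -1.0], dtype=np.float32),
            high=np.array([1.1, 0.07, 1.0], dtype=np.float32),
        )
        self.position, self.velocity, self.heaven_direction = 0.0, 0.0, 0.0

    def get_state(self):
        # Must match state_space in both shape (3,) and dtype (float32)
        return np.array(
            [self.position, self.velocity, self.heaven_direction],
            dtype=np.float32,
        )
```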
Thank you for the advice. I’ll look into the dimensions of get_state.
Changing the get_state to return just the observation (of size 3) also gives the same error.
def get_state(self):
    # Return the position of the car, oracle, and goal
    return self.state
Yay, it passes the tests now! @smorad could you review the code quality? Lastly, should I add new tests, or would the existing ones suffice?
Great work! Your code is good. I think I have just one more recommendation:
To be in line with the other popgym environments, I think it would be best if the episodic reward (undiscounted return) were always between -1.0 and 1.0. Maybe you could rescale the reward along these lines:
# Roughly the number of steps the agent can spend in heaven: the total step
# budget minus the (approximate) time to drive there at max velocity
max_reward = self.max_steps - self.heaven_position / self.max_velocity
# Rescale so the undiscounted return falls in [-1, 1]
reward = reward / max_reward
As for tests, I would suggest testing that the episode ends after you reach the boundary, and that the episode return is always bounded in [-1, 1]. But it's up to you.
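For instance, something like this (a rough pytest-style sketch assuming the Gymnasium-style reset/step API; the import path and class names are hypothetical, so adapt them to however the existing popgym tests construct envs):

```python
import pytest

from popgym.envs.carflag import CarFlagEasy, CarFlagMedium, CarFlagHard  # hypothetical module path

@pytest.mark.parametrize("env_cls", [CarFlagEasy, CarFlagMedium, CarFlagHard])
def test_carflag_return_bounded(env_cls):
    env = env_cls()
    obs, info = env.reset()
    episode_return, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
        episode_return += reward
        done = terminated or truncated
    # Undiscounted return should stay within [-1, 1] after rescaling
    assert -1.0 <= episode_return <= 1.0
```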