Using a wrapper or mask gives great training but terrible testing
Hello, I have a question about using a gym.ObservationWrapper as a mask for training and testing. My env has many actions, and some of them are unavailable in certain states, so I used a wrapper as in #645. I created a custom ObservationWrapper and used it like this:
train_envs = [lambda: Wrapper(MyEnv(xxx)) for _ in range(10)]
test_envs = [lambda: Wrapper(MyEnv(xxx)) for _ in range(4)]
# then wrap them with SubprocVectorEnv, put them into the Collector,
# and run Rainbow DQN as in the examples.
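Roughly, the wrapper follows the masking pattern from #645: the observation becomes a dict containing the raw observation plus a mask of currently valid actions. A simplified sketch (get_action_mask() here is a hypothetical helper standing in for my env's real logic, not the actual code):

import gym


class Wrapper(gym.ObservationWrapper):
    """Wrap the observation into {"obs": ..., "mask": ...} so the policy can
    ignore unavailable actions (masking pattern from #645)."""

    def observation(self, observation):
        # get_action_mask() is a placeholder: it returns a boolean array of
        # shape (n_actions,), True for actions available in the current state.
        return {"obs": observation, "mask": self.env.get_action_mask()}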
The training results are very satisfying, but the testing results are always terrible (worse than random). I also tried evaluating on the training envs with a hand-coded test function, and that failed as well.
What's more, to rule out a serious problem with my env, I removed the Wrapper and instead gave a negative reward for unavailable actions. That works for both the training and testing envs, but the training result is definitely not as good as with the Wrapper.
I wonder if I'm doing something wrong with the Wrapper. Is there any guidance? Thanks.
Have you tried tuning eps_test? That really matters for the performance of DQN-family policies.
Well, I set it to 0 because my env doesn't allow randomness at test time. But eps does decay during training.
But in my experience, setting it to 0 actually hurts performance, because Q-learning needs some randomness to escape local minima. Did you ever try eps_test==0.01 or 0.001, even though your env doesn't allow randomness in production?
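For reference, this is usually wired through the trainer's train_fn/test_fn hooks, similar to the Atari examples; a sketch with illustrative numbers (the decay horizon and final eps are not from this thread):

def train_fn(epoch, env_step):
    # linearly decay eps from 1.0 to 0.05 over the first 1e6 env steps, then keep 0.05
    eps = max(0.05, 1.0 - env_step / 1e6 * (1.0 - 0.05))
    policy.set_eps(eps)


def test_fn(epoch, env_step):
    # keep a tiny amount of exploration at test time instead of 0
    policy.set_eps(0.01)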
I will try this, thanks.
But I believe this isn't the actual problem, because testing performs reasonably well (though not good enough) without the Wrapper, even with eps_test = 0. With the Wrapper added, the agent always chooses the same action, just like at the beginning of training. That shouldn't happen after thousands of training iterations.
I believe this is a problem with the code. 🙃
Did you ever try eps_test==0.01 or 0.001
I've tried it, and everything behaves the same as with eps_test == 0; it looks like the agent has learned nothing when tested.
I believe this is a problem with the code.
I think so. What does your script look like? I suspect there may be a wrong argument to the policy/collector, since you have already tested the env/wrapper.
Here's my lite script.
import os
import pickle

import torch
from tianshou.data import Collector, PrioritizedVectorReplayBuffer
from tianshou.env import SubprocVectorEnv
from tianshou.policy import RainbowPolicy
from tianshou.trainer import offpolicy_trainer
from tianshou.utils import TensorboardLogger
from tianshou.utils.net.common import Net
from tianshou.utils.net.discrete import NoisyLinear

# config, device, obs_shape, action_shape, writer, resume, restore_dir, epoch,
# train_fn, save_checkpoint_fn, and log_path are defined elsewhere in the full script.

lambda_envs = [lambda: Wrapper(MyEnv(xxx)) for _ in range(10)]
train_envs = SubprocVectorEnv(lambda_envs)
lambda_envs = [lambda: Wrapper(MyEnv(xxx)) for _ in range(4)]
test_envs = SubprocVectorEnv(lambda_envs)

def noisy_linear(x, y):
    return NoisyLinear(x, y, config["noisy-std"])

net = Net(
    obs_shape,
    action_shape,
    hidden_sizes=config["hidden-sizes"],
    device=device,
    softmax=True,
    num_atoms=config["num-atoms"],
    dueling_param=({"linear_layer": noisy_linear}, {"linear_layer": noisy_linear}),
).to(device)
optim = torch.optim.Adam(net.parameters(), lr=config["lr"])
policy = RainbowPolicy(
    net,
    optim,
    config["gamma"],
    config["num-atoms"],
    config["v-min"],
    config["v-max"],
    config["n-step"],
    target_update_freq=config["target-update-freq"],
).to(device)
buf = PrioritizedVectorReplayBuffer(
    config["buffer-size"],
    buffer_num=len(train_envs),
    alpha=config["alpha"],
    beta=config["beta"],
    weight_norm=True,
)
train_collector = Collector(policy, train_envs, buf, exploration_noise=True)
test_collector = Collector(policy, test_envs, exploration_noise=True)
train_collector.collect(n_step=config["batch-size"] * len(train_envs))
logger = TensorboardLogger(writer)

if resume:
    # load from an existing checkpoint
    print(f"Loading agent under {restore_dir}")
    ckpt_path = os.path.join(restore_dir, "checkpoint.pth")
    if os.path.exists(ckpt_path):
        checkpoint = torch.load(ckpt_path, map_location=device)
        policy.load_state_dict(checkpoint["model"])
        policy.optim.load_state_dict(checkpoint["optim"])
        policy.optim.param_groups[0]["capturable"] = True
        print("Successfully restored policy and optim.")
    else:
        print("Failed to restore policy and optim.")
    buffer_path = os.path.join(restore_dir, "train_buffer.pkl")
    if os.path.exists(buffer_path):
        train_collector.buffer = pickle.load(open(buffer_path, "rb"))
        print("Successfully restored buffer.")
    else:
        print("Failed to restore buffer.")

result = offpolicy_trainer(
    policy,
    train_collector,
    test_collector,
    epoch,
    config["step-per-epoch"],
    config["step-per-collect"],
    config["test-count"],
    config["batch-size"],
    update_per_step=config["update-per-step"],
    train_fn=train_fn,
    test_fn=lambda epoch, env_step: policy.set_eps(config["eps-test"]),
    save_best_fn=lambda policy_temp: torch.save(
        policy_temp.state_dict(), os.path.join(log_path, "policy_best.pth")
    ),
    logger=logger,
    resume_from_log=resume,
    save_checkpoint_fn=save_checkpoint_fn,
)
When I test the agent, I run a separate Python file that goes up to the resume part, then I drop the offpolicy_trainer call and run my own test.
Have you played with NoisyLinear? This layer behaves differently in training and testing mode. See https://github.com/thu-ml/tianshou/blob/0f59e38b126f7fb7696b79e53c86cd7b321550cb/tianshou/utils/net/discrete.py#L369-L379
In fact, you can do a sanity check with the following:
policy.train()
print("with policy.training == True", test_collector.collect(n_episode=10))
policy.eval()
print("with policy.training == False", test_collector.collect(n_episode=10))
If the episodic rewards in the two results differ, it is because of the policy's different behavior during training and testing; otherwise, the cause comes from the trainer (eps_train and eps_test).
The problem was indeed with NoisyLinear and my code. I have solved it. Thank you @Trinkle23897.
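For anyone who runs into the same thing: the key point in a hand-rolled test script is to put the policy into eval mode, so NoisyLinear stops adding noise before you collect test episodes. A minimal sketch reusing the setup above (the checkpoint filename and episode count are illustrative, not from this thread):

# Hypothetical standalone evaluation, reusing Wrapper/MyEnv/device from the script above.
policy.load_state_dict(torch.load("policy_best.pth", map_location=device))
policy.eval()        # NoisyLinear (and the rest of the network) now runs deterministically
policy.set_eps(0.0)  # no epsilon-greedy exploration at evaluation time
eval_envs = SubprocVectorEnv([lambda: Wrapper(MyEnv(xxx)) for _ in range(4)])
eval_collector = Collector(policy, eval_envs, exploration_noise=False)
print(eval_collector.collect(n_episode=10))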