
Using a wrapper or mask gives great training results but terrible testing results

Open lsylusiyao opened this issue 2 years ago • 8 comments

Hello, I have a question about using a gym.ObservationWrapper as an action mask for training and testing. My env has many actions, and some of them are unavailable in certain states, so I used a wrapper as in #645. I created a custom ObservationWrapper and used it like this:

train_envs = [lambda: Wrapper(MyEnv(xxx)) for _ in range(10)]
test_envs = [lambda: Wrapper(MyEnv(xxx)) for _ in range(4)]
# then put them with SubprocVectorEnv into Collector
# and run RainbowDQN as in the examples.
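
The Wrapper itself follows the masking idea from #645, roughly like this simplified sketch (not my exact code; legal_actions() stands in for however MyEnv exposes the currently available actions):

import gym
import numpy as np

class Wrapper(gym.ObservationWrapper):
    """Return a dict observation with a "mask" entry marking the currently
    legal actions, so the DQN-family policy can mask out unavailable actions."""

    def observation(self, obs):
        mask = np.zeros(self.action_space.n, dtype=bool)
        mask[self.env.legal_actions()] = True  # legal_actions() is env-specific
        return {"obs": obs, "mask": mask}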

The training result really satisfies me, but the testing result is always terrible (worse than random). I've also tried testing on the training envs with a hand-coded evaluation function, and that failed as well.

What's more, to confirm there's no serious problem with my env itself, I removed the Wrapper and gave a negative reward for unavailable actions instead. That works for both the training and testing envs, but the training result is definitely not as good as with the Wrapper.

I wonder if there's something wrong with how I'm using the Wrapper. Is there any guidance? Thanks.

lsylusiyao avatar Jul 25 '22 13:07 lsylusiyao

Have you tried tuning eps_test? That really matters for the performance of DQN-family policies.

Trinkle23897 avatar Jul 25 '22 17:07 Trinkle23897

Well, I set it to 0 because my env doesn't allow random actions at test time. But the epsilon does decay during training.

lsylusiyao avatar Jul 26 '22 00:07 lsylusiyao

But in my experience, setting it to 0 actually hurts performance, because Q-learning needs some randomness to escape local minima. Did you ever try eps_test == 0.01 or 0.001, even though your env doesn't allow randomness in production?
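
For a quick check you can also set it by hand outside the trainer, e.g. (a minimal sketch, assuming the usual policy/test_collector setup):

policy.set_eps(0.01)  # small but nonzero, instead of exactly 0
result = test_collector.collect(n_episode=10)
print(result["rew"])  # mean episodic reward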

Trinkle23897 avatar Jul 26 '22 00:07 Trinkle23897

I will try this, thanks.

But I believe this isn't the actual problem, because the test performs well (though not well enough) without the Wrapper, even with eps_test = 0. When I add the Wrapper, the agent always picks the same action, just like at the beginning of training. That shouldn't happen after thousands of training steps.

I believe this should be a problem with the code. 🙃

lsylusiyao avatar Jul 26 '22 00:07 lsylusiyao

Did you ever try eps_test==0.01 or 0.001

I've tried that, and everything behaves the same as with eps_test == 0; it looks like the agent has learned nothing when tested.

lsylusiyao avatar Jul 26 '22 06:07 lsylusiyao

I believe this should be a problem with the code.

I think so. What does your script look like? I guess maybe a wrong argument is passed to the policy/collector, since you have already tested the env/wrapper themselves.

Trinkle23897 avatar Jul 27 '22 16:07 Trinkle23897

Here's a simplified version of my script.

lambda_envs = [lambda: Wrapper(MyEnv(xxx)) for _ in range(10)]
train_envs = SubprocVectorEnv(lambda_envs)
lambda_envs = [lambda: Wrapper(MyEnv(xxx)) for _ in range(4)]
test_envs = SubprocVectorEnv(lambda_envs)

def noisy_linear(x, y):
    return NoisyLinear(x, y, config["noisy-std"])

net = Net(
    obs_shape,
    action_shape,
    hidden_sizes=config["hidden-sizes"],
    device=device,
    softmax=True,
    num_atoms=config["num-atoms"],
    dueling_param=({
        "linear_layer": noisy_linear
    }, {
        "linear_layer": noisy_linear
    })
).to(device)
optim = torch.optim.Adam(net.parameters(), lr=config["lr"])
policy = RainbowPolicy(
    net,
    optim,
    config["gamma"],
    config["num-atoms"],
    config["v-min"],
    config["v-max"],
    config["n-step"],
    target_update_freq=config["target-update-freq"]
).to(device)

buf = PrioritizedVectorReplayBuffer(
    config["buffer-size"],
    buffer_num=len(train_envs),
    alpha=config["alpha"],
    beta=config["beta"],
    weight_norm=True
)

train_collector = Collector(policy, train_envs, buf, exploration_noise=True)
test_collector = Collector(policy, test_envs, exploration_noise=True)
train_collector.collect(n_step=config["batch-size"] * len(train_envs))

logger = TensorboardLogger(writer)
if resume:
    # load from existing checkpoint
    print(f"Loading agent under {restore_dir}")
    ckpt_path = os.path.join(restore_dir, 'checkpoint.pth')
    if os.path.exists(ckpt_path):
        checkpoint = torch.load(ckpt_path, map_location=device)
        policy.load_state_dict(checkpoint['model'])
        policy.optim.load_state_dict(checkpoint['optim'])
        policy.optim.param_groups[0]['capturable'] = True
        print("Successfully restored policy and optim.")
    else:
        print("Failed to restore policy and optim.")
    buffer_path = os.path.join(restore_dir, 'train_buffer.pkl')
    if os.path.exists(buffer_path):
        train_collector.buffer = pickle.load(open(buffer_path, "rb"))
        print("Successfully restored buffer.")
    else:
        print("Failed to restore buffer.")

result = offpolicy_trainer(
    policy,
    train_collector,
    test_collector,
    epoch,
    config["step-per-epoch"],
    config["step-per-collect"],
    config["test-count"],
    config["batch-size"],
    update_per_step=config["update-per-step"],
    train_fn=train_fn,
    test_fn=lambda epoch, env_step: policy.set_eps(config["eps-test"]),
    save_best_fn=lambda policy_temp: torch.save(
        policy_temp.state_dict(), os.path.join(log_path, "policy_best.pth")),
    logger=logger,
    resume_from_log=resume,
    save_checkpoint_fn=save_checkpoint_fn
)

When I test the agent, I start another Python file, run it up to the resume part, then drop the offpolicy_trainer call and run my own test.
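
Roughly, instead of the offpolicy_trainer call the test file does something like this (a very simplified sketch; the real test code has more bookkeeping):

# after rebuilding envs/net/policy and loading the checkpoint as above
policy.set_eps(config["eps-test"])  # mirrors the trainer's test_fn
result = test_collector.collect(n_episode=config["test-count"])
print(result)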

lsylusiyao avatar Jul 28 '22 06:07 lsylusiyao

Have you played with NoisyLinear? This layer does not behave the same way during training and testing. See https://github.com/thu-ml/tianshou/blob/0f59e38b126f7fb7696b79e53c86cd7b321550cb/tianshou/utils/net/discrete.py#L369-L379
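
The gist is something like the following simplified paraphrase (names shortened, and the real NoisyLinear uses factorized noise; see the link above for the exact code): noise is added to the weights only while the module is in training mode.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisySketch(nn.Module):
    """Simplified NoisyNet-style linear layer, for illustration only."""

    def __init__(self, in_f, out_f, std=0.5):
        super().__init__()
        self.mu_w = nn.Parameter(torch.zeros(out_f, in_f))
        self.sigma_w = nn.Parameter(torch.full((out_f, in_f), std))
        self.mu_b = nn.Parameter(torch.zeros(out_f))
        self.sigma_b = nn.Parameter(torch.full((out_f,), std))

    def forward(self, x):
        if self.training:  # policy.train(): weights perturbed by sampled noise
            w = self.mu_w + self.sigma_w * torch.randn_like(self.mu_w)
            b = self.mu_b + self.sigma_b * torch.randn_like(self.mu_b)
        else:              # policy.eval(): deterministic mean weights only
            w, b = self.mu_w, self.mu_b
        return F.linear(x, w, b)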

In fact, you can do a sanity check with the following:

policy.train()
print("with policy.training == True", test_collector.collect(n_episode=10))
policy.eval()
print("with policy.training == False", test_collector.collect(n_episode=10))

If the episodic rewards in the two results above differ, it is because of the policy's different behavior during training and testing; otherwise, the reason should come from the trainer (eps_train and eps_test).

Trinkle23897 avatar Aug 01 '22 03:08 Trinkle23897

There were indeed problems with NoisyLinear and my code. I have solved the problem now. Thank you @Trinkle23897.

lsylusiyao avatar Aug 12 '22 12:08 lsylusiyao