[FEATURE] Discrete IQL
I have implemented an IQL variant that supports discrete actions. I have tested it on my local machine and confirmed that it works.
Below is my test code:
import os

from d3rlpy.algos import DiscreteIQLConfig
from d3rlpy.datasets import get_cartpole
from d3rlpy.metrics import EnvironmentEvaluator

os.chdir(os.path.dirname(os.path.abspath(__file__)))


def main():
    # load the offline CartPole dataset and its evaluation environment
    dataset, env = get_cartpole()
    # build and train the new DiscreteIQL algorithm
    iql = DiscreteIQLConfig().create(device="cpu")
    iql.build_with_dataset(dataset)
    iql.fit(
        dataset,
        n_steps=30000,
        evaluators={
            "environment": EnvironmentEvaluator(env),
        },
    )


if __name__ == "__main__":
    main()
I also tested it on the LunarLander environment and found that it surpasses DiscreteCQL when the number of training iterations is small.
from scope_rl.dataset import SyntheticDataset
from scope_rl.policy import EpsilonGreedyHead
from d3rlpy.algos import DoubleDQNConfig
from d3rlpy.dataset import create_fifo_replay_buffer
from d3rlpy.algos import ConstantEpsilonGreedy
import gym
import d3rlpy
import os
os.chdir(os.path.dirname(os.path.abspath(__file__)))
# random state
random_state = 12345
device = "cpu"
# (0) Setup environment
env = gym.make("LunarLander-v2")
eval_env = gym.make("LunarLander-v2")
# (1) Learn a baseline policy in an online environment (using d3rlpy)
# initialize the algorithm
ddqn = DoubleDQNConfig().create(device=device)
# train an online policy
ddqn.fit_online(
    env,
    buffer=create_fifo_replay_buffer(limit=50000, env=env),
    explorer=ConstantEpsilonGreedy(epsilon=0.3),
    n_steps=1000000,
    update_start_step=10000,
    eval_env=eval_env,
    save_interval=100000,
)
ddqn.save('ddqn_LunarLander.d3')
ddqn = d3rlpy.load_learnable('ddqn_LunarLander.d3')
behavior_policy = EpsilonGreedyHead(
    ddqn,
    n_actions=env.action_space.n,
    epsilon=0.3,
    name="ddqn_epsilon_0.3",
    random_state=random_state,
)
# initialize the dataset class
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=600,
)
# the behavior policy collects some logged data
train_logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policy,
    n_trajectories=1000,
    random_state=random_state,
)
from d3rlpy.dataset import MDPDataset
from d3rlpy.algos import DiscreteIQLConfig, DiscreteCQLConfig
from d3rlpy.metrics import EnvironmentEvaluator
# (3) Learning a new policy from offline logged data (using d3rlpy)
# convert the logged dataset into d3rlpy's dataset format
offlinerl_dataset = MDPDataset(
    observations=train_logged_dataset["state"],
    actions=train_logged_dataset["action"],
    rewards=train_logged_dataset["reward"],
    terminals=train_logged_dataset["done"],
)
# initialize the algorithm
cql = DiscreteCQLConfig().create(device=device)
# train an offline policy
cql.fit(
    offlinerl_dataset,
    n_steps=100000,
    save_interval=10000,
    evaluators={
        "environment": EnvironmentEvaluator(env),
    },
)
# initialize and train DiscreteIQL on the same logged dataset
iql = DiscreteIQLConfig().create(device=device)
iql.fit(
    offlinerl_dataset,
    n_steps=100000,
    save_interval=10000,
    evaluators={
        "environment": EnvironmentEvaluator(env),
    },
)
Hi @takuseno, let me first answer your last comment. As you can see from Table 10 in this paper: https://arxiv.org/pdf/2303.15810, Discrete IQL (D-IQL) surpasses Discrete CQL (D-CQL) in 2 out of 3 tasks.
On the other hand, Discrete Sparse Q-Learning (D-SQL) has the best performance in Table 10. Given the similarity between IQL and SQL, I would also be glad to implement SQL in the d3rlpy package.
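For reference, what IQL and SQL share is how the value function is trained: IQL fits V(s) with an expectile regression on Q(s, a) - V(s), while SQL swaps in a sparsity-based value objective. A minimal PyTorch sketch of the expectile loss (the function name is mine, not part of d3rlpy's API):

import torch

def expectile_loss(diff: torch.Tensor, expectile: float = 0.7) -> torch.Tensor:
    # diff = Q(s, a) - V(s); asymmetric squared error that weights
    # positive errors by `expectile` and negative errors by `1 - expectile`
    weight = torch.abs(expectile - (diff < 0.0).float())
    return (weight * diff.pow(2)).mean()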
Finally, I will modify the code soon.
By the way, I believe the implementation of discrete IQL can be further improved. The current implementation uses a stochastic policy that has to be updated; however, this update could actually be avoided, as in DiscreteCQL, to gain higher computational efficiency. I haven't implemented such a faster version because it is more complicated and I don't yet understand the entire software design well enough.
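To illustrate the idea (this is only a rough sketch, not the current implementation, and the helper name is made up): like DiscreteCQL, the greedy action could be read directly from the Q-function at inference time, which would make the separate policy network and its update step unnecessary.

import torch

def greedy_discrete_action(q_func, observation: torch.Tensor) -> torch.Tensor:
    # DQN-style action selection: take the argmax over Q-values
    # instead of sampling from a separately trained softmax policy
    with torch.no_grad():
        q_values = q_func(observation)  # shape: (batch_size, n_actions)
    return q_values.argmax(dim=1)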
https://arxiv.org/pdf/2303.15810, Discrete IQL (D-IQL) surpasses Discrete-CQL (D-CQL) in 2/3 tasks.
Ah, I didn't know that! Thank you for sharing this. Now, I'm happy to include DiscreteIQL (it'd be even nicer if you could add SQL as well :wink: ). I'm looking forward to the fix you're working on. Btw, the format check in CI complains about your change. Could you also try this before you finalize your PR?
pip install -r dev.requirements.txt
./scripts/format
./scripts/lint
Thanks!
By the way, I believe the implementation of discrete IQL can be further improved. The current implementation uses a stochastic policy that has to be updated; however, this update could actually be avoided, as in DiscreteCQL, to gain higher computational efficiency. I haven't implemented such a faster version because it is more complicated and I don't yet understand the entire software design well enough.
Please do not worry about this. If there is a way to optimize your code, I can do that on my side.
I just updated the code following your previous comment. I still have an unresolved problem when I run:
./scripts/lint
It returns many errors:
tests/preprocessing/test_base.py:16: error: Unused "type: ignore" comment [unused-ignore]
tests/dataset/test_trajectory_slicer.py:58: error: Unused "type: ignore" comment [unused-ignore]
tests/dataset/test_trajectory_slicer.py:59: error: Unused "type: ignore" comment [unused-ignore]
tests/dataset/test_trajectory_slicer.py:145: error: Unused "type: ignore" comment [unused-ignore]
tests/dataset/test_trajectory_slicer.py:146: error: Unused "type: ignore" comment [unused-ignore]
tests/dataset/test_mini_batch.py:95: error: Unused "type: ignore" comment [unused-ignore]
tests/dataset/test_mini_batch.py:96: error: Unused "type: ignore" comment [unused-ignore]
tests/dataset/test_mini_batch.py:97: error: Unused "type: ignore" comment [unused-ignore]
tests/dataset/test_mini_batch.py:98: error: Unused "type: ignore" comment [unused-ignore]
d3rlpy/algos/qlearning/torch/ddpg_impl.py:246: error: "ActionOutput" has no attribute "probs" [attr-defined]
tests/algos/qlearning/test_random_policy.py:50: error: Unused "type: ignore" comment [unused-ignore]
tests/algos/qlearning/test_random_policy.py:55: error: Unused "type: ignore" comment [unused-ignore]
tests/envs/test_wrappers.py:29: error: Unused "type: ignore" comment [unused-ignore]
tests/envs/test_wrappers.py:33: error: Unused "type: ignore" comment [unused-ignore]
tests/envs/test_wrappers.py:51: error: Unused "type: ignore" comment [unused-ignore]
tests/envs/test_wrappers.py:55: error: Unused "type: ignore" comment [unused-ignore]
I have already addressed some of them, but it is still not clear to me how to fix this one:
d3rlpy/algos/qlearning/torch/ddpg_impl.py:246: error: "ActionOutput" has no attribute "probs" [attr-defined]
Fixing it would require a lot of changes to my implementation, and I am not sure whether the code would still work after those changes.
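For what it's worth, the only idea I have so far is to give the categorical policy its own output container instead of attaching a probs attribute to the shared ActionOutput class, so that mypy sees probs as a declared field. This is only a hypothetical sketch, not something I have implemented or checked against the existing design:

from dataclasses import dataclass

import torch

@dataclass(frozen=True)
class DiscreteActionOutput:
    # hypothetical output type for a categorical policy's forward pass
    probs: torch.Tensor      # action probabilities, shape (batch_size, n_actions)
    log_probs: torch.Tensor  # log-probabilities for the policy loss / entropy terms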
Besides, I believe errors like the following do not come from my modification, since I haven't touched the test_wrappers.py file:
tests/envs/test_wrappers.py:55: error: Unused "type: ignore" comment [unused-ignore]