Experiment#

Finally, we can assemble the building blocks we have come across in previous tutorials to conduct our first DRL experiment. In this experiment, we will use the PPO algorithm to solve the classic CartPole task in Gymnasium.

Overview#

To conduct this experiment, we need the following building blocks.

  • Two vectorized environments, one for training and one for evaluation

  • A PPO agent

  • A replay buffer to store transition data

  • Two collectors to manage the data collecting process, one for training and one for evaluation

  • A trainer to manage the training loop

Let us do this step by step.

Preparation#

Firstly, install Tianshou if you haven’t installed it before.
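For example, in a notebook environment you could run a cell like the following (this assumes pip is available; adjust it to your own package manager if needed):

%%capture
!pip install tianshou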

Then import the libraries we will need later.

%%capture

import gymnasium as gym
import torch

from tianshou.data import Collector, VectorReplayBuffer
from tianshou.env import DummyVectorEnv
from tianshou.policy import BasePolicy, PPOPolicy
from tianshou.trainer import OnpolicyTrainer
from tianshou.utils.net.common import ActorCritic, Net
from tianshou.utils.net.discrete import Actor, Critic

device = "cuda" if torch.cuda.is_available() else "cpu"

Environment#

We create two vectorized environments, one for training and one for testing. Since a CartPole step executes extremely quickly, there is no need for multi-process wrappers, so we simply use DummyVectorEnv.

env = gym.make("CartPole-v1")
train_envs = DummyVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(20)])
test_envs = DummyVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(10)])
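As a quick, optional sanity check (assuming a recent Tianshou version whose vector environments follow the Gymnasium reset API), resetting the vectorized environments should return a batch of observations, one row per environment:

# Optional sanity check: observations come back batched, one row per environment.
obs, info = train_envs.reset()
print(len(train_envs), obs.shape)  # expected: 20 (20, 4)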

Policy#

Next we need to initialize our PPO policy. PPO is an actor-critic-style on-policy algorithm, so we have to define the actor and the critic in PPO first.

The actor is a neural network that shares the same preprocessing head with the critic. Both networks take the environment observation as input. The actor outputs the logits of the action distribution, while the critic outputs a single value estimating the value of the current state.

Luckily, Tianshou already provides basic network modules that we can use in this experiment.

# net is the shared head of the actor and the critic
assert env.observation_space.shape is not None  # for mypy
assert isinstance(env.action_space, gym.spaces.Discrete)  # for mypy
net = Net(state_shape=env.observation_space.shape, hidden_sizes=[64, 64], device=device)
actor = Actor(preprocess_net=net, action_shape=env.action_space.n, device=device).to(device)
critic = Critic(preprocess_net=net, device=device).to(device)
actor_critic = ActorCritic(actor=actor, critic=critic)

# optimizer of the actor and the critic
optim = torch.optim.Adam(actor_critic.parameters(), lr=0.0003)
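If you are curious, you can verify the parameter sharing with a small, optional check in plain PyTorch: because both modules reuse net, ActorCritic counts the shared parameters only once.

# Optional check: actor and critic both contain the shared `net`,
# but ActorCritic deduplicates the shared parameters.
n_shared = sum(p.numel() for p in net.parameters())
n_actor = sum(p.numel() for p in actor.parameters())
n_critic = sum(p.numel() for p in critic.parameters())
n_total = sum(p.numel() for p in actor_critic.parameters())
print(n_shared, n_actor, n_critic, n_total)
# n_total equals n_actor + n_critic - n_shared, since the shared head is counted once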

Once we have defined the actor, the critic, and the optimizer, we can use them to construct our PPO agent. CartPole is a discrete-action-space problem, so we use a categorical distribution over actions.

dist = torch.distributions.Categorical
policy: BasePolicy
policy = PPOPolicy(
    actor=actor,
    critic=critic,
    optim=optim,
    dist_fn=dist,
    action_space=env.action_space,
    deterministic_eval=True,
    action_scaling=False,
)

deterministic_eval=True means that actions are sampled during training, but the most probable action is always chosen during evaluation, so no randomness is involved in the evaluation results. Since CartPole has a discrete action space, action scaling is disabled.
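To make the distinction concrete, here is a small illustration in plain PyTorch (not the Tianshou API) of stochastic sampling versus deterministic action selection for a categorical distribution:

# Conceptual illustration only: sampling vs. taking the most probable action.
logits = torch.tensor([0.1, 0.7, 0.2])
dist_example = torch.distributions.Categorical(logits=logits)
sampled_action = dist_example.sample()  # stochastic, used during training
greedy_action = logits.argmax()  # deterministic, used in evaluation when deterministic_eval=True
print(sampled_action.item(), greedy_action.item())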

Collector#

We can set up the collectors now. The train collector is used to collect and store training data, so an additional replay buffer has to be passed in.

train_collector = Collector(
    policy=policy,
    env=train_envs,
    buffer=VectorReplayBuffer(20000, len(train_envs)),
)
test_collector = Collector(policy=policy, env=test_envs)

We use VectorReplayBuffer here because it works more efficiently with vectorized environments. You can simply think of a VectorReplayBuffer as a list of ordinary replay buffers, one per environment.
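For illustration (the attribute names below are an assumption and may differ between Tianshou versions), the total capacity is split evenly across the sub-buffers, so each of the 20 training environments gets a 1000-slot buffer:

# Illustration: a 20000-slot VectorReplayBuffer split across 20 environments.
buf = VectorReplayBuffer(20000, 20)
print(buf.maxsize, buf.buffer_num)  # expected: 20000 20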

Trainer#

Finally, we can use the trainer to help us set up the training loop.

result = OnpolicyTrainer(
    policy=policy,
    train_collector=train_collector,
    test_collector=test_collector,
    max_epoch=10,  # maximum number of training epochs
    step_per_epoch=50000,  # number of transitions collected per epoch
    repeat_per_collect=10,  # number of times each collected batch is reused for updates
    episode_per_test=10,  # number of episodes per evaluation round
    batch_size=256,  # minibatch size for gradient updates
    step_per_collect=2000,  # number of transitions collected before each policy update
    stop_fn=lambda mean_reward: mean_reward >= 195,  # stop early once the mean test reward reaches 195
).run()

Results#

Print the training result.

result.pprint_asdict()
InfoStats
----------------------------------------
{'best_reward': 197.9,
 'best_reward_std': 9.554580053565934,
 'gradient_step': 160,
 'test_episode': 20,
 'test_step': 2072,
 'timing': {'test_time': 0.35474157333374023,
            'total_time': 10.61071491241455,
            'train_time': 10.25597333908081,
            'train_time_collect': 5.143781900405884,
            'train_time_update': 5.052731037139893,
            'update_speed': 4095.174452136811},
 'train_episode': 965,
 'train_step': 42000}

We can also test our trained agent.

# Let's watch its performance!
policy.eval()
result = test_collector.collect(n_episode=1, render=False)
print(f"Final episode reward: {result.returns.mean()}, length: {result.lens.mean()}")
Final episode reward: 206.0, length: 206.0
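The render flag above is disabled so that the notebook can run headlessly. If you have a display available, a sketch like the following lets you watch the trained agent (assuming the Gymnasium render_mode="human" API and that the collector's render argument is the sleep time between frames):

# Optional: watch the trained agent in a window (requires a display).
watch_env = DummyVectorEnv([lambda: gym.make("CartPole-v1", render_mode="human")])
watch_collector = Collector(policy=policy, env=watch_env)
watch_collector.collect(n_episode=1, render=1 / 35)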