tianshou.policy

class tianshou.policy.A2CPolicy(actor: torch.nn.modules.module.Module, critic: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, dist_fn: Callable[[], torch.distributions.distribution.Distribution], discount_factor: float = 0.99, vf_coef: float = 0.5, ent_coef: float = 0.01, max_grad_norm: Optional[float] = None, gae_lambda: float = 0.95, reward_normalization: bool = False, max_batchsize: int = 256, **kwargs: Any)[source]

Bases: tianshou.policy.modelfree.pg.PGPolicy

Implementation of Synchronous Advantage Actor-Critic. arXiv:1602.01783.

Parameters
  • actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)

  • critic (torch.nn.Module) – the critic network. (s -> V(s))

  • optim (torch.optim.Optimizer) – the optimizer for actor and critic network.

  • dist_fn (Callable[[], torch.distributions.Distribution]) – distribution class for computing the action.

  • discount_factor (float) – in [0, 1], defaults to 0.99.

  • vf_coef (float) – weight for value loss, defaults to 0.5.

  • ent_coef (float) – weight for entropy loss, defaults to 0.01.

  • max_grad_norm (float) – clipping gradients in back propagation, defaults to None.

  • gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation, defaults to 0.95.

  • reward_normalization (bool) – normalize the reward to Normal(0, 1), defaults to False.

  • max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint; defaults to 256.

See also

Please refer to BasePolicy for more detailed explanation.
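
The following is a hedged construction sketch, not part of the documented API: the Actor/Critic classes, sizes, and hyperparameters are illustrative assumptions. The actor follows the (s -> logits) rule and the critic the (s -> V(s)) rule described above.

import torch
from torch import nn
from tianshou.policy import A2CPolicy


class Actor(nn.Module):
    """Illustrative actor: obs -> (logits, state), as required by BasePolicy."""
    def __init__(self, obs_dim=4, act_dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))

    def forward(self, obs, state=None, info={}):
        obs = torch.as_tensor(obs, dtype=torch.float32)
        return self.net(obs), state


class Critic(nn.Module):
    """Illustrative critic: obs -> V(s)."""
    def __init__(self, obs_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, obs, **kwargs):
        return self.net(torch.as_tensor(obs, dtype=torch.float32))


actor, critic = Actor(), Critic()
# A2C shares a single optimizer over both actor and critic parameters.
optim = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)
# dist_fn turns the actor's raw logits into an action distribution.
policy = A2CPolicy(actor, critic, optim,
                   dist_fn=lambda logits: torch.distributions.Categorical(logits=logits),
                   discount_factor=0.99, gae_lambda=0.95)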

forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, **kwargs: Any) → tianshou.data.batch.Batch[source]

Compute action over the given batch data.

Returns

A Batch which has 4 keys:

  • act the action.

  • logits the network’s raw output.

  • dist the action distribution.

  • state the hidden state.

See also

Please refer to forward() for more detailed explanation.

learn(batch: tianshou.data.batch.Batch, batch_size: int, repeat: int, **kwargs: Any) → Dict[str, List[float]][source]

Update policy with a given batch of data.

Returns

A dict which includes loss and its corresponding label.

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

process_fn(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.ReplayBuffer, indice: numpy.ndarray) → tianshou.data.batch.Batch[source]

Compute the discounted returns for each frame.

\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]

where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).

training: bool
class tianshou.policy.BasePolicy(observation_space: Optional[gym.spaces.space.Space] = None, action_space: Optional[gym.spaces.space.Space] = None)[source]

Bases: abc.ABC, torch.nn.modules.module.Module

The base class for any RL policy.

Tianshou aims to modularize RL algorithms. They are therefore split into several policy classes in Tianshou, all of which must inherit from BasePolicy.

A policy class typically has four parts:

  • __init__(): initialize the policy, including copying the target network and so on;

  • forward(): compute action with given observation;

  • process_fn(): pre-process data from the replay buffer (this function can interact with replay buffer);

  • learn(): update policy with a given batch of data.

Most policies need a neural network to predict the action and an optimizer to optimize the policy. The rules for self-defined networks are:

  1. Input: observation “obs” (may be a numpy.ndarray, a torch.Tensor, a dict, or anything else), hidden state “state” (for RNN usage), and other information “info” provided by the environment.

  2. Output: some “logits”, the next hidden state “state”, and “policy”, the intermediate result produced during the policy forwarding procedure. The “logits” could be a tuple instead of a torch.Tensor, depending on how the policy processes the network output. For example, in PPO a Gaussian policy network might return (mu, sigma), state. The “policy” can be a Batch of torch.Tensor or other data; it will be stored in the replay buffer and can be accessed during the policy update (e.g. in “policy.learn()”, “batch.policy” is what you need). A minimal example network is sketched below.
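
A minimal sketch of a self-defined network that follows these rules, under the assumption of a flat observation vector (the class name and sizes are illustrative):

import torch
from torch import nn


class MyNet(nn.Module):
    """Maps obs -> (logits, next hidden state), as required by rule 2."""

    def __init__(self, obs_dim=4, act_dim=2):
        super().__init__()
        self.model = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, act_dim))

    def forward(self, obs, state=None, info={}):
        # obs may arrive from the collector as a numpy array.
        obs = torch.as_tensor(obs, dtype=torch.float32)
        logits = self.model(obs)
        # No RNN here, so the hidden state is simply passed through.
        return logits, state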

Since BasePolicy inherits from torch.nn.Module, you can use it in much the same way as torch.nn.Module, for instance, to load and save the model:

torch.save(policy.state_dict(), "policy.pth")
policy.load_state_dict(torch.load("policy.pth"))
static compute_episodic_return(batch: tianshou.data.batch.Batch, v_s_: Optional[Union[numpy.ndarray, torch.Tensor]] = None, gamma: float = 0.99, gae_lambda: float = 0.95, rew_norm: bool = False) → tianshou.data.batch.Batch[source]

Compute returns over given full-length episodes.

Implementation of Generalized Advantage Estimator (arXiv:1506.02438).

Parameters
  • batch (Batch) – a data batch which contains several full-episode data chronologically.

  • v_s_ (numpy.ndarray) – the value function of all next states \(V(s')\).

  • gamma (float) – the discount factor, should be in [0, 1], defaults to 0.99.

  • gae_lambda (float) – the parameter for Generalized Advantage Estimation, should be in [0, 1], defaults to 0.95.

  • rew_norm (bool) – normalize the reward to Normal(0, 1), defaults to False.

Returns

a Batch. The result will be stored in batch.returns as a numpy array with shape (bsz, ).
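
A hedged usage sketch: it assumes the batch carries rew and done arrays of equal length, and the reward, done, and value numbers are made up for illustration.

import numpy as np
from tianshou.data import Batch
from tianshou.policy import BasePolicy

# One three-step episode followed by the first step of the next episode.
batch = Batch(rew=np.array([1.0, 1.0, 1.0, 1.0]),
              done=np.array([False, False, True, False]))
# v_s_: value estimates of the next states, e.g. produced by a critic network.
v_s_ = np.zeros(4)
batch = BasePolicy.compute_episodic_return(batch, v_s_=v_s_,
                                           gamma=0.99, gae_lambda=0.95)
print(batch.returns)  # GAE-based returns with shape (4,)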

static compute_nstep_return(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.ReplayBuffer, indice: numpy.ndarray, target_q_fn: Callable[[tianshou.data.buffer.ReplayBuffer, numpy.ndarray], torch.Tensor], gamma: float = 0.99, n_step: int = 1, rew_norm: bool = False) → tianshou.data.batch.Batch[source]

Compute n-step return for Q-learning targets.

\[G_t = \sum_{i = t}^{t + n - 1} \gamma^{i - t}(1 - d_i)r_i + \gamma^n (1 - d_{t + n}) Q_{\mathrm{target}}(s_{t + n})\]

where \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\), \(d_t\) is the done flag of step \(t\).

Parameters
  • batch (Batch) – a data batch, which is equal to buffer[indice].

  • buffer (ReplayBuffer) – a data buffer which contains several full-episode data chronologically.

  • indice (numpy.ndarray) – the sampled time steps (indices into the buffer).

  • target_q_fn (function) – a function that receives the data of step \(t+n-1\) and computes the target Q value.

  • gamma (float) – the discount factor, should be in [0, 1], defaults to 0.99.

  • n_step (int) – the number of estimation steps, should be an int greater than 0, defaults to 1.

  • rew_norm (bool) – normalize the reward to Normal(0, 1), defaults to False.

Returns

a Batch. The result will be stored in batch.returns as a torch.Tensor with the same shape as target_q_fn’s return tensor.
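
A hedged sketch of how a Q-learning style policy might call this inside its process_fn; the target_q_fn below is a hypothetical helper, and the model is assumed to follow the (obs, state, info) -> (Q-values, state) convention from BasePolicy.

import numpy as np
import torch
from tianshou.data import Batch, ReplayBuffer
from tianshou.policy import BasePolicy


def process_fn(policy, batch: Batch, buffer: ReplayBuffer, indice: np.ndarray) -> Batch:
    def target_q_fn(buf: ReplayBuffer, idx: np.ndarray) -> torch.Tensor:
        # Evaluate the (target) network on the next observations stored at the
        # given buffer indices and take the greedy Q value.
        obs_next = torch.as_tensor(buf[idx].obs_next, dtype=torch.float32)
        q, _ = policy.model(obs_next)   # assumed (Q-values, state) output
        return q.max(dim=1)[0]

    # Store the 3-step return in batch.returns, as described above.
    return BasePolicy.compute_nstep_return(batch, buffer, indice, target_q_fn,
                                           gamma=0.99, n_step=3)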

abstract forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, **kwargs: Any) → tianshou.data.batch.Batch[source]

Compute action over the given batch data.

Returns

A Batch which MUST have the following keys:

  • act a numpy.ndarray or a torch.Tensor, the action over the given batch data.

  • state a dict, a numpy.ndarray or a torch.Tensor, the internal state of the policy, None by default.

Other keys are user-defined and depend on the algorithm. For example,

# some code
return Batch(logits=..., act=..., state=None, dist=...)

The keyword policy is reserved and the corresponding data will be stored into the replay buffer. For instance,

# some code
return Batch(..., policy=Batch(log_prob=dist.log_prob(act)))
# and in the sampled data batch, you can directly use
# batch.policy.log_prob to get your data.
abstract learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Mapping[str, Union[float, List[float]]][source]

Update policy with a given batch of data.

Returns

A dict which includes loss and its corresponding label.

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

post_process_fn(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.ReplayBuffer, indice: numpy.ndarray) → None[source]

Post-process the data from the provided replay buffer.

Typical usage is to update the sampling weight in prioritized experience replay. Used in update().

process_fn(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.ReplayBuffer, indice: numpy.ndarray) → tianshou.data.batch.Batch[source]

Pre-process the data from the provided replay buffer.

Used in update(). Check out policy.process_fn for more information.

set_agent_id(agent_id: int) → None[source]

Set self.agent_id = agent_id, for MARL.

training: bool
update(sample_size: int, buffer: Optional[tianshou.data.buffer.ReplayBuffer], **kwargs: Any) → Mapping[str, Union[float, List[float]]][source]

Update the policy network and replay buffer.

It consists of three steps: process_fn, learn, and post_process_fn. In addition, this function changes the value of self.updating: it is False outside of update() and True while update() is executing. Please refer to States for policy for more detailed explanation.

Parameters
  • sample_size (int) – 0 means it will extract all the data from the buffer, otherwise it will sample a batch with given sample_size.

  • buffer (ReplayBuffer) – the corresponding replay buffer.
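
For context, a hedged sketch of a typical off-policy training step; the Collector, buffer, and numbers are illustrative assumptions, and on-policy policies would instead forward extra keyword arguments such as batch_size and repeat to learn().

# `policy` and `train_collector` (with its replay buffer) are assumed to exist.
train_collector.collect(n_step=100)                  # gather new transitions
losses = policy.update(64, train_collector.buffer)   # sample 64 transitions, then
                                                     # process_fn -> learn -> post_process_fn
print(losses)                                        # e.g. {"loss": ...}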

class tianshou.policy.C51Policy(model: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, num_atoms: int = 51, v_min: float = -10.0, v_max: float = 10.0, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)[source]

Bases: tianshou.policy.modelfree.dqn.DQNPolicy

Implementation of Categorical Deep Q-Network. arXiv:1707.06887.

Parameters
  • model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)

  • optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.

  • discount_factor (float) – in [0, 1].

  • num_atoms (int) – the number of atoms in the support set of the value distribution, defaults to 51.

  • v_min (float) – the value of the smallest atom in the support set, defaults to -10.0.

  • v_max (float) – the value of the largest atom in the support set, defaults to 10.0.

  • estimation_step (int) – the number of steps to look ahead (n-step return), defaults to 1.

  • target_update_freq (int) – the target network update frequency (0 if you do not use the target network).

  • reward_normalization (bool) – normalize the reward to Normal(0, 1), defaults to False.

See also

Please refer to DQNPolicy for more detailed explanation.

compute_q_value(logits: torch.Tensor) → torch.Tensor[source]

Compute the q value based on the network’s raw output logits.

learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict which includes loss and its corresponding label.

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

training: bool
class tianshou.policy.DDPGPolicy(actor: Optional[torch.nn.modules.module.Module], actor_optim: Optional[torch.optim.optimizer.Optimizer], critic: Optional[torch.nn.modules.module.Module], critic_optim: Optional[torch.optim.optimizer.Optimizer], action_range: Tuple[float, float], tau: float = 0.005, gamma: float = 0.99, exploration_noise: Optional[tianshou.exploration.random.BaseNoise] = <tianshou.exploration.random.GaussianNoise object>, reward_normalization: bool = False, ignore_done: bool = False, estimation_step: int = 1, **kwargs: Any)[source]

Bases: tianshou.policy.base.BasePolicy

Implementation of Deep Deterministic Policy Gradient. arXiv:1509.02971.

Parameters
  • actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)

  • actor_optim (torch.optim.Optimizer) – the optimizer for actor network.

  • critic (torch.nn.Module) – the critic network. (s, a -> Q(s, a))

  • critic_optim (torch.optim.Optimizer) – the optimizer for critic network.

  • action_range (Tuple[float, float]) – the action range (minimum, maximum).

  • tau (float) – param for soft update of the target network, defaults to 0.005.

  • gamma (float) – discount factor, in [0, 1], defaults to 0.99.

  • exploration_noise (BaseNoise) – the exploration noise added to the action, defaults to GaussianNoise(sigma=0.1).

  • reward_normalization (bool) – normalize the reward to Normal(0, 1), defaults to False.

  • ignore_done (bool) – ignore the done flag while training the policy, defaults to False.

  • estimation_step (int) – the number of steps to look ahead (n-step return), defaults to 1.

See also

Please refer to BasePolicy for more detailed explanation.
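
A hedged construction sketch; the network classes, sizes, and noise level below are illustrative assumptions. The actor follows the (s -> action) rule and the critic the (s, a -> Q(s, a)) rule described above.

import torch
from torch import nn
from tianshou.policy import DDPGPolicy
from tianshou.exploration import GaussianNoise

obs_dim, act_dim, max_action = 3, 1, 2.0   # illustrative Pendulum-like sizes


class Actor(nn.Module):
    """obs -> (action, state); tanh keeps the action inside the range."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim), nn.Tanh())

    def forward(self, obs, state=None, info={}):
        obs = torch.as_tensor(obs, dtype=torch.float32)
        return max_action * self.net(obs), state


class Critic(nn.Module):
    """(obs, act) -> Q(s, a)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, obs, act, **kwargs):
        obs = torch.as_tensor(obs, dtype=torch.float32)
        act = torch.as_tensor(act, dtype=torch.float32)
        return self.net(torch.cat([obs, act], dim=1))


actor, critic = Actor(), Critic()
policy = DDPGPolicy(
    actor, torch.optim.Adam(actor.parameters(), lr=1e-3),
    critic, torch.optim.Adam(critic.parameters(), lr=1e-3),
    action_range=(-max_action, max_action),
    exploration_noise=GaussianNoise(sigma=0.1),
)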

forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, model: str = 'actor', input: str = 'obs', **kwargs: Any) → tianshou.data.batch.Batch[source]

Compute action over the given batch data.

Returns

A Batch which has 2 keys:

  • act the action.

  • state the hidden state.

See also

Please refer to forward() for more detailed explanation.

learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict which includes loss and its corresponding label.

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

process_fn(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.ReplayBuffer, indice: numpy.ndarray) → tianshou.data.batch.Batch[source]

Pre-process the data from the provided replay buffer.

Used in update(). Check out policy.process_fn for more information.

set_exp_noise(noise: Optional[tianshou.exploration.random.BaseNoise]) → None[source]

Set the exploration noise.

sync_weight() → None[source]

Soft-update the weight for the target network.

train(mode: bool = True) → tianshou.policy.modelfree.ddpg.DDPGPolicy[source]

Set the module in training mode, except for the target network.

training: bool
class tianshou.policy.DQNPolicy(model: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)[source]

Bases: tianshou.policy.base.BasePolicy

Implementation of Deep Q Network. arXiv:1312.5602.

Implementation of Double Q-Learning. arXiv:1509.06461.

Implementation of Dueling DQN. arXiv:1511.06581 (the dueling DQN is implemented in the network side, not here).

Parameters
  • model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)

  • optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.

  • discount_factor (float) – in [0, 1].

  • estimation_step (int) – the number of steps to look ahead (n-step return), defaults to 1.

  • target_update_freq (int) – the target network update frequency (0 if you do not use the target network).

  • reward_normalization (bool) – normalize the reward to Normal(0, 1), defaults to False.

See also

Please refer to BasePolicy for more detailed explanation.
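
A hedged construction sketch; QNet, its sizes, and the hyperparameters are illustrative assumptions. The network outputs one Q value per action, which plays the role of the “logits” above.

import torch
from torch import nn
from tianshou.policy import DQNPolicy


class QNet(nn.Module):
    """obs -> (Q-values over actions, state)."""
    def __init__(self, obs_dim=4, act_dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, act_dim))

    def forward(self, obs, state=None, info={}):
        obs = torch.as_tensor(obs, dtype=torch.float32)
        return self.net(obs), state


model = QNet()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
policy = DQNPolicy(model, optim, discount_factor=0.99,
                   estimation_step=3, target_update_freq=320)
policy.set_eps(0.1)   # epsilon-greedy exploration while collecting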

compute_q_value(logits: torch.Tensor) → torch.Tensor[source]

Compute the q value based on the network’s raw output logits.

forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, model: str = 'model', input: str = 'obs', **kwargs: Any) → tianshou.data.batch.Batch[source]

Compute action over the given batch data.

If you need to mask actions, please add a “mask” into batch.obs; for example, if we have an environment with the three actions “0/1/2”:

batch == Batch(
    obs=Batch(
        obs="original obs, with batch_size=1 for demonstration",
        mask=np.array([[False, True, False]]),
        # action 1 is available
        # action 0 and 2 are unavailable
    ),
    ...
)
Parameters

eps (float) – in [0, 1], for epsilon-greedy exploration method.

Returns

A Batch which has 3 keys:

  • act the action.

  • logits the network’s raw output.

  • state the hidden state.

See also

Please refer to forward() for more detailed explanation.

learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict which includes loss and its corresponding label.

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

process_fn(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.ReplayBuffer, indice: numpy.ndarray) → tianshou.data.batch.Batch[source]

Compute the n-step return for Q-learning targets.

More details can be found at compute_nstep_return().

set_eps(eps: float) → None[source]

Set the eps for epsilon-greedy exploration.

sync_weight() → None[source]

Synchronize the weight for the target network.

train(mode: bool = True) → tianshou.policy.modelfree.dqn.DQNPolicy[source]

Set the module in training mode, except for the target network.

training: bool
class tianshou.policy.DiscreteBCQPolicy(model: torch.nn.modules.module.Module, imitator: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, estimation_step: int = 1, target_update_freq: int = 8000, eval_eps: float = 0.001, unlikely_action_threshold: float = 0.3, imitation_logits_penalty: float = 0.01, reward_normalization: bool = False, **kwargs: Any)[source]

Bases: tianshou.policy.modelfree.dqn.DQNPolicy

Implementation of discrete BCQ algorithm. arXiv:1910.01708.

Parameters
  • model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> q_value)

  • imitator (torch.nn.Module) – a model following the rules in BasePolicy. (s -> imitation_logits)

  • optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.

  • discount_factor (float) – in [0, 1].

  • estimation_step (int) – the number of steps to look ahead (n-step return), defaults to 1.

  • target_update_freq (int) – the target network update frequency.

  • eval_eps (float) – the epsilon-greedy noise added in evaluation.

  • unlikely_action_threshold (float) – the threshold (tau) for unlikely actions, as shown in Equ. (17) in the paper, defaults to 0.3.

  • imitation_logits_penalty (float) – regularization weight for imitation logits, defaults to 1e-2.

  • reward_normalization (bool) – normalize the reward to Normal(0, 1), defaults to False.

See also

Please refer to BasePolicy for more detailed explanation.

forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, input: str = 'obs', eps: Optional[float] = None, **kwargs: Any) → tianshou.data.batch.Batch[source]

Compute action over the given batch data.

If you need to mask actions, please add a “mask” into batch.obs; for example, if we have an environment with the three actions “0/1/2”:

batch == Batch(
    obs=Batch(
        obs="original obs, with batch_size=1 for demonstration",
        mask=np.array([[False, True, False]]),
        # action 1 is available
        # action 0 and 2 are unavailable
    ),
    ...
)
Parameters

eps (float) – in [0, 1], for epsilon-greedy exploration method.

Returns

A Batch which has 3 keys:

  • act the action.

  • logits the network’s raw output.

  • state the hidden state.

See also

Please refer to forward() for more detailed explanation.

learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict which includes loss and its corresponding label.

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

train(mode: bool = True) → tianshou.policy.imitation.discrete_bcq.DiscreteBCQPolicy[source]

Set the module in training mode, except for the target network.

training: bool
class tianshou.policy.DiscreteSACPolicy(actor: torch.nn.modules.module.Module, actor_optim: torch.optim.optimizer.Optimizer, critic1: torch.nn.modules.module.Module, critic1_optim: torch.optim.optimizer.Optimizer, critic2: torch.nn.modules.module.Module, critic2_optim: torch.optim.optimizer.Optimizer, tau: float = 0.005, gamma: float = 0.99, alpha: Union[float, Tuple[float, torch.Tensor, torch.optim.optimizer.Optimizer]] = 0.2, reward_normalization: bool = False, ignore_done: bool = False, estimation_step: int = 1, **kwargs: Any)[source]

Bases: tianshou.policy.modelfree.sac.SACPolicy

Implementation of SAC for Discrete Action Settings. arXiv:1910.07207.

Parameters
  • actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)

  • actor_optim (torch.optim.Optimizer) – the optimizer for actor network.

  • critic1 (torch.nn.Module) – the first critic network. (s -> Q(s))

  • critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.

  • critic2 (torch.nn.Module) – the second critic network. (s -> Q(s))

  • critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.

  • tau (float) – param for soft update of the target network, defaults to 0.005.

  • gamma (float) – discount factor, in [0, 1], defaults to 0.99.

  • alpha (float or Tuple[float, torch.Tensor, torch.optim.Optimizer]) – entropy regularization coefficient, defaults to 0.2. If a tuple (target_entropy, log_alpha, alpha_optim) is provided, then alpha is automatically tuned.

  • reward_normalization (bool) – normalize the reward to Normal(0, 1), defaults to False.

  • ignore_done (bool) – ignore the done flag while training the policy, defaults to False.

See also

Please refer to BasePolicy for more detailed explanation.

actor: torch.nn.modules.module.Module
actor_optim: torch.optim.optimizer.Optimizer
critic: torch.nn.modules.module.Module
critic_optim: torch.optim.optimizer.Optimizer
forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, input: str = 'obs', **kwargs: Any) → tianshou.data.batch.Batch[source]

Compute action over the given batch data.

Returns

A Batch which has 2 keys:

  • act the action.

  • state the hidden state.

See also

Please refer to forward() for more detailed explanation.

learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict which includes loss and its corresponding label.

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

training: bool
class tianshou.policy.ImitationPolicy(model: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, mode: str = 'continuous', **kwargs: Any)[source]

Bases: tianshou.policy.base.BasePolicy

Implementation of vanilla imitation learning.

Parameters
  • model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> a)

  • optim (torch.optim.Optimizer) – for optimizing the model.

  • mode (str) – indicate the imitation type (“continuous” or “discrete” action space), defaults to “continuous”.

See also

Please refer to BasePolicy for more detailed explanation.

forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, **kwargs: Any) → tianshou.data.batch.Batch[source]

Compute action over the given batch data.

Returns

A Batch which MUST have the following keys:

  • act a numpy.ndarray or a torch.Tensor, the action over the given batch data.

  • state a dict, a numpy.ndarray or a torch.Tensor, the internal state of the policy, None by default.

Other keys are user-defined and depend on the algorithm. For example,

# some code
return Batch(logits=..., act=..., state=None, dist=...)

The keyword policy is reserved and the corresponding data will be stored into the replay buffer. For instance,

# some code
return Batch(..., policy=Batch(log_prob=dist.log_prob(act)))
# and in the sampled data batch, you can directly use
# batch.policy.log_prob to get your data.
learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict which includes loss and its corresponding label.

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

training: bool
class tianshou.policy.MultiAgentPolicyManager(policies: List[tianshou.policy.base.BasePolicy], **kwargs: Any)[source]

Bases: tianshou.policy.base.BasePolicy

Multi-agent policy manager for MARL.

This multi-agent policy manager accepts a list of BasePolicy. It dispatches the batch data to each of these policies when the “forward” is called. The same as “process_fn” and “learn”: it splits the data and feeds them to each policy. A figure in Multi-Agent Reinforcement Learning can help you better understand this procedure.
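
A hedged sketch with two placeholder agents; in practice at least one entry would be a trainable policy such as DQNPolicy, and the list order is assumed to correspond to the agent ids used by the multi-agent environment wrapper.

from tianshou.policy import MultiAgentPolicyManager, RandomPolicy

manager = MultiAgentPolicyManager([RandomPolicy(), RandomPolicy()])
# Later, a trained policy could be swapped in for a given agent id, e.g.:
# manager.replace_policy(trained_policy, agent_id=1)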

forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch]] = None, **kwargs: Any) → tianshou.data.batch.Batch[source]

Dispatch batch data from obs.agent_id to every policy’s forward.

Parameters

state – if None, it means all agents have no state. If not None, it should contain keys of “agent_1”, “agent_2”, …

Returns

a Batch with the following contents:

{
    "act": actions corresponding to the input
    "state": {
        "agent_1": output state of agent_1's policy for the state
        "agent_2": xxx
        ...
        "agent_n": xxx}
    "out": {
        "agent_1": output of agent_1's policy for the input
        "agent_2": xxx
        ...
        "agent_n": xxx}
}
learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, Union[float, List[float]]][source]

Dispatch the data to all policies for learning.

Returns

a dict with the following contents:

{
    "agent_1/item1": item 1 of agent_1's policy.learn output
    "agent_1/item2": item 2 of agent_1's policy.learn output
    "agent_2/xxx": xxx
    ...
    "agent_n/xxx": xxx
}
process_fn(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.ReplayBuffer, indice: numpy.ndarray) → tianshou.data.batch.Batch[source]

Dispatch batch data from obs.agent_id to every policy’s process_fn.

Save original multi-dimensional rew in “save_rew”, set rew to the reward of each agent during their “process_fn”, and restore the original reward afterwards.

replace_policy(policy: tianshou.policy.base.BasePolicy, agent_id: int) → None[source]

Replace the “agent_id”th policy in this manager.

training: bool
class tianshou.policy.PGPolicy(model: Optional[torch.nn.modules.module.Module], optim: torch.optim.optimizer.Optimizer, dist_fn: Callable[[], torch.distributions.distribution.Distribution], discount_factor: float = 0.99, reward_normalization: bool = False, **kwargs: Any)[source]

Bases: tianshou.policy.base.BasePolicy

Implementation of Vanilla Policy Gradient.

Parameters
  • model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)

  • optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.

  • dist_fn (Callable[[], torch.distributions.Distribution]) – distribution class for computing the action.

  • discount_factor (float) – in [0, 1].

See also

Please refer to BasePolicy for more detailed explanation.

forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, **kwargs: Any) → tianshou.data.batch.Batch[source]

Compute action over the given batch data.

Returns

A Batch which has 4 keys:

  • act the action.

  • logits the network’s raw output.

  • dist the action distribution.

  • state the hidden state.

See also

Please refer to forward() for more detailed explanation.

learn(batch: tianshou.data.batch.Batch, batch_size: int, repeat: int, **kwargs: Any) → Dict[str, List[float]][source]

Update policy with a given batch of data.

Returns

A dict which includes loss and its corresponding label.

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

process_fn(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.ReplayBuffer, indice: numpy.ndarray) → tianshou.data.batch.Batch[source]

Compute the discounted returns for each frame.

\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]

where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).

training: bool
class tianshou.policy.PPOPolicy(actor: torch.nn.modules.module.Module, critic: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, dist_fn: Callable[[], torch.distributions.distribution.Distribution], discount_factor: float = 0.99, max_grad_norm: Optional[float] = None, eps_clip: float = 0.2, vf_coef: float = 0.5, ent_coef: float = 0.01, action_range: Optional[Tuple[float, float]] = None, gae_lambda: float = 0.95, dual_clip: Optional[float] = None, value_clip: bool = True, reward_normalization: bool = True, max_batchsize: int = 256, **kwargs: Any)[source]

Bases: tianshou.policy.modelfree.pg.PGPolicy

Implementation of Proximal Policy Optimization. arXiv:1707.06347.

Parameters
  • actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)

  • critic (torch.nn.Module) – the critic network. (s -> V(s))

  • optim (torch.optim.Optimizer) – the optimizer for actor and critic network.

  • dist_fn (Callable[[], torch.distributions.Distribution]) – distribution class for computing the action.

  • discount_factor (float) – in [0, 1], defaults to 0.99.

  • max_grad_norm (float) – clipping gradients in back propagation, defaults to None.

  • eps_clip (float) – \(\epsilon\) in \(L_{CLIP}\) in the original paper, defaults to 0.2.

  • vf_coef (float) – weight for value loss, defaults to 0.5.

  • ent_coef (float) – weight for entropy loss, defaults to 0.01.

  • action_range ((float, float)) – the action range (minimum, maximum).

  • gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation, defaults to 0.95.

  • dual_clip (float) – a parameter c mentioned in arXiv:1912.09729 Eq. 5, where c > 1 is a constant indicating the lower bound; defaults to None (no dual clipping), set to some float c > 1 (e.g. 5.0) to use it.

  • value_clip (bool) – a parameter mentioned in arXiv:1811.02553 Sec. 4.1, defaults to True.

  • reward_normalization (bool) – normalize the returns to Normal(0, 1), defaults to True.

  • max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint; defaults to 256.

See also

Please refer to BasePolicy for more detailed explanation.
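
A hedged sketch of a continuous-action setup in which the actor returns a (mu, sigma) tuple, as mentioned in the BasePolicy notes above; the network classes, dist_fn, and hyperparameters are illustrative assumptions (and how the tuple reaches dist_fn may differ across versions, so the helper below accepts both calling conventions).

import torch
from torch import nn
from tianshou.policy import PPOPolicy


class GaussianActor(nn.Module):
    """obs -> ((mu, sigma), state) for a Gaussian policy."""
    def __init__(self, obs_dim=3, act_dim=1):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        self.log_sigma = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs, state=None, info={}):
        obs = torch.as_tensor(obs, dtype=torch.float32)
        mu = self.mu(obs)
        sigma = self.log_sigma.exp().expand_as(mu)
        return (mu, sigma), state


class Critic(nn.Module):
    """obs -> V(s)."""
    def __init__(self, obs_dim=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, obs, **kwargs):
        return self.net(torch.as_tensor(obs, dtype=torch.float32))


def dist_fn(*logits):
    # Accept either dist_fn((mu, sigma)) or dist_fn(mu, sigma).
    if len(logits) == 1:
        logits = logits[0]
    mu, sigma = logits
    # Independent(..., 1) sums over the action dimension so that log_prob has
    # shape [batch_size], avoiding the broadcasting pitfall in the Warning under learn().
    return torch.distributions.Independent(torch.distributions.Normal(mu, sigma), 1)


actor, critic = GaussianActor(), Critic()
optim = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)
policy = PPOPolicy(actor, critic, optim, dist_fn,
                   eps_clip=0.2, action_range=(-2.0, 2.0))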

forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, **kwargs: Any) → tianshou.data.batch.Batch[source]

Compute action over the given batch data.

Returns

A Batch which has 4 keys:

  • act the action.

  • logits the network’s raw output.

  • dist the action distribution.

  • state the hidden state.

See also

Please refer to forward() for more detailed explanation.

learn(batch: tianshou.data.batch.Batch, batch_size: int, repeat: int, **kwargs: Any) → Dict[str, List[float]][source]

Update policy with a given batch of data.

Returns

A dict which includes loss and its corresponding label.

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

model: torch.nn.modules.module.Module
process_fn(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.ReplayBuffer, indice: numpy.ndarray) → tianshou.data.batch.Batch[source]

Compute the discounted returns for each frame.

\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]

where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).

training: bool
class tianshou.policy.PSRLPolicy(trans_count_prior: numpy.ndarray, rew_mean_prior: numpy.ndarray, rew_std_prior: numpy.ndarray, discount_factor: float = 0.99, epsilon: float = 0.01, add_done_loop: bool = False, **kwargs: Any)[source]

Bases: tianshou.policy.base.BasePolicy

Implementation of Posterior Sampling Reinforcement Learning.

Reference: Strens M. A Bayesian framework for reinforcement learning. ICML, 2000: 943-950.

Parameters
  • trans_count_prior (np.ndarray) – dirichlet prior (alphas), with shape (n_state, n_action, n_state).

  • rew_mean_prior (np.ndarray) – means of the normal priors of rewards, with shape (n_state, n_action).

  • rew_std_prior (np.ndarray) – standard deviations of the normal priors of rewards, with shape (n_state, n_action).

  • discount_factor (float) – in [0, 1].

  • epsilon (float) – for precision control in value iteration.

  • add_done_loop (bool) – whether to add an extra self-loop for the terminal state in MDP, defaults to False.

See also

Please refer to BasePolicy for more detailed explanation.
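
A hedged construction sketch for a small tabular MDP; the state/action counts and the flat priors are illustrative assumptions.

import numpy as np
from tianshou.policy import PSRLPolicy

n_state, n_action = 10, 4

trans_count_prior = np.ones((n_state, n_action, n_state))  # Dirichlet alphas
rew_mean_prior = np.zeros((n_state, n_action))             # Normal prior means
rew_std_prior = np.ones((n_state, n_action))               # Normal prior stds

policy = PSRLPolicy(trans_count_prior, rew_mean_prior, rew_std_prior,
                    discount_factor=0.99, epsilon=0.01)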

forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, **kwargs: Any) → tianshou.data.batch.Batch[source]

Compute action over the given batch data with PSRL model.

Returns

A Batch with “act” key containing the action.

See also

Please refer to forward() for more detailed explanation.

learn(batch: tianshou.data.batch.Batch, *args: Any, **kwargs: Any) → Dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict which includes loss and its corresponding label.

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

training: bool
class tianshou.policy.QRDQNPolicy(model: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, num_quantiles: int = 200, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)[source]

Bases: tianshou.policy.modelfree.dqn.DQNPolicy

Implementation of Quantile Regression Deep Q-Network. arXiv:1710.10044.

Parameters
  • model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)

  • optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.

  • discount_factor (float) – in [0, 1].

  • num_quantiles (int) – the number of quantile midpoints in the inverse cumulative distribution function of the value, defaults to 200.

  • estimation_step (int) – the number of steps to look ahead (n-step return), defaults to 1.

  • target_update_freq (int) – the target network update frequency (0 if you do not use the target network).

  • reward_normalization (bool) – normalize the reward to Normal(0, 1), defaults to False.

See also

Please refer to DQNPolicy for more detailed explanation.

compute_q_value(logits: torch.Tensor) → torch.Tensor[source]

Compute the q value based on the network’s raw output logits.

learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict which includes loss and its corresponding label.

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

training: bool
class tianshou.policy.RandomPolicy(observation_space: Optional[gym.spaces.space.Space] = None, action_space: Optional[gym.spaces.space.Space] = None)[source]

Bases: tianshou.policy.base.BasePolicy

A random agent used in multi-agent learning.

It randomly chooses an action from the legal actions.

forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, **kwargs: Any) → tianshou.data.batch.Batch[source]

Compute the random action over the given batch data.

The input should contain a mask in batch.obs, where “True” means the action is available and “False” means it is unavailable. For example, batch.obs.mask == np.array([[False, True, False]]) means that with batch size 1, action “1” is available but actions “0” and “2” are unavailable.

Returns

A Batch with “act” key, containing the random action.

See also

Please refer to forward() for more detailed explanation.

learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]

Since a random agent learns nothing, it returns an empty dict.

training: bool
class tianshou.policy.SACPolicy(actor: torch.nn.modules.module.Module, actor_optim: torch.optim.optimizer.Optimizer, critic1: torch.nn.modules.module.Module, critic1_optim: torch.optim.optimizer.Optimizer, critic2: torch.nn.modules.module.Module, critic2_optim: torch.optim.optimizer.Optimizer, action_range: Tuple[float, float], tau: float = 0.005, gamma: float = 0.99, alpha: Union[float, Tuple[float, torch.Tensor, torch.optim.optimizer.Optimizer]] = 0.2, reward_normalization: bool = False, ignore_done: bool = False, estimation_step: int = 1, exploration_noise: Optional[tianshou.exploration.random.BaseNoise] = None, deterministic_eval: bool = True, **kwargs: Any)[source]

Bases: tianshou.policy.modelfree.ddpg.DDPGPolicy

Implementation of Soft Actor-Critic. arXiv:1812.05905.

Parameters
  • actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)

  • actor_optim (torch.optim.Optimizer) – the optimizer for actor network.

  • critic1 (torch.nn.Module) – the first critic network. (s, a -> Q(s, a))

  • critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.

  • critic2 (torch.nn.Module) – the second critic network. (s, a -> Q(s, a))

  • critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.

  • action_range (Tuple[float, float]) – the action range (minimum, maximum).

  • tau (float) – param for soft update of the target network, defaults to 0.005.

  • gamma (float) – discount factor, in [0, 1], defaults to 0.99.

  • alpha (float or Tuple[float, torch.Tensor, torch.optim.Optimizer]) – entropy regularization coefficient, defaults to 0.2. If a tuple (target_entropy, log_alpha, alpha_optim) is provided, then alpha is automatically tuned.

  • reward_normalization (bool) – normalize the reward to Normal(0, 1), defaults to False.

  • ignore_done (bool) – ignore the done flag while training the policy, defaults to False.

  • exploration_noise (BaseNoise) – noise added to the action for exploration, defaults to None. This is useful when solving hard-exploration problems.

  • deterministic_eval (bool) – whether to use deterministic action (mean of Gaussian policy) instead of stochastic action sampled by the policy, defaults to True.

See also

Please refer to BasePolicy for more detailed explanation.
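
A hedged sketch of the automatic entropy tuning described above; the networks and their optimizers are assumed to already exist (built as in the DDPG sketch, with two critics), and the target entropy heuristic and learning rate are common choices rather than requirements of the API.

import torch
from tianshou.policy import SACPolicy

# actor, actor_optim, critic1, critic1_optim, critic2, critic2_optim are
# assumed to exist already, following the rules in the parameter list above.
action_dim = 6                                   # illustrative action dimension
target_entropy = -float(action_dim)              # common heuristic: -dim(A)
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optim = torch.optim.Adam([log_alpha], lr=3e-4)

policy = SACPolicy(actor, actor_optim, critic1, critic1_optim,
                   critic2, critic2_optim, action_range=(-1.0, 1.0),
                   alpha=(target_entropy, log_alpha, alpha_optim))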

actor: torch.nn.modules.module.Module
actor_optim: torch.optim.optimizer.Optimizer
critic: torch.nn.modules.module.Module
critic_optim: torch.optim.optimizer.Optimizer
forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, input: str = 'obs', **kwargs: Any) → tianshou.data.batch.Batch[source]

Compute action over the given batch data.

Returns

A Batch which has 2 keys:

  • act the action.

  • state the hidden state.

See also

Please refer to forward() for more detailed explanation.

learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict which includes loss and its corresponding label.

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

sync_weight() → None[source]

Soft-update the weight for the target network.

train(mode: bool = True) → tianshou.policy.modelfree.sac.SACPolicy[source]

Set the module in training mode, except for the target network.

training: bool
class tianshou.policy.TD3Policy(actor: torch.nn.modules.module.Module, actor_optim: torch.optim.optimizer.Optimizer, critic1: torch.nn.modules.module.Module, critic1_optim: torch.optim.optimizer.Optimizer, critic2: torch.nn.modules.module.Module, critic2_optim: torch.optim.optimizer.Optimizer, action_range: Tuple[float, float], tau: float = 0.005, gamma: float = 0.99, exploration_noise: Optional[tianshou.exploration.random.BaseNoise] = <tianshou.exploration.random.GaussianNoise object>, policy_noise: float = 0.2, update_actor_freq: int = 2, noise_clip: float = 0.5, reward_normalization: bool = False, ignore_done: bool = False, estimation_step: int = 1, **kwargs: Any)[source]

Bases: tianshou.policy.modelfree.ddpg.DDPGPolicy

Implementation of Twin Delayed Deep Deterministic Policy Gradient (TD3). arXiv:1802.09477.

Parameters
  • actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)

  • actor_optim (torch.optim.Optimizer) – the optimizer for actor network.

  • critic1 (torch.nn.Module) – the first critic network. (s, a -> Q(s, a))

  • critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.

  • critic2 (torch.nn.Module) – the second critic network. (s, a -> Q(s, a))

  • critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.

  • action_range (Tuple[float, float]) – the action range (minimum, maximum).

  • tau (float) – param for soft update of the target network, defaults to 0.005.

  • gamma (float) – discount factor, in [0, 1], defaults to 0.99.

  • exploration_noise (BaseNoise) – the exploration noise added to the action, defaults to GaussianNoise(sigma=0.1).

  • policy_noise (float) – the noise used in updating policy network, default to 0.2.

  • update_actor_freq (int) – the update frequency of actor network, default to 2.

  • noise_clip (float) – the clipping range used in updating policy network, default to 0.5.

  • reward_normalization (bool) – normalize the reward to Normal(0, 1), defaults to False.

  • ignore_done (bool) – ignore the done flag while training the policy, defaults to False.

See also

Please refer to BasePolicy for more detailed explanation.

actor: torch.nn.modules.module.Module
actor_optim: torch.optim.optimizer.Optimizer
critic: torch.nn.modules.module.Module
critic_optim: torch.optim.optimizer.Optimizer
learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict which includes loss and its corresponding label.

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

sync_weight() → None[source]

Soft-update the weight for the target network.

train(mode: bool = True) → tianshou.policy.modelfree.td3.TD3Policy[source]

Set the module in training mode, except for the target network.

training: bool