tianshou.policy

Base

class tianshou.policy.BasePolicy(observation_space: gymnasium.spaces.space.Space | None = None, action_space: gymnasium.spaces.space.Space | None = None, action_scaling: bool = False, action_bound_method: Optional[Literal['clip', 'tanh']] = None, lr_scheduler: torch.optim.lr_scheduler.LambdaLR | tianshou.utils.lr_scheduler.MultipleLRSchedulers | None = None)[source]

Bases: ABC, Module

The base class for any RL policy.

Tianshou aims to modularize RL algorithms. It provides several classes of policies, and every policy class must inherit from BasePolicy.

A policy class typically has the following parts:

  • __init__(): initialize the policy, including copying the target network and so on;

  • forward(): compute action with given observation;

  • process_fn(): pre-process data from the replay buffer (this function can interact with replay buffer);

  • learn(): update policy with a given batch of data.

  • post_process_fn(): update the replay buffer from the learning process (e.g., prioritized replay buffer needs to update the weight);

  • update(): the main interface for training, i.e., process_fn -> learn -> post_process_fn.
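The order of calls listed above can be sketched with a toy class. This is only an illustration in plain Python: ToyPolicy, its dict-based batch, and the list used as a buffer are hypothetical stand-ins, and the sketch takes the first sample_size items instead of sampling randomly so that the result is deterministic.

```python
class ToyPolicy:
    """Hypothetical stand-in that mirrors the call order inside update()."""

    def __init__(self):
        self.updating = False
        self.calls = []

    def process_fn(self, batch, buffer, indices):
        # Pre-processing step: here we just copy rewards into "returns".
        self.calls.append("process_fn")
        batch = dict(batch)
        batch["returns"] = list(batch["rew"])
        return batch

    def learn(self, batch):
        # Learning step: return whatever should be logged.
        self.calls.append("learn")
        return {"loss": sum(batch["returns"]) / len(batch["returns"])}

    def post_process_fn(self, batch, buffer, indices):
        # Post-processing step: e.g. update priorities in the buffer.
        self.calls.append("post_process_fn")

    def update(self, sample_size, buffer):
        # sample_size == 0 means "use all data" in this sketch.
        indices = list(range(len(buffer)))[: sample_size or None]
        batch = {"rew": [buffer[i] for i in indices]}
        self.updating = True
        batch = self.process_fn(batch, buffer, indices)
        result = self.learn(batch)
        self.post_process_fn(batch, buffer, indices)
        self.updating = False
        return result


policy = ToyPolicy()
result = policy.update(0, [1.0, 2.0, 3.0])
```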

Most policies need a neural network to predict the action and an optimizer to optimize the policy. The rules for self-defined networks are:

  1. Input: observation “obs” (may be a numpy.ndarray, a torch.Tensor, a dict or any others), hidden state “state” (for RNN usage), and other information “info” provided by the environment.

  2. Output: some “logits”, the next hidden state “state”, and the intermediate result during policy forwarding procedure “policy”. The “logits” could be a tuple instead of a torch.Tensor, depending on how the policy processes the network output. For example, in PPO, the return of the network might be (mu, sigma), state for Gaussian policy. The “policy” can be a Batch of torch.Tensor or other things, which will be stored in the replay buffer and can be accessed in the policy update process (e.g. in “policy.learn()”, the “batch.policy” is what you need).

Since BasePolicy inherits torch.nn.Module, you can use BasePolicy almost the same as torch.nn.Module, for instance, loading and saving the model:

torch.save(policy.state_dict(), "policy.pth")
policy.load_state_dict(torch.load("policy.pth"))
Parameters
  • observation_space – appears unused.

  • action_space – required for action_scaling.

  • action_scaling – if True, scale the action from [-1, 1] to the range of action_space. Note that in this case, the action_space must be provided!

  • action_bound_method

  • lr_scheduler

set_agent_id(agent_id: int) None[source]

Set self.agent_id = agent_id, for MARL.

exploration_noise(act: numpy.ndarray | tianshou.data.batch.BatchProtocol, batch: RolloutBatchProtocol) numpy.ndarray | tianshou.data.batch.BatchProtocol[source]

Modify the action from policy.forward with exploration noise.

NOTE: currently does not add any noise! Needs to be overridden by subclasses to actually do something.

Parameters
  • act – a data batch or numpy.ndarray which is the action taken by policy.forward.

  • batch – the input batch for policy.forward, kept for advanced usage.

Returns

action in the same form as the input “act” but with added exploration noise.

soft_update(tgt: Module, src: Module, tau: float) None[source]

Softly update the parameters of target module towards the parameters of source module.
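A soft update is commonly written as a Polyak average, tgt = tau * src + (1 - tau) * tgt. The sketch below assumes that form and uses plain Python lists in place of module parameters; it is an illustration, not the library's code.

```python
def soft_update(tgt, src, tau):
    """Move each target value a fraction tau of the way toward the source."""
    for i, (t, s) in enumerate(zip(tgt, src)):
        tgt[i] = tau * s + (1.0 - tau) * t


target = [0.0, 10.0]
source = [1.0, 0.0]
soft_update(target, source, tau=0.1)  # target drifts slightly toward source
```

With tau = 1 the target becomes a copy of the source; with tau = 0 it never changes.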

abstract forward(batch: RolloutBatchProtocol, state: dict | tianshou.data.batch.BatchProtocol | numpy.ndarray | None = None, **kwargs: Any) BatchProtocol[source]

Compute action over the given batch data.

Returns

A Batch which MUST have the following keys:

  • act a numpy.ndarray or a torch.Tensor, the action over given batch data.

  • state a dict, a numpy.ndarray or a torch.Tensor, the internal state of the policy, None as default.

Other keys are user-defined. It depends on the algorithm. For example,

# some code
return Batch(logits=..., act=..., state=None, dist=...)

The keyword policy is reserved and the corresponding data will be stored into the replay buffer. For instance,

# some code
return Batch(..., policy=Batch(log_prob=dist.log_prob(act)))
# and in the sampled data batch, you can directly use
# batch.policy.log_prob to get your data.

Note

In continuous action space, you should do another step “map_action” to get the real action:

act = policy(batch).act  # doesn't map to the target action range
act = policy.map_action(act, batch)
map_action(act: BatchProtocol) BatchProtocol[source]
map_action(act: ndarray) ndarray
map_action(act: Tensor) Tensor

Map raw network output to action range in gym’s env.action_space.

This function is called in collect() and only affects the action sent to the env. The remapped action is not stored in the buffer and can thus be viewed as part of the env (a black-box action transformation).

Action mapping includes 2 standard procedures: bounding and scaling. The bounding procedure expects the original action range to be (-inf, inf) and maps it to [-1, 1], while the scaling procedure expects the original action range to be (-1, 1) and maps it to [action_space.low, action_space.high]. The bounding procedure is applied first.

Parameters

act – a data batch or numpy.ndarray which is the action taken by policy.forward.

Returns

action in the same form as the input “act” but remapped to the target action space.
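The bounding and scaling steps described above can be sketched for a single scalar action as follows. The function signature, the low/high arguments, and the linear scaling formula are assumptions made for illustration; the real method operates on arrays and reads the bounds from action_space.

```python
import math


def map_action(act, low, high, bound_method="clip", scaling=True):
    """Bound a raw scalar action to [-1, 1], then scale it to [low, high]."""
    # Step 1: bounding (applied first).
    if bound_method == "clip":
        act = max(-1.0, min(1.0, act))
    elif bound_method == "tanh":
        act = math.tanh(act)
    # Step 2: scaling from [-1, 1] to [low, high].
    if scaling:
        act = low + (high - low) * (act + 1.0) / 2.0
    return act


clipped = map_action(5.0, low=0.0, high=10.0)        # out-of-range input is clipped
centered = map_action(0.0, low=-2.0, high=2.0, bound_method="tanh")
```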

map_action_inverse(act: tianshou.data.batch.BatchProtocol | list | numpy.ndarray) tianshou.data.batch.BatchProtocol | list | numpy.ndarray[source]

Inverse operation to map_action().

This function is called in collect() for random initial steps. It scales [action_space.low, action_space.high] to the value ranges of policy.forward.

Parameters

act – a data batch, list or numpy.ndarray which is the action taken by gym.spaces.Box.sample().

Returns

action remapped.
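As a rough sketch of the inverse direction, the scaling is undone first and, for tanh bounding, the inverse hyperbolic tangent is applied. The signature and formulas are assumptions for illustration with scalar inputs; note that math.atanh raises ValueError at exactly -1 or 1, a corner case this sketch does not handle.

```python
import math


def map_action_inverse(act, low, high, bound_method="clip", scaling=True):
    """Map an env-range scalar action back toward the policy's output range."""
    if scaling:
        # Undo the linear map from [-1, 1] to [low, high].
        act = 2.0 * (act - low) / (high - low) - 1.0
    if bound_method == "tanh":
        # Undo tanh squashing (only defined for values strictly inside (-1, 1)).
        act = math.atanh(act)
    return act


raw = map_action_inverse(1.0, low=0.0, high=4.0)
```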

process_fn(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray) RolloutBatchProtocol[source]

Pre-process the data from the provided replay buffer.

Used in update(). Check out policy.process_fn for more information.

abstract learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, Any][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

post_process_fn(batch: BatchProtocol, buffer: ReplayBuffer, indices: ndarray) None[source]

Post-process the data from the provided replay buffer.

This will only have an effect if the buffer has the method update_weight and the batch has the attribute weight.

Typical usage is to update the sampling weight in prioritized experience replay. Used in update().

update(sample_size: int, buffer: tianshou.data.buffer.base.ReplayBuffer | None, **kwargs: Any) dict[str, Any][source]

Update the policy network and replay buffer.

It includes 3 function steps: process_fn, learn, and post_process_fn. In addition, this function will change the value of self.updating: it will be False before this function and will be True when executing update(). Please refer to States for policy for more detailed explanation.

Parameters
  • sample_size (int) – 0 means it will extract all the data from the buffer, otherwise it will sample a batch with given sample_size.

  • buffer (ReplayBuffer) – the corresponding replay buffer.

Returns

A dict, including the data needed to be logged (e.g., loss) from policy.learn().

static value_mask(buffer: ReplayBuffer, indices: ndarray) ndarray[source]

Value mask determines whether the obs_next of buffer[indices] is valid.

For instance, “obs_next” after a “done” flag is usually considered invalid: its q/advantage value carries meaningless (even misleading) information and should be set to 0 by hand. But if the “done” flag was raised because the episode hit a time limit (info[“TimeLimit.truncated”] is set to True in gym’s settings), “obs_next” is still valid. The value mask is typically used to help calculate the correct q/advantage value.

Parameters
  • buffer (ReplayBuffer) – the corresponding replay buffer.

  • indices (numpy.ndarray) – indices of replay buffer whose “obs_next” will be judged.

Returns

A bool type numpy.ndarray with the same shape as indices. “True” means “obs_next” of that buffer[indices] is valid.
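The rule described above can be sketched with plain Python lists. Here done and truncated are hypothetical per-step flags standing in for what the method reads from the buffer.

```python
def value_mask(done, truncated):
    """Return True where obs_next is still a valid state to bootstrap from."""
    # Valid if the episode has not ended, or if it ended only by time limit.
    return [(not d) or t for d, t in zip(done, truncated)]


mask = value_mask(
    done=[False, True, True],
    truncated=[False, False, True],
)
```

Multiplying a next-state value estimate by this mask zeroes it exactly at true terminal steps.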

static compute_episodic_return(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray, v_s_: numpy.ndarray | torch.Tensor | None = None, v_s: numpy.ndarray | torch.Tensor | None = None, gamma: float = 0.99, gae_lambda: float = 0.95) tuple[numpy.ndarray, numpy.ndarray][source]

Compute returns over given batch.

Uses the Generalized Advantage Estimator (arXiv:1506.02438) to calculate the q/advantage value of the given batch. Returns are calculated as advantage + value, which is exactly equivalent to using \(TD(\lambda)\) for estimating returns.

Parameters
  • batch – a data batch which contains several episodes of data in sequential order. Note that the end of each finished episode in the batch should be marked by a done flag; unfinished (or collecting) episodes will be recognized by buffer.unfinished_index().

  • buffer – the corresponding replay buffer.

  • indices (numpy.ndarray) – tells the batch’s location in the buffer; batch is equal to buffer[indices].

  • v_s_ (np.ndarray) – the value function of all next states \(V(s')\). If None, it will be set to an array of 0.

  • v_s – the value function of all current states \(V(s)\).

  • gamma (float) – the discount factor, should be in [0, 1]. Default to 0.99.

  • gae_lambda (float) – the parameter for Generalized Advantage Estimation, should be in [0, 1]. Default to 0.95.

Returns

two numpy arrays (returns, advantage) with each shape (bsz, ).
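The GAE recursion can be sketched in plain Python as a backward pass over the batch. This is a simplified illustration and not the library's implementation: end_flag is a hypothetical per-step marker for episode ends, and v_s_ is assumed to be already zeroed wherever obs_next is invalid.

```python
def gae_return(rew, end_flag, v_s, v_s_, gamma=0.99, gae_lambda=0.95):
    """Compute (returns, advantage) with a backward GAE recursion."""
    n = len(rew)
    advantage = [0.0] * n
    gae = 0.0
    for i in reversed(range(n)):
        # One-step TD error at step i.
        delta = rew[i] + gamma * v_s_[i] - v_s[i]
        if end_flag[i]:
            gae = 0.0  # do not carry advantage across an episode boundary
        gae = delta + gamma * gae_lambda * gae
        advantage[i] = gae
    # Returns are advantage + value, as stated above.
    returns = [a + v for a, v in zip(advantage, v_s)]
    return returns, advantage


zeros = [0.0, 0.0, 0.0, 0.0]
returns, advantage = gae_return(
    rew=[1.0, 1.0, 1.0, 1.0],
    end_flag=[False, True, False, True],
    v_s=zeros,
    v_s_=zeros,
    gamma=1.0,
    gae_lambda=1.0,
)
```

With zero value estimates and gamma = gae_lambda = 1, the advantage reduces to the undiscounted reward-to-go within each episode.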

static compute_nstep_return(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray, target_q_fn: Callable[[ReplayBuffer, ndarray], Tensor], gamma: float = 0.99, n_step: int = 1, rew_norm: bool = False) BatchWithReturnsProtocol[source]

Compute n-step return for Q-learning targets.

\[G_t = \sum_{i = t}^{t + n - 1} \gamma^{i - t}(1 - d_i)r_i + \gamma^n (1 - d_{t + n}) Q_{\mathrm{target}}(s_{t + n})\]

where \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\), \(d_t\) is the done flag of step \(t\).

Parameters
  • batch (Batch) – a data batch, which is equal to buffer[indices].

  • buffer (ReplayBuffer) – the data buffer.

  • indices – tells the batch’s location in the buffer.

  • target_q_fn (function) – a function which computes the target Q value of “obs_next” given the data buffer and the wanted indices.

  • gamma (float) – the discount factor, should be in [0, 1]. Default to 0.99.

  • n_step (int) – the number of estimation step, should be an int greater than 0. Default to 1.

  • rew_norm (bool) – normalize the reward to Normal(0, 1), Default to False.

Returns

a Batch. The result will be stored in batch.returns as a torch.Tensor with the same shape as target_q_fn’s return tensor.
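One way to read the n-step formula is sketched below for a single start index t, using plain lists. The names rew, done, and target_q are hypothetical: target_q[i] stands for the target Q value of obs_next at step i. The sketch stops accumulating at a terminal step and then skips the bootstrap term, and it omits reward normalization.

```python
def nstep_return(rew, done, target_q, t, gamma=0.99, n_step=1):
    """n-step return starting at step t, truncated at episode end."""
    last = min(t + n_step, len(rew)) - 1  # last step index actually used
    g = 0.0
    for i in range(t, last + 1):
        g += gamma ** (i - t) * rew[i]
        if done[i]:
            return g  # terminal step: nothing to bootstrap from
    # Bootstrap from the state reached after the last accumulated step.
    return g + gamma ** (last - t + 1) * target_q[last]


rew = [1.0, 1.0, 1.0, 1.0]
done = [False, False, False, True]
target_q = [10.0, 10.0, 10.0, 10.0]
g0 = nstep_return(rew, done, target_q, t=0, gamma=0.5, n_step=2)
g2 = nstep_return(rew, done, target_q, t=2, gamma=0.5, n_step=2)
```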

class tianshou.policy.RandomPolicy(observation_space: gymnasium.spaces.space.Space | None = None, action_space: gymnasium.spaces.space.Space | None = None, action_scaling: bool = False, action_bound_method: Optional[Literal['clip', 'tanh']] = None, lr_scheduler: torch.optim.lr_scheduler.LambdaLR | tianshou.utils.lr_scheduler.MultipleLRSchedulers | None = None)[source]

Bases: BasePolicy

A random agent used in multi-agent learning.

It randomly chooses an action from the legal actions.

forward(batch: RolloutBatchProtocol, state: dict | tianshou.data.batch.BatchProtocol | numpy.ndarray | None = None, **kwargs: Any) ActBatchProtocol[source]

Compute the random action over the given batch data.

The input should contain a mask in batch.obs, where “True” marks an available action and “False” an unavailable one. For example, batch.obs.mask == np.array([[False, True, False]]) means that, with batch size 1, action “1” is available but actions “0” and “2” are unavailable.

Returns

A Batch with “act” key, containing the random action.

See also

Please refer to forward() for more detailed explanation.
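Choosing a random legal action from a mask can be sketched as follows. The helper name and the use of Python's random module are assumptions for illustration; the mask is a list of rows, one per batch element.

```python
import random


def random_legal_actions(mask, rng=random):
    """Pick one action index per row, uniformly among entries that are True."""
    return [
        rng.choice([a for a, ok in enumerate(row) if ok])
        for row in mask
    ]


acts = random_legal_actions([[False, True, False], [True, False, True]])
```

The first row has a single legal action, so it always yields action 1; the second row yields either 0 or 2.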

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Since a random agent learns nothing, it returns an empty dict.

Model-free

DQN Family

class tianshou.policy.DQNPolicy(model: Module, optim: Optimizer, discount_factor: float = 0.99, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, is_double: bool = True, clip_loss_grad: bool = False, **kwargs: Any)[source]

Bases: BasePolicy

Implementation of Deep Q Network. arXiv:1312.5602.

Implementation of Double Q-Learning. arXiv:1509.06461.

Implementation of Dueling DQN. arXiv:1511.06581 (the dueling DQN is implemented in the network side, not here).

Parameters
  • model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)

  • optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.

  • discount_factor (float) – in [0, 1].

  • estimation_step (int) – the number of steps to look ahead. Default to 1.

  • target_update_freq (int) – the target network update frequency (0 if you do not use the target network). Default to 0.

  • reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.

  • is_double (bool) – use double dqn. Default to True.

  • clip_loss_grad (bool) – clip the gradient of the loss in accordance with nature14236; this amounts to using the Huber loss instead of the MSE loss. Default to False.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).

See also

Please refer to BasePolicy for more detailed explanation.

set_eps(eps: float) None[source]

Set the eps for epsilon-greedy exploration.
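For reference, epsilon-greedy selection for a single observation can be sketched like this. The function and its arguments are hypothetical; it assumes that with probability eps a uniformly random legal action replaces the greedy one.

```python
import random


def epsilon_greedy(q_values, eps, mask=None, rng=random):
    """Return a random legal action with probability eps, else the greedy one."""
    legal = [a for a in range(len(q_values)) if mask is None or mask[a]]
    if rng.random() < eps:
        return rng.choice(legal)
    return max(legal, key=lambda a: q_values[a])


greedy = epsilon_greedy([1.0, 5.0, 3.0], eps=0.0)
masked = epsilon_greedy([1.0, 5.0, 3.0], eps=0.0, mask=[True, False, True])
```

Setting eps to 0 gives purely greedy behaviour, which is a common choice at evaluation time.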

train(mode: bool = True) DQNPolicy[source]

Set the module in training mode, except for the target network.

sync_weight() None[source]

Synchronize the weight for the target network.

process_fn(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray) BatchWithReturnsProtocol[source]

Compute the n-step return for Q-learning targets.

More details can be found at compute_nstep_return().

compute_q_value(logits: Tensor, mask: numpy.ndarray | None) Tensor[source]

Compute the q value based on the network’s raw output and action mask.
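One common way to apply an action mask to Q values is to shift every illegal entry below the smallest legal one, so that an argmax never selects it. The sketch below assumes that approach for a single row of logits; the helper name is hypothetical.

```python
def mask_q_values(logits, mask):
    """Push illegal actions below every legal one so argmax ignores them."""
    # Any illegal entry becomes at most min(logits) - 1 after the shift.
    min_value = min(logits) - max(logits) - 1.0
    return [q if ok else q + min_value for q, ok in zip(logits, mask)]


q = mask_q_values([1.0, 2.0, 3.0], [True, True, False])
best = max(range(len(q)), key=lambda a: q[a])
```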

forward(batch: RolloutBatchProtocol, state: dict | tianshou.data.batch.BatchProtocol | numpy.ndarray | None = None, model: str = 'model', input: str = 'obs', **kwargs: Any) ModelOutputBatchProtocol[source]

Compute action over the given batch data.

If you need to mask the action, please add a “mask” into batch.obs, for example, if we have an environment that has “0/1/2” three actions:

batch == Batch(
    obs=Batch(
        obs="original obs, with batch_size=1 for demonstration",
        mask=np.array([[False, True, False]]),
        # action 1 is available
        # action 0 and 2 are unavailable
    ),
    ...
)
Returns

A Batch which has 3 keys:

  • act the action.

  • logits the network’s raw output.

  • state the hidden state.

See also

Please refer to forward() for more detailed explanation.

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

exploration_noise(act: numpy.ndarray | tianshou.data.batch.BatchProtocol, batch: RolloutBatchProtocol) numpy.ndarray | tianshou.data.batch.BatchProtocol[source]

Modify the action from policy.forward with exploration noise.

NOTE: currently does not add any noise! Needs to be overridden by subclasses to actually do something.

Parameters
  • act – a data batch or numpy.ndarray which is the action taken by policy.forward.

  • batch – the input batch for policy.forward, kept for advanced usage.

Returns

action in the same form of input “act” but with added exploration noise.

class tianshou.policy.BranchingDQNPolicy(model: BranchingNet, optim: Optimizer, discount_factor: float = 0.99, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, is_double: bool = True, **kwargs: Any)[source]

Bases: DQNPolicy

Implementation of the Branching Dueling Q-Network (BDQ). arXiv:1711.08946.

Parameters
  • model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)

  • optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.

  • discount_factor (float) – in [0, 1].

  • estimation_step (int) – the number of steps to look ahead. Default to 1.

  • target_update_freq (int) – the target network update frequency (0 if you do not use the target network). Default to 0.

  • reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.

  • is_double (bool) – use double network. Default to True.

See also

Please refer to BasePolicy for more detailed explanation.

process_fn(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray) BatchWithReturnsProtocol[source]

Compute the 1-step return for BDQ targets.

forward(batch: RolloutBatchProtocol, state: dict | tianshou.data.batch.BatchProtocol | numpy.ndarray | None = None, model: str = 'model', input: str = 'obs', **kwargs: Any) ModelOutputBatchProtocol[source]

Compute action over the given batch data.

If you need to mask the action, please add a “mask” into batch.obs, for example, if we have an environment that has “0/1/2” three actions:

batch == Batch(
    obs=Batch(
        obs="original obs, with batch_size=1 for demonstration",
        mask=np.array([[False, True, False]]),
        # action 1 is available
        # action 0 and 2 are unavailable
    ),
    ...
)
Returns

A Batch which has 3 keys:

  • act the action.

  • logits the network’s raw output.

  • state the hidden state.

See also

Please refer to forward() for more detailed explanation.

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

exploration_noise(act: numpy.ndarray | tianshou.data.batch.BatchProtocol, batch: RolloutBatchProtocol) numpy.ndarray | tianshou.data.batch.BatchProtocol[source]

Modify the action from policy.forward with exploration noise.

NOTE: currently does not add any noise! Needs to be overridden by subclasses to actually do something.

Parameters
  • act – a data batch or numpy.ndarray which is the action taken by policy.forward.

  • batch – the input batch for policy.forward, kept for advanced usage.

Returns

action in the same form of input “act” but with added exploration noise.

class tianshou.policy.C51Policy(model: Module, optim: Optimizer, discount_factor: float = 0.99, num_atoms: int = 51, v_min: float = -10.0, v_max: float = 10.0, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)[source]

Bases: DQNPolicy

Implementation of Categorical Deep Q-Network. arXiv:1707.06887.

Parameters
  • model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)

  • optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.

  • discount_factor (float) – in [0, 1].

  • num_atoms (int) – the number of atoms in the support set of the value distribution. Default to 51.

  • v_min (float) – the value of the smallest atom in the support set. Default to -10.0.

  • v_max (float) – the value of the largest atom in the support set. Default to 10.0.

  • estimation_step (int) – the number of steps to look ahead. Default to 1.

  • target_update_freq (int) – the target network update frequency (0 if you do not use the target network). Default to 0.

  • reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).

See also

Please refer to DQNPolicy for more detailed explanation.

compute_q_value(logits: Tensor, mask: numpy.ndarray | None) Tensor[source]

Compute the q value based on the network’s raw output and action mask.
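In a categorical (C51-style) value distribution, the Q value of an action is the expectation of the atoms under the predicted probabilities. The sketch below builds the evenly spaced support from v_min, v_max, and num_atoms and computes that expectation for one action; the helper names are hypothetical.

```python
def c51_support(v_min=-10.0, v_max=10.0, num_atoms=51):
    """Evenly spaced atoms between v_min and v_max (inclusive)."""
    step = (v_max - v_min) / (num_atoms - 1)
    return [v_min + i * step for i in range(num_atoms)]


def expected_q(probs, support):
    """Q value as the mean of the atom values under the distribution."""
    return sum(p * z for p, z in zip(probs, support))


support = c51_support()
uniform = [1.0 / len(support)] * len(support)
q_uniform = expected_q(uniform, support)  # symmetric support, so close to 0
```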

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

class tianshou.policy.RainbowPolicy(model: Module, optim: Optimizer, discount_factor: float = 0.99, num_atoms: int = 51, v_min: float = -10.0, v_max: float = 10.0, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)[source]

Bases: C51Policy

Implementation of Rainbow DQN. arXiv:1710.02298.

Parameters
  • model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)

  • optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.

  • discount_factor (float) – in [0, 1].

  • num_atoms (int) – the number of atoms in the support set of the value distribution. Default to 51.

  • v_min (float) – the value of the smallest atom in the support set. Default to -10.0.

  • v_max (float) – the value of the largest atom in the support set. Default to 10.0.

  • estimation_step (int) – the number of steps to look ahead. Default to 1.

  • target_update_freq (int) – the target network update frequency (0 if you do not use the target network). Default to 0.

  • reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).

See also

Please refer to C51Policy for more detailed explanation.

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

class tianshou.policy.QRDQNPolicy(model: Module, optim: Optimizer, discount_factor: float = 0.99, num_quantiles: int = 200, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)[source]

Bases: DQNPolicy

Implementation of Quantile Regression Deep Q-Network. arXiv:1710.10044.

Parameters
  • model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)

  • optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.

  • discount_factor (float) – in [0, 1].

  • num_quantiles (int) – the number of quantile midpoints in the inverse cumulative distribution function of the value. Default to 200.

  • estimation_step (int) – the number of steps to look ahead. Default to 1.

  • target_update_freq (int) – the target network update frequency (0 if you do not use the target network).

  • reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).

See also

Please refer to DQNPolicy for more detailed explanation.

compute_q_value(logits: Tensor, mask: numpy.ndarray | None) Tensor[source]

Compute the q value based on the network’s raw output and action mask.
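In quantile regression DQN, the network predicts values at fixed quantile midpoints, and the Q value of an action is the mean of those quantile values. The sketch below illustrates both pieces for a single action; the helper names are hypothetical.

```python
def quantile_midpoints(num_quantiles):
    """Midpoints (2i + 1) / (2N) of N equal-width quantile intervals."""
    return [(2 * i + 1) / (2 * num_quantiles) for i in range(num_quantiles)]


def q_from_quantiles(quantile_values):
    """Q value as the average of the predicted quantile values."""
    return sum(quantile_values) / len(quantile_values)


taus = quantile_midpoints(4)
q_value = q_from_quantiles([1.0, 2.0, 3.0, 6.0])
```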

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

class tianshou.policy.IQNPolicy(model: Module, optim: Optimizer, discount_factor: float = 0.99, sample_size: int = 32, online_sample_size: int = 8, target_sample_size: int = 8, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)[source]

Bases: QRDQNPolicy

Implementation of Implicit Quantile Network. arXiv:1806.06923.

Parameters
  • model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)

  • optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.

  • discount_factor (float) – in [0, 1].

  • sample_size (int) – the number of samples for policy evaluation. Default to 32.

  • online_sample_size (int) – the number of samples for online model in training. Default to 8.

  • target_sample_size (int) – the number of samples for target model in training. Default to 8.

  • estimation_step (int) – the number of steps to look ahead. Default to 1.

  • target_update_freq (int) – the target network update frequency (0 if you do not use the target network).

  • reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).

See also

Please refer to QRDQNPolicy for more detailed explanation.

forward(batch: RolloutBatchProtocol, state: dict | tianshou.data.batch.BatchProtocol | numpy.ndarray | None = None, model: str = 'model', input: str = 'obs', **kwargs: Any) QuantileRegressionBatchProtocol[source]

Compute action over the given batch data.

If you need to mask the action, please add a “mask” into batch.obs, for example, if we have an environment that has “0/1/2” three actions:

batch == Batch(
    obs=Batch(
        obs="original obs, with batch_size=1 for demonstration",
        mask=np.array([[False, True, False]]),
        # action 1 is available
        # action 0 and 2 are unavailable
    ),
    ...
)
Returns

A Batch which has 3 keys:

  • act the action.

  • logits the network’s raw output.

  • state the hidden state.

See also

Please refer to forward() for more detailed explanation.

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

class tianshou.policy.FQFPolicy(model: FullQuantileFunction, optim: Optimizer, fraction_model: FractionProposalNetwork, fraction_optim: Optimizer, discount_factor: float = 0.99, num_fractions: int = 32, ent_coef: float = 0.0, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)[source]

Bases: QRDQNPolicy

Implementation of Fully-parameterized Quantile Function. arXiv:1911.02140.

Parameters
  • model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)

  • optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.

  • fraction_model (FractionProposalNetwork) – a FractionProposalNetwork for proposing fractions/quantiles given state.

  • fraction_optim (torch.optim.Optimizer) – a torch.optim for optimizing the fraction model above.

  • discount_factor (float) – in [0, 1].

  • num_fractions (int) – the number of fractions to use. Default to 32.

  • ent_coef (float) – the coefficient for entropy loss. Default to 0.

  • estimation_step (int) – the number of steps to look ahead. Default to 1.

  • target_update_freq (int) – the target network update frequency (0 if you do not use the target network).

  • reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).

See also

Please refer to QRDQNPolicy for more detailed explanation.

forward(batch: RolloutBatchProtocol, state: dict | tianshou.data.batch.Batch | numpy.ndarray | None = None, model: str = 'model', input: str = 'obs', fractions: tianshou.data.batch.Batch | None = None, **kwargs: Any) FQFBatchProtocol[source]

Compute action over the given batch data.

If you need to mask the action, please add a “mask” into batch.obs, for example, if we have an environment that has “0/1/2” three actions:

batch == Batch(
    obs=Batch(
        obs="original obs, with batch_size=1 for demonstration",
        mask=np.array([[False, True, False]]),
        # action 1 is available
        # action 0 and 2 are unavailable
    ),
    ...
)
Returns

A Batch which has 3 keys:

  • act the action.

  • logits the network’s raw output.

  • state the hidden state.

See also

Please refer to forward() for more detailed explanation.

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

On-policy

class tianshou.policy.PGPolicy(model: Module, optim: Optimizer, dist_fn: Callable[[torch.Tensor | tuple[torch.Tensor]], Distribution], discount_factor: float = 0.99, reward_normalization: bool = False, action_scaling: bool = True, action_bound_method: Optional[Literal['clip', 'tanh']] = 'clip', deterministic_eval: bool = False, **kwargs: Any)[source]

Bases: BasePolicy

Implementation of REINFORCE algorithm.

Parameters
  • model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)

  • optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.

  • dist_fn – distribution class for computing the action.

  • discount_factor (float) – in [0, 1]. Default to 0.99.

  • action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_space.low, action_space.high]. Default to True.

  • action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.

  • action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).

  • deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.

See also

Please refer to BasePolicy for more detailed explanation.

process_fn(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray) BatchWithReturnsProtocol[source]

Compute the discounted returns (Monte Carlo estimates) for each transition.

They are added to the batch under the field returns. Note: this function will modify the input batch!

\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]

where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).

Parameters
  • batch – a data batch which contains several episodes of data in sequential order. The end of each finished episode in the batch should be marked by a done flag; unfinished (or still collecting) episodes are recognized via buffer.unfinished_index().

  • buffer – the corresponding replay buffer.

  • indices (numpy.ndarray) – tell batch’s location in buffer, batch is equal to buffer[indices].
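The return formula above can be sketched in plain Python as a backward pass over the rewards. This is a minimal illustration, not Tianshou's implementation: the function name `discounted_returns` is hypothetical, and the sketch resets the running sum at every done flag and does not handle unfinished episodes.

```python
def discounted_returns(rewards, dones, gamma=0.99):
    """Compute G_t = sum_{i>=t} gamma^(i-t) * r_i within each episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if dones[t]:
            # Step t ends an episode, so nothing after it contributes.
            running = 0.0
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [1.0, 1.0, 1.0, 2.0]
dones = [False, False, True, True]   # two episodes: steps 0-2 and step 3
returns = discounted_returns(rewards, dones, gamma=0.5)
```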

forward(batch: RolloutBatchProtocol, state: dict | tianshou.data.batch.BatchProtocol | numpy.ndarray | None = None, **kwargs: Any) DistBatchProtocol[source]

Compute action over the given batch data by applying the actor.

Will sample from the dist_fn, if appropriate. Returns a new object representing the processed batch data (in contrast to other methods, which modify the input batch in place).

See also

Please refer to forward() for more detailed explanation.

learn(batch: RolloutBatchProtocol, batch_size: int, repeat: int, *args: Any, **kwargs: Any) dict[str, list[float]][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

class tianshou.policy.NPGPolicy(actor: Module, critic: Module, optim: Optimizer, dist_fn: Callable[[torch.Tensor | tuple[torch.Tensor]], Distribution], advantage_normalization: bool = True, optim_critic_iters: int = 5, actor_step_size: float = 0.5, **kwargs: Any)[source]

Bases: A2CPolicy

Implementation of Natural Policy Gradient.

https://proceedings.neurips.cc/paper/2001/file/4b86abe48d358ecf194c56c69108433e-Paper.pdf

Parameters
  • actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)

  • critic (torch.nn.Module) – the critic network. (s -> V(s))

  • optim (torch.optim.Optimizer) – the optimizer for actor and critic network.

  • dist_fn – distribution class for computing the action.

  • advantage_normalization (bool) – whether to do per mini-batch advantage normalization. Default to True.

  • optim_critic_iters (int) – Number of times to optimize critic network per update. Default to 5.

  • gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation. Default to 0.95.

  • reward_normalization (bool) – normalize estimated values to have std close to 1. Default to False.

  • max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 256.

  • action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_space.low, action_space.high]. Default to True.

  • action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.

  • action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).

  • deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.

process_fn(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray) BatchWithAdvantagesProtocol[source]

Compute the discounted returns (Monte Carlo estimates) for each transition.

They are added to the batch under the field returns. Note: this function will modify the input batch!

\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]

where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).

Parameters
  • batch – a data batch which contains several episodes of data in sequential order. The end of each finished episode in the batch should be marked by a done flag; unfinished (or still collecting) episodes are recognized via buffer.unfinished_index().

  • buffer – the corresponding replay buffer.

  • indices (numpy.ndarray) – tell batch’s location in buffer, batch is equal to buffer[indices].

learn(batch: Batch, batch_size: int, repeat: int, **kwargs: Any) dict[str, list[float]][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

class tianshou.policy.A2CPolicy(actor: Module, critic: Module, optim: Optimizer, dist_fn: Callable[[torch.Tensor | tuple[torch.Tensor]], Distribution], vf_coef: float = 0.5, ent_coef: float = 0.01, max_grad_norm: float | None = None, gae_lambda: float = 0.95, max_batchsize: int = 256, **kwargs: Any)[source]

Bases: PGPolicy

Implementation of Synchronous Advantage Actor-Critic. arXiv:1602.01783.

Parameters
  • actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)

  • critic (torch.nn.Module) – the critic network. (s -> V(s))

  • optim (torch.optim.Optimizer) – the optimizer for actor and critic network.

  • dist_fn – distribution class for computing the action.

  • discount_factor (float) – in [0, 1]. Default to 0.99.

  • vf_coef (float) – weight for value loss. Default to 0.5.

  • ent_coef (float) – weight for entropy loss. Default to 0.01.

  • max_grad_norm (float) – clipping gradients in back propagation. Default to None.

  • gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation. Default to 0.95.

  • reward_normalization (bool) – normalize estimated values to have std close to 1. Default to False.

  • max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 256.

  • action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_space.low, action_space.high]. Default to True.

  • action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.

  • action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).

  • deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
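To make the role of gae_lambda concrete, here is a minimal plain-Python sketch of Generalized Advantage Estimation. It is an illustration under simplifying assumptions, not Tianshou's code: the function name `gae_advantages` and its argument layout are hypothetical, and the value estimates are passed in as plain lists.

```python
def gae_advantages(rewards, values, next_values, dones,
                   gamma=0.99, gae_lambda=0.95):
    """Backward recursion A_t = delta_t + gamma * lambda * A_{t+1}."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; no bootstrapping past a terminal step.
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        running = delta + gamma * gae_lambda * not_done * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0]
values = [0.5, 0.5]        # V(s_t), hypothetical critic outputs
next_values = [0.5, 0.0]   # V(s_{t+1})
dones = [False, True]
td_only = gae_advantages(rewards, values, next_values, dones, 0.5, 0.0)
full_mc = gae_advantages(rewards, values, next_values, dones, 0.5, 1.0)
```

With gae_lambda = 0 the estimate reduces to the one-step TD error, while gae_lambda = 1 accumulates all later TD errors within the episode.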

See also

Please refer to BasePolicy for more detailed explanation.

process_fn(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray) BatchWithAdvantagesProtocol[source]

Compute the discounted returns (Monte Carlo estimates) for each transition.

They are added to the batch under the field returns. Note: this function will modify the input batch!

\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]

where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).

Parameters
  • batch – a data batch which contains several episodes of data in sequential order. The end of each finished episode in the batch should be marked by a done flag; unfinished (or still collecting) episodes are recognized via buffer.unfinished_index().

  • buffer – the corresponding replay buffer.

  • indices (numpy.ndarray) – tell batch’s location in buffer, batch is equal to buffer[indices].

learn(batch: RolloutBatchProtocol, batch_size: int, repeat: int, *args: Any, **kwargs: Any) dict[str, list[float]][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

class tianshou.policy.TRPOPolicy(actor: Module, critic: Module, optim: Optimizer, dist_fn: Callable[[torch.Tensor | tuple[torch.Tensor]], Distribution], max_kl: float = 0.01, backtrack_coeff: float = 0.8, max_backtracks: int = 10, **kwargs: Any)[source]

Bases: NPGPolicy

Implementation of Trust Region Policy Optimization. arXiv:1502.05477.

Parameters
  • actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)

  • critic (torch.nn.Module) – the critic network. (s -> V(s))

  • optim (torch.optim.Optimizer) – the optimizer for actor and critic network.

  • dist_fn – distribution class for computing the action.

  • advantage_normalization (bool) – whether to do per mini-batch advantage normalization. Default to True.

  • optim_critic_iters (int) – Number of times to optimize critic network per update. Default to 5.

  • max_kl (float) – max kl-divergence used to constrain each actor network update. Default to 0.01.

  • backtrack_coeff (float) – Coefficient to be multiplied by step size when constraints are not met. Default to 0.8.

  • max_backtracks (int) – Max number of backtracking times in linesearch. Default to 10.

  • gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation. Default to 0.95.

  • reward_normalization (bool) – normalize estimated values to have std close to 1. Default to False.

  • max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 256.

  • action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_space.low, action_space.high]. Default to True.

  • action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.

  • action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).

  • deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
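The interplay of max_kl, backtrack_coeff and max_backtracks can be sketched as a backtracking line search on a scalar step. This is a simplified illustration and not Tianshou's implementation: the function `backtracking_step` and the callables `kl_of` and `gain_of` are hypothetical stand-ins for evaluating the KL divergence and the surrogate improvement after a candidate update.

```python
def backtracking_step(full_step, kl_of, gain_of,
                      max_kl=0.01, backtrack_coeff=0.8, max_backtracks=10):
    """Shrink the step until the KL constraint holds and the objective improves."""
    scale = 1.0
    for _ in range(max_backtracks):
        step = scale * full_step
        if kl_of(step) < max_kl and gain_of(step) > 0:
            return step
        # Constraint violated: multiply the step size by the coefficient.
        scale *= backtrack_coeff
    return 0.0  # no acceptable step found, keep the old parameters


# Toy problem: KL grows quadratically with the step, gain grows linearly.
step = backtracking_step(0.2, kl_of=lambda s: s * s, gain_of=lambda s: s)
rejected = backtracking_step(0.2, kl_of=lambda s: s * s, gain_of=lambda s: s,
                             max_backtracks=2)
```

In this toy setting the full step violates the KL bound, so the search shrinks it four times before accepting; with only two attempts allowed, no step is taken.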

learn(batch: Batch, batch_size: int, repeat: int, **kwargs: Any) dict[str, list[float]][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

class tianshou.policy.PPOPolicy(actor: Module, critic: Module, optim: Optimizer, dist_fn: Callable[[torch.Tensor | tuple[torch.Tensor]], Distribution], eps_clip: float = 0.2, dual_clip: float | None = None, value_clip: bool = False, advantage_normalization: bool = True, recompute_advantage: bool = False, **kwargs: Any)[source]

Bases: A2CPolicy

Implementation of Proximal Policy Optimization. arXiv:1707.06347.

Parameters
  • actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)

  • critic (torch.nn.Module) – the critic network. (s -> V(s))

  • optim (torch.optim.Optimizer) – the optimizer for actor and critic network.

  • dist_fn – distribution class for computing the action.

  • discount_factor (float) – in [0, 1]. Default to 0.99.

  • eps_clip (float) – \(\epsilon\) in \(L_{CLIP}\) in the original paper. Default to 0.2.

  • dual_clip (float) – a parameter c mentioned in arXiv:1912.09729 Equ. 5, where c > 1 is a constant indicating the lower bound. Default to None (dual clipping is not used).

  • value_clip (bool) – a parameter mentioned in arXiv:1811.02553v3 Sec. 4.1. Default to False.

  • advantage_normalization (bool) – whether to do per mini-batch advantage normalization. Default to True.

  • recompute_advantage (bool) – whether to recompute advantage every update repeat according to https://arxiv.org/pdf/2006.05990.pdf Sec. 3.5. Default to False.

  • vf_coef (float) – weight for value loss. Default to 0.5.

  • ent_coef (float) – weight for entropy loss. Default to 0.01.

  • max_grad_norm (float) – clipping gradients in back propagation. Default to None.

  • gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation. Default to 0.95.

  • reward_normalization (bool) – normalize estimated values to have std close to 1, also normalize the advantage to Normal(0, 1). Default to False.

  • max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 256.

  • action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_space.low, action_space.high]. Default to True.

  • action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.

  • action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).

  • deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
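The effect of eps_clip and dual_clip on a single sample can be sketched as follows. This is a per-sample illustration in plain Python under the usual PPO definitions, not Tianshou's batched implementation; the function name `ppo_clip_objective` is hypothetical, and `ratio` stands for the probability ratio between the new and old policy.

```python
def ppo_clip_objective(ratio, advantage, eps_clip=0.2, dual_clip=None):
    """Clipped surrogate objective for one sample (to be maximized)."""
    clipped_ratio = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip)
    objective = min(ratio * advantage, clipped_ratio * advantage)
    if dual_clip is not None and advantage < 0:
        # Dual clip: bound the objective from below by c * A when A < 0.
        objective = max(objective, dual_clip * advantage)
    return objective


capped = ppo_clip_objective(1.5, 1.0)                    # ratio capped at 1 + eps
unbounded = ppo_clip_objective(10.0, -1.0)               # no lower bound
bounded = ppo_clip_objective(10.0, -1.0, dual_clip=5.0)  # bounded by c * A
```

Without dual_clip, a large ratio combined with a negative advantage yields an arbitrarily negative objective; with dual_clip set, that case is bounded.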

See also

Please refer to BasePolicy for more detailed explanation.

process_fn(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray) LogpOldProtocol[source]

Compute the discounted returns (Monte Carlo estimates) for each transition.

They are added to the batch under the field returns. Note: this function will modify the input batch!

\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]

where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).

Parameters
  • batch – a data batch which contains several episodes of data in sequential order. The end of each finished episode in the batch should be marked by a done flag; unfinished (or still collecting) episodes are recognized via buffer.unfinished_index().

  • buffer – the corresponding replay buffer.

  • indices (numpy.ndarray) – tell batch’s location in buffer, batch is equal to buffer[indices].

learn(batch: RolloutBatchProtocol, batch_size: int, repeat: int, *args: Any, **kwargs: Any) dict[str, list[float]][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

Off-policy

class tianshou.policy.DDPGPolicy(actor: torch.nn.modules.module.Module | None, actor_optim: torch.optim.optimizer.Optimizer | None, critic: torch.nn.modules.module.Module | None, critic_optim: torch.optim.optimizer.Optimizer | None, tau: float = 0.005, gamma: float = 0.99, exploration_noise: tianshou.exploration.random.BaseNoise | None = <tianshou.exploration.random.GaussianNoise object>, reward_normalization: bool = False, estimation_step: int = 1, action_scaling: bool = True, action_bound_method: ~typing.Optional[~typing.Literal['clip', 'tanh']] = 'clip', **kwargs: ~typing.Any)[source]

Bases: BasePolicy

Implementation of Deep Deterministic Policy Gradient. arXiv:1509.02971.

Parameters
  • actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)

  • actor_optim (torch.optim.Optimizer) – the optimizer for actor network.

  • critic (torch.nn.Module) – the critic network. (s, a -> Q(s, a))

  • critic_optim (torch.optim.Optimizer) – the optimizer for critic network.

  • tau (float) – param for soft update of the target network. Default to 0.005.

  • gamma (float) – discount factor, in [0, 1]. Default to 0.99.

  • exploration_noise (BaseNoise) – the exploration noise, add to the action. Default to GaussianNoise(sigma=0.1).

  • reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.

  • estimation_step (int) – the number of steps to look ahead. Default to 1.

  • action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_space.low, action_space.high]. Default to True.

  • action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action) or empty string for no bounding. Default to “clip”.

  • action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).

See also

Please refer to BasePolicy for more detailed explanation.

set_exp_noise(noise: tianshou.exploration.random.BaseNoise | None) None[source]

Set the exploration noise.

train(mode: bool = True) DDPGPolicy[source]

Set the module in training mode, except for the target network.

sync_weight() None[source]

Soft-update the weight for the target network.
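A soft update with coefficient tau blends the online weights into the target weights. The sketch below shows the rule on plain lists of floats; the helper `soft_update` is hypothetical and only illustrates the arithmetic, whereas the actual method operates on the parameters of the networks.

```python
def soft_update(target_params, source_params, tau=0.005):
    """Return tau * source + (1 - tau) * target for each parameter."""
    return [
        tau * src + (1.0 - tau) * tgt
        for tgt, src in zip(target_params, source_params)
    ]


target = [0.0, 2.0]   # hypothetical target-network weights
online = [1.0, 2.0]   # hypothetical online-network weights
blended = soft_update(target, online, tau=0.5)
```

A small tau such as the default 0.005 makes the target network track the online network slowly, while tau = 1 copies the online weights outright.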

process_fn(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray) tianshou.data.types.RolloutBatchProtocol | tianshou.data.types.BatchWithReturnsProtocol[source]

Pre-process the data from the provided replay buffer.

Used in update(). Check out policy.process_fn for more information.

forward(batch: RolloutBatchProtocol, state: dict | tianshou.data.batch.BatchProtocol | numpy.ndarray | None = None, model: Literal['actor', 'actor_old'] = 'actor', input: str = 'obs', **kwargs: Any) BatchProtocol[source]

Compute action over the given batch data.

Returns

A Batch which has 2 keys:

  • act the action.

  • state the hidden state.

See also

Please refer to forward() for more detailed explanation.

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

exploration_noise(act: numpy.ndarray | tianshou.data.batch.BatchProtocol, batch: RolloutBatchProtocol) numpy.ndarray | tianshou.data.batch.BatchProtocol[source]

Modify the action from policy.forward with exploration noise.

NOTE: currently does not add any noise! Needs to be overridden by subclasses to actually do something.

Parameters
  • act – a data batch or numpy.ndarray which is the action taken by policy.forward.

  • batch – the input batch for policy.forward, kept for advanced usage.

Returns

action in the same form of input “act” but with added exploration noise.

class tianshou.policy.TD3Policy(actor: ~torch.nn.modules.module.Module, actor_optim: ~torch.optim.optimizer.Optimizer, critic1: ~torch.nn.modules.module.Module, critic1_optim: ~torch.optim.optimizer.Optimizer, critic2: ~torch.nn.modules.module.Module, critic2_optim: ~torch.optim.optimizer.Optimizer, tau: float = 0.005, gamma: float = 0.99, exploration_noise: tianshou.exploration.random.BaseNoise | None = <tianshou.exploration.random.GaussianNoise object>, policy_noise: float = 0.2, update_actor_freq: int = 2, noise_clip: float = 0.5, reward_normalization: bool = False, estimation_step: int = 1, **kwargs: ~typing.Any)[source]

Bases: DDPGPolicy

Implementation of TD3, arXiv:1802.09477.

Parameters
  • actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)

  • actor_optim (torch.optim.Optimizer) – the optimizer for actor network.

  • critic1 (torch.nn.Module) – the first critic network. (s, a -> Q(s, a))

  • critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.

  • critic2 (torch.nn.Module) – the second critic network. (s, a -> Q(s, a))

  • critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.

  • tau (float) – param for soft update of the target network. Default to 0.005.

  • gamma (float) – discount factor, in [0, 1]. Default to 0.99.

  • exploration_noise (BaseNoise) – the exploration noise, add to the action. Default to GaussianNoise(sigma=0.1).

  • policy_noise (float) – the noise used in updating policy network. Default to 0.2.

  • update_actor_freq (int) – the update frequency of actor network. Default to 2.

  • noise_clip (float) – the clipping range used in updating policy network. Default to 0.5.

  • reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.

  • action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_space.low, action_space.high]. Default to True.

  • action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action) or empty string for no bounding. Default to “clip”.

  • action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
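The roles of policy_noise and noise_clip can be illustrated with target policy smoothing on a scalar action. This is a minimal sketch following the TD3 paper, not Tianshou's tensor-based code; the function `smoothed_target_action` and its `rng` argument are hypothetical.

```python
import random


def smoothed_target_action(action, policy_noise=0.2, noise_clip=0.5, rng=random):
    """Add clipped Gaussian noise to the target action."""
    noise = rng.gauss(0.0, policy_noise)
    if noise_clip > 0.0:
        # Keep the perturbation within [-noise_clip, noise_clip].
        noise = max(-noise_clip, min(noise_clip, noise))
    return action + noise


rng = random.Random(0)  # seeded only to make the demonstration repeatable
samples = [smoothed_target_action(0.0, 1.0, 0.5, rng) for _ in range(1000)]
```

Even with a large noise scale, every perturbed action stays within noise_clip of the original action.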

See also

Please refer to BasePolicy for more detailed explanation.

train(mode: bool = True) TD3Policy[source]

Set the module in training mode, except for the target network.

sync_weight() None[source]

Soft-update the weight for the target network.

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

class tianshou.policy.SACPolicy(actor: Module, actor_optim: Optimizer, critic1: Module, critic1_optim: Optimizer, critic2: Module, critic2_optim: Optimizer, tau: float = 0.005, gamma: float = 0.99, alpha: float | tuple[float, torch.Tensor, torch.optim.optimizer.Optimizer] = 0.2, reward_normalization: bool = False, estimation_step: int = 1, exploration_noise: tianshou.exploration.random.BaseNoise | None = None, deterministic_eval: bool = True, **kwargs: Any)[source]

Bases: DDPGPolicy

Implementation of Soft Actor-Critic. arXiv:1812.05905.

Parameters
  • actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)

  • actor_optim (torch.optim.Optimizer) – the optimizer for actor network.

  • critic1 (torch.nn.Module) – the first critic network. (s, a -> Q(s, a))

  • critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.

  • critic2 (torch.nn.Module) – the second critic network. (s, a -> Q(s, a))

  • critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.

  • tau (float) – param for soft update of the target network. Default to 0.005.

  • gamma (float) – discount factor, in [0, 1]. Default to 0.99.

  • alpha ((float, torch.Tensor, torch.optim.Optimizer) or float) – entropy regularization coefficient. Default to 0.2. If a tuple (target_entropy, log_alpha, alpha_optim) is provided, then alpha is automatically tuned.

  • reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.

  • exploration_noise (BaseNoise) – add a noise to action for exploration. Default to None. This is useful when solving hard-exploration problems.

  • deterministic_eval (bool) – whether to use deterministic action (mean of Gaussian policy) instead of stochastic action sampled by the policy. Default to True.

  • action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_space.low, action_space.high]. Default to True.

  • action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action) or empty string for no bounding. Default to “clip”.

  • action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).

See also

Please refer to BasePolicy for more detailed explanation.

train(mode: bool = True) SACPolicy[source]

Set the module in training mode, except for the target network.

sync_weight() None[source]

Soft-update the weight for the target network.

forward(batch: RolloutBatchProtocol, state: dict | tianshou.data.batch.Batch | numpy.ndarray | None = None, input: str = 'obs', **kwargs: Any) DistLogProbBatchProtocol[source]

Compute action over the given batch data.

Returns

A Batch which has 2 keys:

  • act the action.

  • state the hidden state.

See also

Please refer to forward() for more detailed explanation.

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

class tianshou.policy.REDQPolicy(actor: Module, actor_optim: Optimizer, critics: Module, critics_optim: Optimizer, ensemble_size: int = 10, subset_size: int = 2, tau: float = 0.005, gamma: float = 0.99, alpha: float | tuple[float, torch.Tensor, torch.optim.optimizer.Optimizer] = 0.2, reward_normalization: bool = False, estimation_step: int = 1, actor_delay: int = 20, exploration_noise: tianshou.exploration.random.BaseNoise | None = None, deterministic_eval: bool = True, target_mode: str = 'min', **kwargs: Any)[source]

Bases: DDPGPolicy

Implementation of REDQ. arXiv:2101.05982.

Parameters
  • actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)

  • actor_optim (torch.optim.Optimizer) – the optimizer for actor network.

  • critics (torch.nn.Module) – critic ensemble networks.

  • critics_optim (torch.optim.Optimizer) – the optimizer for the critic networks.

  • ensemble_size (int) – Number of sub-networks in the critic ensemble. Default to 10.

  • subset_size (int) – Number of networks in the subset. Default to 2.

  • tau (float) – param for soft update of the target network. Default to 0.005.

  • gamma (float) – discount factor, in [0, 1]. Default to 0.99.

  • alpha ((float, torch.Tensor, torch.optim.Optimizer) or float) – entropy regularization coefficient. Default to 0.2. If a tuple (target_entropy, log_alpha, alpha_optim) is provided, then alpha is automatically tuned.

  • reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.

  • actor_delay (int) – Number of critic updates before an actor update. Default to 20.

  • exploration_noise (BaseNoise) – add a noise to action for exploration. Default to None. This is useful when solving hard-exploration problems.

  • deterministic_eval (bool) – whether to use deterministic action (mean of Gaussian policy) instead of stochastic action sampled by the policy. Default to True.

  • target_mode (str) – method used to integrate the critic values in the subset; minimum and average are currently supported. Default to “min”.

  • action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_space.low, action_space.high]. Default to True.

  • action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action) or empty string for no bounding. Default to “clip”.

  • action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
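A minimal sketch of how ensemble_size, subset_size and target_mode interact, following the parameter descriptions above: a random subset of the critic estimates is drawn and reduced to a single target value. The function name and the rng argument are hypothetical, and the entropy term of the full target is left out.

```python
import random


def redq_target_q(q_values, subset_size=2, target_mode="min", rng=random):
    """Aggregate a random subset of ensemble estimates for one (s', a') pair."""
    subset = rng.sample(q_values, subset_size)  # sampled without replacement
    if target_mode == "min":
        return min(subset)
    return sum(subset) / len(subset)  # average over the subset


# Ten critic estimates, matching the default ensemble_size = 10.
ensemble_q = [1.0, 0.5, 2.0, 1.5, 0.8, 1.2, 0.9, 1.1, 1.7, 0.6]
target = redq_target_q(ensemble_q, subset_size=2, target_mode="min", rng=random.Random(0))
```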

See also

Please refer to BasePolicy for more detailed explanation.

train(mode: bool = True) REDQPolicy[source]

Set the module in training mode, except for the target network.

sync_weight() None[source]

Soft-update the weight for the target network.

forward(batch: Batch, state: dict | tianshou.data.batch.Batch | numpy.ndarray | None = None, input: str = 'obs', **kwargs: Any) Batch[source]

Compute action over the given batch data.

Returns

A Batch which has 2 keys:

  • act the action.

  • state the hidden state.

See also

Please refer to forward() for more detailed explanation.

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

class tianshou.policy.DiscreteSACPolicy(actor: Module, actor_optim: Optimizer, critic1: Module, critic1_optim: Optimizer, critic2: Module, critic2_optim: Optimizer, tau: float = 0.005, gamma: float = 0.99, alpha: float | tuple[float, torch.Tensor, torch.optim.optimizer.Optimizer] = 0.2, reward_normalization: bool = False, estimation_step: int = 1, **kwargs: Any)[source]

Bases: SACPolicy

Implementation of SAC for Discrete Action Settings. arXiv:1910.07207.

Parameters
  • actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)

  • actor_optim (torch.optim.Optimizer) – the optimizer for actor network.

  • critic1 (torch.nn.Module) – the first critic network. (s -> Q(s))

  • critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.

  • critic2 (torch.nn.Module) – the second critic network. (s -> Q(s))

  • critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.

  • tau (float) – param for soft update of the target network. Default to 0.005.

  • gamma (float) – discount factor, in [0, 1]. Default to 0.99.

  • alpha ((float, torch.Tensor, torch.optim.Optimizer) or float) – entropy regularization coefficient. Default to 0.2. If a tuple (target_entropy, log_alpha, alpha_optim) is provided, then alpha is automatically tuned.

  • reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).

See also

Please refer to BasePolicy for more detailed explanation.

forward(batch: Batch, state: dict | tianshou.data.batch.Batch | numpy.ndarray | None = None, input: str = 'obs', **kwargs: Any) Batch[source]

Compute action over the given batch data.

Returns

A Batch which has 2 keys:

  • act the action.

  • state the hidden state.

See also

Please refer to forward() for more detailed explanation.

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

exploration_noise(act: numpy.ndarray | tianshou.data.batch.BatchProtocol, batch: RolloutBatchProtocol) numpy.ndarray | tianshou.data.batch.BatchProtocol[source]

Modify the action from policy.forward with exploration noise.

NOTE: currently does not add any noise! Needs to be overridden by subclasses to actually do something.

Parameters
  • act – a data batch or numpy.ndarray which is the action taken by policy.forward.

  • batch – the input batch for policy.forward, kept for advanced usage.

Returns

action in the same form of input “act” but with added exploration noise.

Imitation

class tianshou.policy.ImitationPolicy(model: Module, optim: Optimizer, **kwargs: Any)[source]

Bases: BasePolicy

Implementation of vanilla imitation learning.

Parameters
  • model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> a)

  • optim (torch.optim.Optimizer) – for optimizing the model.

  • action_space (gym.Space) – env’s action space.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).

See also

Please refer to BasePolicy for more detailed explanation.

forward(batch: RolloutBatchProtocol, state: dict | tianshou.data.batch.BatchProtocol | numpy.ndarray | None = None, **kwargs: Any) ModelOutputBatchProtocol[source]

Compute action over the given batch data.

Returns

A Batch which MUST have the following keys:

  • act a numpy.ndarray or a torch.Tensor, the action over given batch data.

  • state a dict, a numpy.ndarray or a torch.Tensor, the internal state of the policy, None as default.

Other keys are user-defined. It depends on the algorithm. For example,

# some code
return Batch(logits=..., act=..., state=None, dist=...)

The keyword policy is reserved and the corresponding data will be stored into the replay buffer. For instance,

# some code
return Batch(..., policy=Batch(log_prob=dist.log_prob(act)))
# and in the sampled data batch, you can directly use
# batch.policy.log_prob to get your data.

Note

In continuous action space, you should do another step “map_action” to get the real action:

act = policy(batch).act  # doesn't map to the target action range
act = policy.map_action(act, batch)

learn(batch: RolloutBatchProtocol, *ags: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

class tianshou.policy.BCQPolicy(actor: Module, actor_optim: Optimizer, critic1: Module, critic1_optim: Optimizer, critic2: Module, critic2_optim: Optimizer, vae: VAE, vae_optim: Optimizer, device: str | torch.device = 'cpu', gamma: float = 0.99, tau: float = 0.005, lmbda: float = 0.75, forward_sampled_times: int = 100, num_sampled_action: int = 10, **kwargs: Any)[source]

Bases: BasePolicy

Implementation of BCQ algorithm. arXiv:1812.02900.

Parameters
  • actor (Perturbation) – the actor perturbation. (s, a -> perturbed a)

  • actor_optim (torch.optim.Optimizer) – the optimizer for actor network.

  • critic1 (torch.nn.Module) – the first critic network. (s, a -> Q(s, a))

  • critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.

  • critic2 (torch.nn.Module) – the second critic network. (s, a -> Q(s, a))

  • critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.

  • vae (VAE) – the VAE network, generating actions similar to those in batch. (s, a -> generated a)

  • vae_optim (torch.optim.Optimizer) – the optimizer for the VAE network.

  • device (Union[str, torch.device]) – which device to create this model on. Default to “cpu”.

  • gamma (float) – discount factor, in [0, 1]. Default to 0.99.

  • tau (float) – param for soft update of the target network. Default to 0.005.

  • lmbda (float) – param for Clipped Double Q-learning. Default to 0.75.

  • forward_sampled_times (int) – the number of sampled actions in forward function. The policy samples many actions and takes the action with the max value. Default to 100.

  • num_sampled_action (int) – the number of sampled actions in calculating target Q. The algorithm samples several actions using VAE, and perturbs each action to get the target Q. Default to 10.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
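To illustrate the role of lmbda and num_sampled_action, here is a small sketch of the target-value combination described in the BCQ paper: a convex mix of the minimum and maximum of the two critics, maximised over the sampled candidate actions. The function names are hypothetical and plain floats replace critic outputs.

```python
def clipped_double_q(q1, q2, lmbda=0.75):
    """Weighted combination of the pessimistic and optimistic critic estimates."""
    return lmbda * min(q1, q2) + (1.0 - lmbda) * max(q1, q2)


def bcq_target(candidates, lmbda=0.75):
    """candidates: list of (q1, q2) pairs, one per sampled and perturbed action."""
    return max(clipped_double_q(q1, q2, lmbda) for q1, q2 in candidates)


# Two candidate actions for one next state.
target_q = bcq_target([(1.0, 3.0), (2.0, 2.0)])
```

Setting lmbda to 1 recovers the plain minimum of the two critics; smaller values give more weight to the optimistic estimate.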

See also

Please refer to BasePolicy for more detailed explanation.

train(mode: bool = True) BCQPolicy[source]

Set the module in training mode, except for the target network.

forward(batch: RolloutBatchProtocol, state: dict | tianshou.data.batch.BatchProtocol | numpy.ndarray | None = None, **kwargs: Any) Batch[source]

Compute action over the given batch data.

sync_weight() None[source]

Soft-update the weight for the target network.

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

class tianshou.policy.CQLPolicy(actor: ActorProb, actor_optim: Optimizer, critic1: Module, critic1_optim: Optimizer, critic2: Module, critic2_optim: Optimizer, cql_alpha_lr: float = 0.0001, cql_weight: float = 1.0, tau: float = 0.005, gamma: float = 0.99, alpha: float | tuple[float, torch.Tensor, torch.optim.optimizer.Optimizer] = 0.2, temperature: float = 1.0, with_lagrange: bool = True, lagrange_threshold: float = 10.0, min_action: float = -1.0, max_action: float = 1.0, num_repeat_actions: int = 10, alpha_min: float = 0.0, alpha_max: float = 1000000.0, clip_grad: float = 1.0, device: str | torch.device = 'cpu', **kwargs: Any)[source]

Bases: SACPolicy

Implementation of CQL algorithm. arXiv:2006.04779.

Parameters
  • actor (ActorProb) – the actor network following the rules in BasePolicy. (s -> a)

  • actor_optim (torch.optim.Optimizer) – the optimizer for actor network.

  • critic1 (torch.nn.Module) – the first critic network. (s, a -> Q(s, a))

  • critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.

  • critic2 (torch.nn.Module) – the second critic network. (s, a -> Q(s, a))

  • critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.

  • cql_alpha_lr (float) – the learning rate of cql_log_alpha. Default to 1e-4.

  • cql_weight (float) – the value of alpha. Default to 1.0.

  • tau (float) – param for soft update of the target network. Default to 0.005.

  • gamma (float) – discount factor, in [0, 1]. Default to 0.99.

  • alpha ((float, torch.Tensor, torch.optim.Optimizer) or float) – entropy regularization coefficient. Default to 0.2. If a tuple (target_entropy, log_alpha, alpha_optim) is provided, then alpha is automatically tuned.

  • temperature (float) – the value of temperature. Default to 1.0.

  • with_lagrange (bool) – whether to use Lagrange. Default to True.

  • lagrange_threshold (float) – the value of tau in CQL(Lagrange). Default to 10.0.

  • min_action (float) – The minimum value of each dimension of action. Default to -1.0.

  • max_action (float) – The maximum value of each dimension of action. Default to 1.0.

  • num_repeat_actions (int) – The number of times the action is repeated when calculating log-sum-exp. Default to 10.

  • alpha_min (float) – lower bound for clipping cql_alpha. Default to 0.0.

  • alpha_max (float) – upper bound for clipping cql_alpha. Default to 1e6.

  • clip_grad (float) – clip_grad for updating critic network. Default to 1.0.

  • device (Union[str, torch.device]) – which device to create this model on. Default to “cpu”.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).

See also

Please refer to BasePolicy for more detailed explanation.

train(mode: bool = True) CQLPolicy[source]

Set the module in training mode, except for the target network.

sync_weight() None[source]

Soft-update the weight for the target network.

actor_pred(obs: Tensor) tuple[torch.Tensor, torch.Tensor][source]
calc_actor_loss(obs: Tensor) tuple[torch.Tensor, torch.Tensor][source]
calc_pi_values(obs_pi: Tensor, obs_to_pred: Tensor) tuple[torch.Tensor, torch.Tensor][source]
calc_random_values(obs: Tensor, act: Tensor) tuple[torch.Tensor, torch.Tensor][source]
process_fn(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray) RolloutBatchProtocol[source]

Pre-process the data from the provided replay buffer.

Used in update(). Check out policy.process_fn for more information.

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

class tianshou.policy.TD3BCPolicy(actor: ~torch.nn.modules.module.Module, actor_optim: ~torch.optim.optimizer.Optimizer, critic1: ~torch.nn.modules.module.Module, critic1_optim: ~torch.optim.optimizer.Optimizer, critic2: ~torch.nn.modules.module.Module, critic2_optim: ~torch.optim.optimizer.Optimizer, tau: float = 0.005, gamma: float = 0.99, exploration_noise: tianshou.exploration.random.BaseNoise | None = <tianshou.exploration.random.GaussianNoise object>, policy_noise: float = 0.2, update_actor_freq: int = 2, noise_clip: float = 0.5, alpha: float = 2.5, reward_normalization: bool = False, estimation_step: int = 1, **kwargs: ~typing.Any)[source]

Bases: TD3Policy

Implementation of TD3+BC. arXiv:2106.06860.

Parameters
  • actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)

  • actor_optim (torch.optim.Optimizer) – the optimizer for actor network.

  • critic1 (torch.nn.Module) – the first critic network. (s, a -> Q(s, a))

  • critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.

  • critic2 (torch.nn.Module) – the second critic network. (s, a -> Q(s, a))

  • critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.

  • tau (float) – param for soft update of the target network. Default to 0.005.

  • gamma (float) – discount factor, in [0, 1]. Default to 0.99.

  • exploration_noise (BaseNoise) – the exploration noise added to the action. Default to GaussianNoise(sigma=0.1).

  • policy_noise (float) – the noise used in updating policy network. Default to 0.2.

  • update_actor_freq (int) – the update frequency of actor network. Default to 2.

  • noise_clip (float) – the clipping range used in updating policy network. Default to 0.5.

  • alpha (float) – the value of alpha, which controls the weight for TD3 learning relative to behavior cloning. Default to 2.5.

  • reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.

  • action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_space.low, action_space.high]. Default to True.

  • action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action) or empty string for no bounding. Default to “clip”.

  • action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
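The effect of alpha can be sketched with the actor objective from the TD3+BC paper: the Q term is rescaled by alpha divided by the mean absolute Q-value, and a mean-squared behavior-cloning term is added. The code below is an illustrative assumption using scalar actions and plain floats, not the library's code; td3bc_actor_loss is a hypothetical name.

```python
def td3bc_actor_loss(q_values, policy_actions, data_actions, alpha=2.5):
    """Scalar-action sketch of -lambda * mean(Q) + mean((pi(s) - a)^2)."""
    n = len(q_values)
    mean_abs_q = sum(abs(q) for q in q_values) / n  # assumed non-zero
    lmbda = alpha / mean_abs_q  # normalisation factor, treated as a constant
    q_term = -lmbda * sum(q_values) / n
    bc_term = sum((p - d) ** 2 for p, d in zip(policy_actions, data_actions)) / n
    return q_term + bc_term


loss = td3bc_actor_loss(
    q_values=[2.0, 2.0],
    policy_actions=[0.0, 1.0],
    data_actions=[0.0, 0.0],
    alpha=2.0,
)
```

A larger alpha puts more weight on maximising Q, while a smaller alpha keeps the policy closer to the actions in the dataset.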

See also

Please refer to BasePolicy for more detailed explanation.

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

class tianshou.policy.DiscreteBCQPolicy(model: Module, imitator: Module, optim: Optimizer, discount_factor: float = 0.99, estimation_step: int = 1, target_update_freq: int = 8000, eval_eps: float = 0.001, unlikely_action_threshold: float = 0.3, imitation_logits_penalty: float = 0.01, reward_normalization: bool = False, **kwargs: Any)[source]

Bases: DQNPolicy

Implementation of discrete BCQ algorithm. arXiv:1910.01708.

Parameters
  • model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> q_value)

  • imitator (torch.nn.Module) – a model following the rules in BasePolicy. (s -> imitation_logits)

  • optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.

  • discount_factor (float) – in [0, 1]. Default to 0.99.

  • estimation_step (int) – the number of steps to look ahead. Default to 1.

  • target_update_freq (int) – the target network update frequency. Default to 8000.

  • eval_eps (float) – the epsilon-greedy noise added in evaluation. Default to 0.001.

  • unlikely_action_threshold (float) – the threshold (tau) for unlikely actions, as shown in Equ. (17) in the paper. Default to 0.3.

  • imitation_logits_penalty (float) – regularization weight for imitation logits. Default to 1e-2.

  • reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
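As a hedged illustration of unlikely_action_threshold, the sketch below filters out actions whose imitation probability, relative to the most likely action, falls below the threshold, and then acts greedily on Q among the rest. The function name is hypothetical, and whether the comparison is strict at the boundary is an assumption.

```python
import math


def bcq_select_action(q_values, imitation_logits, threshold=0.3):
    """Pick argmax-Q among actions whose probability ratio passes the threshold."""
    top = max(imitation_logits)
    # exp(logit - top) equals p(a) / max_a' p(a') under a softmax over the logits.
    ratios = [math.exp(logit - top) for logit in imitation_logits]
    allowed = [a for a, r in enumerate(ratios) if r >= threshold]
    return max(allowed, key=lambda a: q_values[a])


q = [5.0, 1.0, 2.0]
logits = [-5.0, 0.0, -0.5]
# Action 0 has the highest Q-value but is very unlikely under the imitator.
act = bcq_select_action(q, logits, threshold=0.3)
```

The most likely action always has ratio 1, so the allowed set is never empty for thresholds in (0, 1].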

See also

Please refer to BasePolicy for more detailed explanation.

train(mode: bool = True) DiscreteBCQPolicy[source]

Set the module in training mode, except for the target network.

forward(batch: RolloutBatchProtocol, state: dict | tianshou.data.batch.Batch | numpy.ndarray | None = None, input: str = 'obs', **kwargs: Any) ImitationBatchProtocol[source]

Compute action over the given batch data.

If you need to mask the action, add a “mask” entry to batch.obs. For example, for an environment with three actions “0/1/2”:

batch == Batch(
    obs=Batch(
        obs="original obs, with batch_size=1 for demonstration",
        mask=np.array([[False, True, False]]),
        # action 1 is available
        # action 0 and 2 are unavailable
    ),
    ...
)
Returns

A Batch which has 3 keys:

  • act the action.

  • logits the network’s raw output.

  • state the hidden state.

See also

Please refer to forward() for more detailed explanation.
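The action mask described for forward() can be pictured as restricting the greedy choice to available actions. The following is a simplified sketch under that assumption; the library may implement masking differently (for example by shifting the logits), and masked_greedy_action is a hypothetical helper.

```python
def masked_greedy_action(q_values, mask):
    """Return the best available action; mask[a] is True when action a is available."""
    available = [a for a, ok in enumerate(mask) if ok]
    if not available:
        raise ValueError("mask must allow at least one action")
    return max(available, key=lambda a: q_values[a])


# Same mask as in the example above: only action 1 is available.
act = masked_greedy_action([3.0, 1.0, 2.0], [False, True, False])
```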

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

class tianshou.policy.DiscreteCQLPolicy(model: Module, optim: Optimizer, discount_factor: float = 0.99, num_quantiles: int = 200, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, min_q_weight: float = 10.0, **kwargs: Any)[source]

Bases: QRDQNPolicy

Implementation of discrete Conservative Q-Learning algorithm. arXiv:2006.04779.

Parameters
  • model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)

  • optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.

  • discount_factor (float) – in [0, 1]. Default to 0.99.

  • num_quantiles (int) – the number of quantile midpoints in the inverse cumulative distribution function of the value. Default to 200.

  • estimation_step (int) – the number of steps to look ahead. Default to 1.

  • target_update_freq (int) – the target network update frequency (0 if you do not use the target network). Default to 0.

  • reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.

  • min_q_weight (float) – the weight for the cql loss. Default to 10.0.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
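To make the role of min_q_weight concrete, here is a sketch of the discrete CQL regularizer as it is commonly written: the log-sum-exp of the Q-values over all actions minus the Q-value of the action stored in the dataset, averaged over the batch and added to the base loss. Function names are hypothetical and lists of floats stand in for tensors.

```python
import math


def cql_regularizer(q_rows, actions):
    """Mean over the batch of logsumexp_a Q(s, a) - Q(s, a_data)."""
    total = 0.0
    for q, a in zip(q_rows, actions):
        top = max(q)
        lse = top + math.log(sum(math.exp(v - top) for v in q))  # stable log-sum-exp
        total += lse - q[a]
    return total / len(q_rows)


def total_loss(td_loss, q_rows, actions, min_q_weight=10.0):
    """Base TD (or quantile) loss plus the weighted conservative penalty."""
    return td_loss + min_q_weight * cql_regularizer(q_rows, actions)


penalty = cql_regularizer([[0.0, 0.0]], [0])
```

The penalty is never negative, because the log-sum-exp is at least as large as any single Q-value, so it pushes down Q-values of actions that are not supported by the data.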

See also

Please refer to QRDQNPolicy for more detailed explanation.

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

class tianshou.policy.DiscreteCRRPolicy(actor: Module, critic: Module, optim: Optimizer, discount_factor: float = 0.99, policy_improvement_mode: str = 'exp', ratio_upper_bound: float = 20.0, beta: float = 1.0, min_q_weight: float = 10.0, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)[source]

Bases: PGPolicy

Implementation of discrete Critic Regularized Regression. arXiv:2006.15134.

Parameters
  • actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)

  • critic (torch.nn.Module) – the action-value critic (i.e., Q function) network. (s -> Q(s, *))

  • optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.

  • discount_factor (float) – in [0, 1]. Default to 0.99.

  • policy_improvement_mode (str) – type of the weight function f. Possible values: “binary”/“exp”/“all”. Default to “exp”.

  • ratio_upper_bound (float) – when policy_improvement_mode is “exp”, the value of the exp function is upper-bounded by this parameter. Default to 20.

  • beta (float) – when policy_improvement_mode is “exp”, this is the denominator of the exp function. Default to 1.

  • min_q_weight (float) – weight for CQL loss/regularizer. Default to 10.

  • target_update_freq (int) – the target network update frequency (0 if you do not use the target network). Default to 0.

  • reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
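The three policy_improvement_mode values can be sketched as a weight function f applied to the advantage, following the parameter descriptions above. This is an illustration under those descriptions, with a hypothetical function name, not the library's code.

```python
import math


def crr_weight(advantage, mode="exp", beta=1.0, ratio_upper_bound=20.0):
    """Weight applied to the log-likelihood of a dataset action."""
    if mode == "binary":
        return 1.0 if advantage > 0 else 0.0  # keep only better-than-average actions
    if mode == "exp":
        # exp(A / beta), capped from above by ratio_upper_bound
        return min(math.exp(advantage / beta), ratio_upper_bound)
    if mode == "all":
        return 1.0  # plain behavior cloning: every action has equal weight
    raise ValueError(f"unknown policy_improvement_mode: {mode}")


weights = [crr_weight(a, mode="exp") for a in (-1.0, 0.0, 100.0)]
```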

See also

Please refer to PGPolicy for more detailed explanation.

sync_weight() None[source]
learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

class tianshou.policy.GAILPolicy(actor: Module, critic: Module, optim: Optimizer, dist_fn: Callable[[torch.Tensor | tuple[torch.Tensor]], Distribution], expert_buffer: ReplayBuffer, disc_net: Module, disc_optim: Optimizer, disc_update_num: int = 4, eps_clip: float = 0.2, dual_clip: float | None = None, value_clip: bool = False, advantage_normalization: bool = True, recompute_advantage: bool = False, **kwargs: Any)[source]

Bases: PPOPolicy

Implementation of Generative Adversarial Imitation Learning. arXiv:1606.03476.

Parameters
  • actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)

  • critic (torch.nn.Module) – the critic network. (s -> V(s))

  • optim (torch.optim.Optimizer) – the optimizer for actor and critic network.

  • dist_fn – distribution class for computing the action.

  • expert_buffer (ReplayBuffer) – the replay buffer containing expert experience.

  • disc_net (torch.nn.Module) – the discriminator network, whose input dimension equals the state dimension plus the action dimension and whose output dimension is 1.

  • disc_optim (torch.optim.Optimizer) – the optimizer for the discriminator network.

  • disc_update_num (int) – the number of discriminator grad steps per model grad step. Default to 4.

  • discount_factor (float) – in [0, 1]. Default to 0.99.

  • eps_clip (float) – \(\epsilon\) in \(L_{CLIP}\) in the original paper. Default to 0.2.

  • dual_clip (float) – a parameter c mentioned in arXiv:1912.09729 Equ. 5, where c > 1 is a constant indicating the lower bound. Default to None (dual clipping is not used).

  • value_clip (bool) – a parameter mentioned in arXiv:1811.02553 Sec. 4.1. Default to False.

  • advantage_normalization (bool) – whether to do per mini-batch advantage normalization. Default to True.

  • recompute_advantage (bool) – whether to recompute advantage every update repeat according to https://arxiv.org/pdf/2006.05990.pdf Sec. 3.5. Default to False.

  • vf_coef (float) – weight for value loss. Default to 0.5.

  • ent_coef (float) – weight for entropy loss. Default to 0.01.

  • max_grad_norm (float) – clipping gradients in back propagation. Default to None.

  • gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation. Default to 0.95.

  • reward_normalization (bool) – normalize estimated values to have std close to 1, also normalize the advantage to Normal(0, 1). Default to False.

  • max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 256.

  • action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_space.low, action_space.high]. Default to True.

  • action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.

  • action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).

  • deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
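A per-sample sketch of how eps_clip and dual_clip shape the PPO surrogate objective that GAIL inherits, based on the formulas in the cited papers. The function name is hypothetical; ratio denotes the new-to-old action probability ratio, and the value is an objective to be maximised (the loss is its negative mean).

```python
def ppo_clip_objective(ratio, advantage, eps_clip=0.2, dual_clip=None):
    """Clipped surrogate objective for one sample."""
    clipped_ratio = max(1.0 - eps_clip, min(1.0 + eps_clip, ratio))
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    if dual_clip is not None and advantage < 0:
        # Dual clip: bound the objective from below by c * A when A < 0.
        surrogate = max(surrogate, dual_clip * advantage)
    return surrogate


# A large ratio with a negative advantage is unbounded without dual clipping.
plain = ppo_clip_objective(10.0, -1.0)
dual = ppo_clip_objective(10.0, -1.0, dual_clip=5.0)
```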

See also

Please refer to PPOPolicy for more detailed explanation.

process_fn(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray) LogpOldProtocol[source]

Pre-process the data from the provided replay buffer.

Used in update(). Check out policy.process_fn for more information.

disc(batch: RolloutBatchProtocol) Tensor[source]

learn(batch: RolloutBatchProtocol, batch_size: int, repeat: int, **kwargs: Any) dict[str, list[float]][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
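
The shape pitfall in the warning above can be reproduced in a few lines. The sketch uses NumPy arrays as a stand-in for torch tensors, on the understanding that both follow the same broadcasting rules; the array names and the batch size are illustrative assumptions.

```python
import numpy as np

batch_size = 4
# Stand-ins for log-probabilities with the two shapes named in the warning.
logp_categorical = np.zeros(batch_size)       # shape (4,)
logp_normal = np.zeros((batch_size, 1))       # shape (4, 1)

# Broadcasting silently expands the result instead of raising an error.
mismatched = logp_categorical - logp_normal
print(mismatched.shape)  # (4, 4)

# Flattening one operand first keeps the result per-sample.
fixed = logp_categorical - logp_normal.reshape(-1)
print(fixed.shape)  # (4,)
```

A loss averaged over the (4, 4) result would still be a scalar, so the mistake produces no error message, only wrong gradients.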

Model-based

class tianshou.policy.PSRLPolicy(trans_count_prior: ndarray, rew_mean_prior: ndarray, rew_std_prior: ndarray, discount_factor: float = 0.99, epsilon: float = 0.01, add_done_loop: bool = False, **kwargs: Any)[source]

Bases: BasePolicy

Implementation of Posterior Sampling Reinforcement Learning.

Reference: Strens, M. A Bayesian framework for reinforcement learning. ICML, 2000, pp. 943-950.

Parameters
  • trans_count_prior (np.ndarray) – Dirichlet prior (alphas), with shape (n_state, n_action, n_state).

  • rew_mean_prior (np.ndarray) – means of the normal priors of rewards, with shape (n_state, n_action).

  • rew_std_prior (np.ndarray) – standard deviations of the normal priors of rewards, with shape (n_state, n_action).

  • discount_factor (float) – in [0, 1].

  • epsilon (float) – for precision control in value iteration.

  • add_done_loop (bool) – whether to add an extra self-loop for the terminal state in MDP. Default to False.

See also

Please refer to BasePolicy for more detailed explanation.
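
To make the roles of the priors, discount_factor and epsilon concrete, the following is a minimal NumPy sketch of one posterior-sampling step: draw an MDP from the priors, then plan on it with value iteration. The function name sample_mdp_and_plan and the toy sizes are assumptions for illustration; this is not the library's implementation.

```python
import numpy as np


def sample_mdp_and_plan(trans_count, rew_mean, rew_std,
                        discount_factor=0.99, epsilon=0.01, rng=None):
    """Sample one MDP from the priors and solve it by value iteration."""
    rng = np.random.default_rng() if rng is None else rng
    n_state, n_action, _ = trans_count.shape
    # Transition probabilities: one Dirichlet draw per (state, action) pair.
    trans_prob = np.empty_like(trans_count, dtype=float)
    for s in range(n_state):
        for a in range(n_action):
            trans_prob[s, a] = rng.dirichlet(trans_count[s, a])
    # Rewards: one normal draw per (state, action) pair.
    rew = rng.normal(rew_mean, rew_std)
    # Value iteration until the largest update falls below epsilon.
    value = np.zeros(n_state)
    while True:
        q = rew + discount_factor * trans_prob @ value  # shape (n_state, n_action)
        new_value = q.max(axis=1)
        converged = np.max(np.abs(new_value - value)) < epsilon
        value = new_value
        if converged:
            break
    return q.argmax(axis=1), value


n_state, n_action = 3, 2
trans_count_prior = np.ones((n_state, n_action, n_state))
rew_mean_prior = np.zeros((n_state, n_action))
rew_std_prior = np.ones((n_state, n_action))
greedy_act, value = sample_mdp_and_plan(
    trans_count_prior, rew_mean_prior, rew_std_prior,
    discount_factor=0.9, rng=np.random.default_rng(0),
)
print(greedy_act.shape, value.shape)  # (3,) (3,)
```

The loop is only guaranteed to terminate for a discount factor strictly below 1, where the value-iteration update is a contraction.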

forward(batch: RolloutBatchProtocol, state: dict | tianshou.data.batch.BatchProtocol | numpy.ndarray | None = None, **kwargs: Any) ActBatchProtocol[source]

Compute action over the given batch data with PSRL model.

Returns

A Batch with “act” key containing the action.

See also

Please refer to forward() for more detailed explanation.

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

class tianshou.policy.ICMPolicy(policy: BasePolicy, model: IntrinsicCuriosityModule, optim: Optimizer, lr_scale: float, reward_scale: float, forward_loss_weight: float, **kwargs: Any)[source]

Bases: BasePolicy

Implementation of Intrinsic Curiosity Module. arXiv:1705.05363.

Parameters
  • policy (BasePolicy) – a base policy to add ICM to.

  • model (IntrinsicCuriosityModule) – the ICM model.

  • optim (torch.optim.Optimizer) – a torch.optim optimizer for optimizing the model.

  • lr_scale (float) – the scaling factor for ICM learning.

  • reward_scale (float) – the scaling factor for the intrinsic reward.

  • forward_loss_weight (float) – the weight for forward model loss.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).

See also

Please refer to BasePolicy for more detailed explanation.
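
The roles of reward_scale, forward_loss_weight and lr_scale can be sketched following the formulation in arXiv:1705.05363. How the library combines these terms internally is an assumption here, and the function names are hypothetical.

```python
def intrinsic_reward(pred_next_feat, next_feat, reward_scale):
    """Curiosity bonus: scaled half squared error of the forward model."""
    sq_err = sum((p - t) ** 2 for p, t in zip(pred_next_feat, next_feat))
    return reward_scale * 0.5 * sq_err


def icm_loss(inverse_loss, forward_loss, forward_loss_weight, lr_scale):
    """Weighted mix of inverse- and forward-model losses, scaled by lr_scale."""
    mixed = (1.0 - forward_loss_weight) * inverse_loss + forward_loss_weight * forward_loss
    return lr_scale * mixed


# A poorly predicted transition earns a larger bonus than a well predicted one.
bonus_novel = intrinsic_reward([1.0, 2.0], [0.0, 0.0], reward_scale=0.01)
bonus_known = intrinsic_reward([0.1, 0.0], [0.0, 0.0], reward_scale=0.01)
loss = icm_loss(inverse_loss=1.0, forward_loss=3.0, forward_loss_weight=0.2, lr_scale=1.0)
print(bonus_novel > bonus_known)  # True
```

The bonus is added to the environment reward before the wrapped policy learns, so reward_scale controls how strongly curiosity competes with the task reward.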

train(mode: bool = True) ICMPolicy[source]

Set the module in training mode.

forward(batch: RolloutBatchProtocol, state: dict | tianshou.data.batch.BatchProtocol | numpy.ndarray | None = None, **kwargs: Any) BatchProtocol[source]

Compute action over the given batch data by inner policy.

See also

Please refer to forward() for more detailed explanation.

exploration_noise(act: numpy.ndarray | tianshou.data.batch.BatchProtocol, batch: RolloutBatchProtocol) numpy.ndarray | tianshou.data.batch.BatchProtocol[source]

Modify the action from policy.forward with exploration noise.

NOTE: currently does not add any noise! Needs to be overridden by subclasses to actually do something.

Parameters
  • act – a data batch or numpy.ndarray which is the action taken by policy.forward.

  • batch – the input batch for policy.forward, kept for advanced usage.

Returns

action in the same form as the input “act” but with added exploration noise.

set_eps(eps: float) None[source]

Set the eps for epsilon-greedy exploration.

process_fn(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray) RolloutBatchProtocol[source]

Pre-process the data from the provided replay buffer.

Used in update(). Check out policy.process_fn for more information.

post_process_fn(batch: BatchProtocol, buffer: ReplayBuffer, indices: ndarray) None[source]

Post-process the data from the provided replay buffer.

Typical usage is to update the sampling weight in prioritized experience replay. Used in update().

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float][source]

Update policy with a given batch of data.

Returns

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

Multi-agent

class tianshou.policy.MultiAgentPolicyManager(policies: list[tianshou.policy.base.BasePolicy], env: PettingZooEnv, **kwargs: Any)[source]

Bases: BasePolicy

Multi-agent policy manager for MARL.

This multi-agent policy manager accepts a list of BasePolicy. It dispatches the batch data to each of these policies when “forward” is called. The same applies to “process_fn” and “learn”: the manager splits the data and feeds it to each policy. A figure in Multi-Agent Reinforcement Learning can help you better understand this procedure.

replace_policy(policy: BasePolicy, agent_id: int) None[source]

Replace the “agent_id”th policy in this manager.

process_fn(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indice: ndarray) BatchProtocol[source]

Dispatch batch data from obs.agent_id to every policy’s process_fn.

Save original multi-dimensional rew in “save_rew”, set rew to the reward of each agent during their “process_fn”, and restore the original reward afterwards.
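
The save, overwrite and restore sequence described above can be sketched with plain dictionaries and lists. The function name process_per_agent and the toy batch are assumptions for illustration; the sketch leaves out the splitting by obs.agent_id and is not the library's code.

```python
def process_per_agent(batch, process_fns):
    """batch["rew"] holds one reward row per step, with one column per agent."""
    batch["save_rew"] = batch["rew"]  # keep the multi-dimensional reward
    results = {}
    for column, (agent_id, process_fn) in enumerate(process_fns.items()):
        # Each policy sees only its own scalar reward per step.
        batch["rew"] = [row[column] for row in batch["save_rew"]]
        results[agent_id] = process_fn(batch)
    batch["rew"] = batch.pop("save_rew")  # restore the original reward
    return results


batch = {"rew": [[1.0, -1.0], [0.5, -0.5]]}
process_fns = {
    "agent_1": lambda b: sum(b["rew"]),
    "agent_2": lambda b: sum(b["rew"]),
}
results = process_per_agent(batch, process_fns)
print(results)       # {'agent_1': 1.5, 'agent_2': -1.5}
print(batch["rew"])  # [[1.0, -1.0], [0.5, -0.5]]
```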

exploration_noise(act: numpy.ndarray | tianshou.data.batch.BatchProtocol, batch: RolloutBatchProtocol) numpy.ndarray | tianshou.data.batch.BatchProtocol[source]

Add exploration noise from sub-policy onto act.

forward(batch: Batch, state: dict | tianshou.data.batch.Batch | None = None, **kwargs: Any) Batch[source]

Dispatch batch data from obs.agent_id to every policy’s forward.

Parameters

state – if None, no agent has a hidden state. Otherwise it should contain the keys “agent_1”, “agent_2”, …

Returns

a Batch with the following contents:

{
    "act": actions corresponding to the input
    "state": {
        "agent_1": output state of agent_1's policy for the state
        "agent_2": xxx
        ...
        "agent_n": xxx}
    "out": {
        "agent_1": output of agent_1's policy for the input
        "agent_2": xxx
        ...
        "agent_n": xxx}
}
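
The dispatch-and-merge pattern behind this return value can be sketched in plain Python. Policies are modelled as simple callables and the batch as a list, which are simplifying assumptions; the function name dispatch_forward is hypothetical and hidden states are left out.

```python
def dispatch_forward(policies, obs_list, agent_ids):
    """Route each observation to its agent's policy and merge the actions.

    policies maps an agent id to a callable taking a list of observations
    and returning a list of actions of the same length.
    """
    acts = [None] * len(obs_list)
    out = {}
    for agent_id, policy in policies.items():
        # Positions in the batch that belong to this agent.
        idx = [i for i, a in enumerate(agent_ids) if a == agent_id]
        if not idx:
            continue
        agent_acts = policy([obs_list[i] for i in idx])
        out[agent_id] = agent_acts
        # Write the actions back to their original positions.
        for i, act in zip(idx, agent_acts):
            acts[i] = act
    return {"act": acts, "out": out}


policies = {
    "agent_1": lambda obs: [o + 1 for o in obs],
    "agent_2": lambda obs: [o * 10 for o in obs],
}
result = dispatch_forward(
    policies, [1, 2, 3, 4], ["agent_1", "agent_2", "agent_1", "agent_2"]
)
print(result["act"])  # [2, 20, 4, 40]
print(result["out"])  # {'agent_1': [2, 4], 'agent_2': [20, 40]}
```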

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) dict[str, float | list[float]][source]

Dispatch the data to all policies for learning.

Returns

a dict with the following contents:

{
    "agent_1/item1": item 1 of agent_1's policy.learn output
    "agent_1/item2": item 2 of agent_1's policy.learn output
    "agent_2/xxx": xxx
    ...
    "agent_n/xxx": xxx
}
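
The key layout above amounts to prefixing each policy's result keys with its agent id. A minimal sketch, assuming each policy's learn output is a plain dict (the function name merge_learn_results is hypothetical):

```python
def merge_learn_results(per_agent_results):
    """Flatten {agent_id: {key: value}} into {"agent_id/key": value}."""
    merged = {}
    for agent_id, result in per_agent_results.items():
        for key, value in result.items():
            merged[f"{agent_id}/{key}"] = value
    return merged


merged = merge_learn_results({
    "agent_1": {"loss": 0.5},
    "agent_2": {"loss": 0.25, "loss/clip": [0.1]},
})
print(merged)
# {'agent_1/loss': 0.5, 'agent_2/loss': 0.25, 'agent_2/loss/clip': [0.1]}
```

Prefixing keeps the combined result flat, so identically named statistics from different agents do not overwrite each other.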