tianshou.policy
Base
class tianshou.policy.BasePolicy(observation_space: Optional[gym.spaces.space.Space] = None, action_space: Optional[gym.spaces.space.Space] = None, action_scaling: bool = False, action_bound_method: str = '')
Bases: abc.ABC, torch.nn.modules.module.Module
The base class for any RL policy.
Tianshou aims to modularize RL algorithms and provides several classes of policies. Every policy class must inherit from BasePolicy.
A policy class typically has the following parts:
__init__(): initialize the policy, including copying the target network and so on;
forward(): compute the action from a given observation;
process_fn(): pre-process data from the replay buffer (this function can interact with the replay buffer);
learn(): update the policy with a given batch of data;
post_process_fn(): update the replay buffer from the learning process (e.g., a prioritized replay buffer needs to update its weights);
update(): the main interface for training, i.e., process_fn -> learn -> post_process_fn.
Most policies need a neural network to predict the action and an optimizer to optimize the policy. The rules for self-defined networks are:
Input: the observation “obs” (may be a numpy.ndarray, a torch.Tensor, a dict, or any other type), the hidden state “state” (for RNN usage), and other information “info” provided by the environment.
Output: some “logits”, the next hidden state “state”, and the intermediate result of the policy forward pass, “policy”. The “logits” could be a tuple instead of a torch.Tensor, depending on how the policy processes the network output. For example, in PPO the network might return (mu, sigma), state for a Gaussian policy. The “policy” can be a Batch of torch.Tensor or other objects; it is stored in the replay buffer and can be accessed during the policy update (e.g., in “policy.learn()”, “batch.policy” is what you need).
Since BasePolicy inherits from torch.nn.Module, you can use BasePolicy almost the same as torch.nn.Module, for instance, loading and saving the model:
    torch.save(policy.state_dict(), "policy.pth")
    policy.load_state_dict(torch.load("policy.pth"))
exploration_noise(act: Union[numpy.ndarray, tianshou.data.batch.Batch], batch: tianshou.data.batch.Batch) → Union[numpy.ndarray, tianshou.data.batch.Batch]
Modify the action from policy.forward with exploration noise.
- Parameters
act – a data batch or numpy.ndarray which is the action taken by policy.forward.
batch – the input batch for policy.forward, kept for advanced usage.
- Returns
action in the same form of input “act” but with added exploration noise.
abstract forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, **kwargs: Any) → tianshou.data.batch.Batch
Compute action over the given batch data.
- Returns
A Batch which MUST have the following keys:
act: a numpy.ndarray or a torch.Tensor, the action over the given batch data.
state: a dict, a numpy.ndarray or a torch.Tensor, the internal state of the policy; None by default.
Other keys are user-defined and depend on the algorithm. For example,
    # some code
    return Batch(logits=..., act=..., state=None, dist=...)
The keyword policy is reserved, and the corresponding data will be stored in the replay buffer. For instance,
    # some code
    return Batch(..., policy=Batch(log_prob=dist.log_prob(act)))
    # and in the sampled data batch, you can directly use
    # batch.policy.log_prob to get your data.
Note
In a continuous action space, you should apply another step, “map_action”, to get the real action:
    act = policy(batch).act  # doesn't map to the target action range
    act = policy.map_action(act)
map_action(act: Union[tianshou.data.batch.Batch, numpy.ndarray]) → Union[tianshou.data.batch.Batch, numpy.ndarray]
Map raw network output to the action range in gym’s env.action_space.
This function is called in collect() and only affects the action sent to the env. The remapped action is not stored in the buffer and thus can be viewed as part of the env (a black-box action transformation).
Action mapping includes 2 standard procedures: bounding and scaling. The bounding procedure expects the original action range to be (-inf, inf) and maps it to [-1, 1], while the scaling procedure expects the original action range to be (-1, 1) and maps it to [action_space.low, action_space.high]. The bounding procedure is applied first.
- Parameters
act – a data batch or numpy.ndarray which is the action taken by policy.forward.
- Returns
action in the same form of input “act” but remap to the target action space.
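The bounding and scaling procedures can be sketched in plain Python for a single scalar action. This is an illustrative stand-in rather than Tianshou's implementation: the helper name `map_action_sketch` and its arguments are hypothetical, and the scaling step assumes the usual affine map from [-1, 1] to [low, high].

```python
import math


def map_action_sketch(act, low, high, bound_method="clip", scaling=True):
    """Bound a raw scalar action to [-1, 1], then scale it to [low, high]."""
    # Step 1: bounding (applied first).
    if bound_method == "clip":
        act = max(-1.0, min(1.0, act))
    elif bound_method == "tanh":
        act = math.tanh(act)
    # An empty string means no bounding, mirroring action_bound_method=''.

    # Step 2: scaling from [-1, 1] to [low, high] (assumed affine map).
    if scaling:
        act = low + (high - low) * (act + 1.0) / 2.0
    return act


print(map_action_sketch(3.0, low=0.0, high=10.0))  # clipped to 1, scaled to 10.0
print(map_action_sketch(0.0, low=0.0, high=10.0))  # midpoint of the range, 5.0
```

Because the mapping is applied only to the action sent to the environment, the raw value is what a replay buffer would keep.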
process_fn(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indice: numpy.ndarray) → tianshou.data.batch.Batch
Pre-process the data from the provided replay buffer.
Used in update(). Check out policy.process_fn for more information.
abstract learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, Any]
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
post_process_fn(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indice: numpy.ndarray) → None
Post-process the data from the provided replay buffer.
Typical usage is to update the sampling weight in prioritized experience replay. Used in update().
update(sample_size: int, buffer: Optional[tianshou.data.buffer.base.ReplayBuffer], **kwargs: Any) → Dict[str, Any]
Update the policy network and replay buffer.
It includes 3 function steps: process_fn, learn, and post_process_fn. In addition, this function changes the value of self.updating: it is False before this function is called and True while update() is executing. Please refer to States for policy for more detailed explanation.
- Parameters
sample_size (int) – 0 means it will extract all the data from the buffer, otherwise it will sample a batch with given sample_size.
buffer (ReplayBuffer) – the corresponding replay buffer.
- Returns
A dict, including the data needed to be logged (e.g., loss) from policy.learn().
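The call order inside update() and the toggling of the updating flag can be illustrated with a toy class. Everything here is hypothetical (`ToyPolicy`, the list used as a buffer, and the way a batch is taken from it); it only mimics the sequence process_fn -> learn -> post_process_fn described above and is not Tianshou code.

```python
class ToyPolicy:
    """Hypothetical stand-in that mimics the update() call order only."""

    def __init__(self):
        self.updating = False
        self.calls = []

    def process_fn(self, batch):
        self.calls.append("process_fn")
        return batch

    def learn(self, batch):
        self.calls.append("learn")
        # self.updating is True while update() is executing.
        return {"loss": 0.0, "updating_during_learn": self.updating}

    def post_process_fn(self, batch):
        self.calls.append("post_process_fn")

    def update(self, sample_size, buffer):
        # sample_size == 0 means "use all data"; otherwise this toy simply
        # takes the first sample_size items instead of sampling randomly.
        batch = list(buffer) if sample_size == 0 else list(buffer)[:sample_size]
        self.updating = True
        batch = self.process_fn(batch)
        result = self.learn(batch)
        self.post_process_fn(batch)
        self.updating = False
        return result


policy = ToyPolicy()
result = policy.update(sample_size=0, buffer=[1, 2, 3])
print(policy.calls)
print(result["updating_during_learn"], policy.updating)
```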
static value_mask(buffer: tianshou.data.buffer.base.ReplayBuffer, indice: numpy.ndarray) → numpy.ndarray
Value mask determines whether the obs_next of buffer[indice] is valid.
For instance, the “obs_next” after a “done” flag is usually considered invalid; its q/advantage value can provide meaningless (even misleading) information and should be set to 0 by hand. But if the “done” flag is generated because of the time limit on game length (info[“TimeLimit.truncated”] is set to True in gym’s settings), “obs_next” is instead valid. The value mask is typically used to assist in calculating the correct q/advantage value.
- Parameters
buffer (ReplayBuffer) – the corresponding replay buffer.
indice (numpy.ndarray) – indices of replay buffer whose “obs_next” will be judged.
- Returns
A bool-type numpy.ndarray with the same shape as indice. “True” means the “obs_next” of that buffer[indice] is valid.
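The rule described above can be written as a small predicate. This is a sketch of the stated semantics only, using plain Python lists; the function name `value_mask_sketch` and the separate `truncated` list are assumptions made for illustration and do not reflect how the buffer stores this information.

```python
def value_mask_sketch(dones, truncated):
    """Return True where obs_next is still valid for bootstrapping."""
    # obs_next is invalid only when the episode really terminated,
    # i.e. done is set and the done was not caused by a time limit.
    return [(not d) or t for d, t in zip(dones, truncated)]


dones = [False, True, True]
truncated = [False, False, True]
mask = value_mask_sketch(dones, truncated)
print(mask)  # [True, False, True]
```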
static compute_episodic_return(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indice: numpy.ndarray, v_s_: Optional[Union[numpy.ndarray, torch.Tensor]] = None, v_s: Optional[Union[numpy.ndarray, torch.Tensor]] = None, gamma: float = 0.99, gae_lambda: float = 0.95) → Tuple[numpy.ndarray, numpy.ndarray]
Compute returns over given batch.
Use the implementation of Generalized Advantage Estimator (arXiv:1506.02438) to calculate the q/advantage value of the given batch.
- Parameters
batch (Batch) – a data batch which contains several episodes of data in sequential order. Mind that the end of each finished episode of batch should be marked by done flag, unfinished (or collecting) episodes will be recognized by buffer.unfinished_index().
indice (numpy.ndarray) – tell batch’s location in buffer, batch is equal to buffer[indice].
v_s_ (np.ndarray) – the value function of all next states \(V(s')\).
gamma (float) – the discount factor, should be in [0, 1]. Default to 0.99.
gae_lambda (float) – the parameter for Generalized Advantage Estimation, should be in [0, 1]. Default to 0.95.
- Returns
Two numpy arrays (returns, advantage), each with shape (bsz, ).
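As a rough illustration of Generalized Advantage Estimation, the following plain-Python sketch computes advantages by a backward recursion over one sequential chunk of transitions. The function `gae_sketch` is hypothetical; it assumes that bootstrapping stops at every done flag (it ignores time-limit truncation) and that the returns are obtained as advantage plus V(s).

```python
def gae_sketch(rewards, dones, v_s, v_s_, gamma=0.99, gae_lambda=0.95):
    """Return (returns, advantages) for one sequential chunk of transitions."""
    n = len(rewards)
    advantages = [0.0] * n
    running = 0.0
    for t in reversed(range(n)):
        not_done = 0.0 if dones[t] else 1.0
        # TD residual: r_t + gamma * V(s_{t+1}) - V(s_t), no bootstrap after done.
        delta = rewards[t] + gamma * v_s_[t] * not_done - v_s[t]
        # Exponentially weighted sum of residuals, reset at episode boundaries.
        running = delta + gamma * gae_lambda * not_done * running
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, v_s)]
    return returns, advantages


zeros = [0.0, 0.0, 0.0]
returns, advantages = gae_sketch(
    rewards=[1.0, 1.0, 1.0],
    dones=[False, False, True],
    v_s=zeros,
    v_s_=zeros,
    gamma=0.5,
    gae_lambda=1.0,
)
print(advantages)  # [1.75, 1.5, 1.0]
```

With a zero value function and gae_lambda = 1, the advantages reduce to the discounted returns, which makes the result easy to check by hand.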
static compute_nstep_return(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indice: numpy.ndarray, target_q_fn: Callable[[tianshou.data.buffer.base.ReplayBuffer, numpy.ndarray], torch.Tensor], gamma: float = 0.99, n_step: int = 1, rew_norm: bool = False) → tianshou.data.batch.Batch
Compute n-step return for Q-learning targets.
\[G_t = \sum_{i = t}^{t + n - 1} \gamma^{i - t}(1 - d_i)r_i + \gamma^n (1 - d_{t + n}) Q_{\mathrm{target}}(s_{t + n})\]
where \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\), and \(d_t\) is the done flag of step \(t\).
- Parameters
batch (Batch) – a data batch, which is equal to buffer[indice].
buffer (ReplayBuffer) – the data buffer.
target_q_fn (function) – a function which computes the target Q value of “obs_next” given the data buffer and the wanted indices.
gamma (float) – the discount factor, should be in [0, 1]. Default to 0.99.
n_step (int) – the number of estimation steps, should be an int greater than 0. Default to 1.
rew_norm (bool) – normalize the reward to Normal(0, 1). Default to False.
- Returns
a Batch. The result will be stored in batch.returns as a torch.Tensor with the same shape as target_q_fn’s return tensor.
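A minimal sketch of an n-step target for a single start index, written in plain Python. The helper `nstep_return` and the list `q_next` (holding the target Q estimate for the state reached after each step) are hypothetical. The sketch assumes one common reading of the done handling: rewards are accumulated up to and including the terminal step, and the bootstrap term is dropped if the episode ends within the n steps. It also assumes t + n_step - 1 is a valid index.

```python
def nstep_return(rewards, dones, q_next, t, gamma=0.99, n_step=1):
    """n-step return starting at index t of one trajectory."""
    ret = 0.0
    for k in range(n_step):
        i = t + k
        ret += (gamma ** k) * rewards[i]
        if dones[i]:
            # Episode ended inside the window: no bootstrap term.
            return ret
    # q_next[i] is the target Q estimate for the state reached after step i.
    return ret + (gamma ** n_step) * q_next[t + n_step - 1]


rewards = [1.0, 1.0, 1.0]
dones = [False, False, True]
q_next = [4.0, 4.0, 4.0]
print(nstep_return(rewards, dones, q_next, t=0, gamma=0.5, n_step=2))  # 2.5
print(nstep_return(rewards, dones, q_next, t=1, gamma=0.5, n_step=2))  # 1.5
```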
training: bool
class tianshou.policy.RandomPolicy(observation_space: Optional[gym.spaces.space.Space] = None, action_space: Optional[gym.spaces.space.Space] = None, action_scaling: bool = False, action_bound_method: str = '')
Bases: tianshou.policy.base.BasePolicy
A random agent used in multi-agent learning.
It randomly chooses an action from the legal actions.
forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, **kwargs: Any) → tianshou.data.batch.Batch
Compute the random action over the given batch data.
The input should contain a mask in batch.obs, with “True” meaning available and “False” meaning unavailable. For example, batch.obs.mask == np.array([[False, True, False]]) means that, with batch size 1, action “1” is available but actions “0” and “2” are unavailable.
- Returns
A Batch with “act” key, containing the random action.
See also
Please refer to forward() for more detailed explanation.
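Choosing a random action under a mask can be sketched without any library. The helper `random_legal_actions` is hypothetical and works on nested Python lists rather than on a Batch; it samples uniformly among the entries marked True in each row.

```python
import random


def random_legal_actions(mask, rng=random):
    """Pick one available action index per row of a boolean mask."""
    actions = []
    for row in mask:
        legal = [i for i, ok in enumerate(row) if ok]
        actions.append(rng.choice(legal))
    return actions


print(random_legal_actions([[False, True, False]]))  # [1], the only legal action
```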
learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float]
Since a random agent learns nothing, it returns an empty dict.
training: bool
Model-free
DQN Family
class tianshou.policy.DQNPolicy(model: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, is_double: bool = True, **kwargs: Any)
Bases: tianshou.policy.base.BasePolicy
Implementation of Deep Q Network. arXiv:1312.5602.
Implementation of Double Q-Learning. arXiv:1509.06461.
Implementation of Dueling DQN. arXiv:1511.06581 (the dueling DQN is implemented in the network side, not here).
- Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network). Default to 0.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
is_double (bool) – use double dqn. Default to True.
See also
Please refer to BasePolicy for more detailed explanation.
train(mode: bool = True) → tianshou.policy.modelfree.dqn.DQNPolicy
Set the module in training mode, except for the target network.
process_fn(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indice: numpy.ndarray) → tianshou.data.batch.Batch
Compute the n-step return for Q-learning targets.
More details can be found at compute_nstep_return().
compute_q_value(logits: torch.Tensor, mask: Optional[numpy.ndarray]) → torch.Tensor
Compute the q value based on the network’s raw output and action mask.
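One way to combine Q-values with an action mask is to shift the values of unavailable actions below every available one, so that a later argmax can never select them. The sketch below uses plain lists and hypothetical helper names (`masked_q_values`, `greedy_action`); it illustrates the idea and is not a copy of the library's code.

```python
def masked_q_values(q_values, mask):
    """Push the Q-values of unavailable actions below every available one."""
    # Any offset larger than the spread of the values works.
    offset = min(q_values) - max(q_values) - 1.0
    return [q if ok else q + offset for q, ok in zip(q_values, mask)]


def greedy_action(q_values, mask):
    """Index of the best available action."""
    masked = masked_q_values(q_values, mask)
    return max(range(len(masked)), key=masked.__getitem__)


q = [5.0, 1.0, 3.0]
print(masked_q_values(q, [False, True, True]))  # [0.0, 1.0, 3.0]
print(greedy_action(q, [False, True, True]))    # 2
```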
forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, model: str = 'model', input: str = 'obs', **kwargs: Any) → tianshou.data.batch.Batch
Compute action over the given batch data.
If you need to mask the action, please add a “mask” into batch.obs. For example, if we have an environment that has three actions “0/1/2”:
    batch == Batch(
        obs=Batch(
            obs="original obs, with batch_size=1 for demonstration",
            mask=np.array([[False, True, False]]),
            # action 1 is available
            # action 0 and 2 are unavailable
        ),
        ...
    )
- Parameters
eps (float) – in [0, 1], for epsilon-greedy exploration method.
- Returns
A Batch which has 3 keys:
act: the action.
logits: the network’s raw output.
state: the hidden state.
See also
Please refer to forward() for more detailed explanation.
learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float]
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
exploration_noise(act: Union[numpy.ndarray, tianshou.data.batch.Batch], batch: tianshou.data.batch.Batch) → Union[numpy.ndarray, tianshou.data.batch.Batch]
Modify the action from policy.forward with exploration noise.
- Parameters
act – a data batch or numpy.ndarray which is the action taken by policy.forward.
batch – the input batch for policy.forward, kept for advanced usage.
- Returns
action in the same form of input “act” but with added exploration noise.
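For a DQN-style policy, the eps parameter listed under forward() suggests epsilon-greedy exploration. A minimal sketch under that assumption follows; the function `epsilon_greedy`, its arguments and the use of Python lists are all hypothetical and chosen for illustration only.

```python
import random


def epsilon_greedy(greedy_acts, masks, eps, rng):
    """With probability eps, replace each greedy action by a random legal one."""
    noisy = []
    for act, mask in zip(greedy_acts, masks):
        if rng.random() < eps:
            legal = [i for i, ok in enumerate(mask) if ok]
            act = rng.choice(legal)
        noisy.append(act)
    return noisy


rng = random.Random(0)
masks = [[True, True, False], [False, True, True]]
print(epsilon_greedy([0, 2], masks, eps=0.0, rng=rng))  # [0, 2], no exploration
print(epsilon_greedy([0, 2], masks, eps=1.0, rng=rng))  # each entry resampled from legal actions
```

The output keeps the same form as the input actions, which matches the return contract stated above.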
training: bool
class tianshou.policy.C51Policy(model: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, num_atoms: int = 51, v_min: float = -10.0, v_max: float = 10.0, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)
Bases: tianshou.policy.modelfree.dqn.DQNPolicy
Implementation of Categorical Deep Q-Network. arXiv:1707.06887.
- Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
num_atoms (int) – the number of atoms in the support set of the value distribution. Default to 51.
v_min (float) – the value of the smallest atom in the support set. Default to -10.0.
v_max (float) – the value of the largest atom in the support set. Default to 10.0.
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network). Default to 0.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
See also
Please refer to DQNPolicy for more detailed explanation.
compute_q_value(logits: torch.Tensor, mask: Optional[numpy.ndarray]) → torch.Tensor
Compute the q value based on the network’s raw output and action mask.
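In a categorical value distribution, the Q-value of an action is the expectation over a fixed support of atoms. The sketch below builds an evenly spaced support from v_min, v_max and num_atoms and takes that expectation; the helper names `make_support` and `expected_q` are hypothetical, and even spacing of the atoms is an assumption in line with the C51 paper.

```python
def make_support(v_min=-10.0, v_max=10.0, num_atoms=51):
    """Evenly spaced atoms between v_min and v_max, inclusive."""
    delta = (v_max - v_min) / (num_atoms - 1)
    return [v_min + i * delta for i in range(num_atoms)]


def expected_q(probs, support):
    """Q(s, a) as the expectation of the categorical value distribution."""
    return sum(p * z for p, z in zip(probs, support))


support = make_support()
uniform = [1.0 / len(support)] * len(support)
print(len(support), support[0])
print(abs(expected_q(uniform, support)) < 1e-9)  # symmetric support, mean near 0
```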
learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float]
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
training: bool
class tianshou.policy.QRDQNPolicy(model: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, num_quantiles: int = 200, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)
Bases: tianshou.policy.modelfree.dqn.DQNPolicy
Implementation of Quantile Regression Deep Q-Network. arXiv:1710.10044.
- Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
num_quantiles (int) – the number of quantile midpoints in the inverse cumulative distribution function of the value. Default to 200.
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network).
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
See also
Please refer to DQNPolicy for more detailed explanation.
compute_q_value(logits: torch.Tensor, mask: Optional[numpy.ndarray]) → torch.Tensor
Compute the q value based on the network’s raw output and action mask.
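In quantile regression, the value distribution is represented by N quantile values at fixed midpoints, and a Q-value can be recovered as their mean. The following sketch shows both pieces in plain Python; the helper names are hypothetical, and the midpoint formula (2i + 1) / (2N) follows the QR-DQN paper rather than anything stated on this page.

```python
def quantile_midpoints(num_quantiles):
    """Midpoints tau_hat_i = (2i + 1) / (2N) of N equal-probability bins."""
    return [(2 * i + 1) / (2 * num_quantiles) for i in range(num_quantiles)]


def q_from_quantiles(quantile_values):
    """Q(s, a) as the mean of the predicted quantile values."""
    return sum(quantile_values) / len(quantile_values)


print(quantile_midpoints(4))                   # [0.125, 0.375, 0.625, 0.875]
print(q_from_quantiles([1.0, 2.0, 3.0, 6.0]))  # 3.0
```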
learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float]
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
training: bool
class tianshou.policy.IQNPolicy(model: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, sample_size: int = 32, online_sample_size: int = 8, target_sample_size: int = 8, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)
Bases: tianshou.policy.modelfree.qrdqn.QRDQNPolicy
Implementation of Implicit Quantile Network. arXiv:1806.06923.
- Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
sample_size (int) – the number of samples for policy evaluation. Default to 32.
online_sample_size (int) – the number of samples for online model in training. Default to 8.
target_sample_size (int) – the number of samples for target model in training. Default to 8.
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network).
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
See also
Please refer to QRDQNPolicy for more detailed explanation.
forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, model: str = 'model', input: str = 'obs', **kwargs: Any) → tianshou.data.batch.Batch
Compute action over the given batch data.
If you need to mask the action, please add a “mask” into batch.obs. For example, if we have an environment that has three actions “0/1/2”:
    batch == Batch(
        obs=Batch(
            obs="original obs, with batch_size=1 for demonstration",
            mask=np.array([[False, True, False]]),
            # action 1 is available
            # action 0 and 2 are unavailable
        ),
        ...
    )
- Parameters
eps (float) – in [0, 1], for epsilon-greedy exploration method.
- Returns
A Batch which has 3 keys:
act: the action.
logits: the network’s raw output.
state: the hidden state.
See also
Please refer to forward() for more detailed explanation.
learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float]
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
training: bool
class tianshou.policy.FQFPolicy(model: tianshou.utils.net.discrete.FullQuantileFunction, optim: torch.optim.optimizer.Optimizer, fraction_model: tianshou.utils.net.discrete.FractionProposalNetwork, fraction_optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, num_fractions: int = 32, ent_coef: float = 0.0, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)
Bases: tianshou.policy.modelfree.qrdqn.QRDQNPolicy
Implementation of Fully-parameterized Quantile Function. arXiv:1911.02140.
- Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
fraction_model (FractionProposalNetwork) – a FractionProposalNetwork for proposing fractions/quantiles given state.
fraction_optim (torch.optim.Optimizer) – a torch.optim for optimizing the fraction model above.
discount_factor (float) – in [0, 1].
num_fractions (int) – the number of fractions to use. Default to 32.
ent_coef (float) – the coefficient for entropy loss. Default to 0.
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network).
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
See also
Please refer to QRDQNPolicy for more detailed explanation.
forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, model: str = 'model', input: str = 'obs', fractions: Optional[tianshou.data.batch.Batch] = None, **kwargs: Any) → tianshou.data.batch.Batch
Compute action over the given batch data.
If you need to mask the action, please add a “mask” into batch.obs. For example, if we have an environment that has three actions “0/1/2”:
    batch == Batch(
        obs=Batch(
            obs="original obs, with batch_size=1 for demonstration",
            mask=np.array([[False, True, False]]),
            # action 1 is available
            # action 0 and 2 are unavailable
        ),
        ...
    )
- Parameters
eps (float) – in [0, 1], for epsilon-greedy exploration method.
- Returns
A Batch which has 3 keys:
act: the action.
logits: the network’s raw output.
state: the hidden state.
See also
Please refer to forward() for more detailed explanation.
learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float]
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
training: bool
On-policy
class tianshou.policy.PGPolicy(model: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, dist_fn: Type[torch.distributions.distribution.Distribution], discount_factor: float = 0.99, reward_normalization: bool = False, action_scaling: bool = True, action_bound_method: str = 'clip', lr_scheduler: Optional[torch.optim.lr_scheduler.LambdaLR] = None, deterministic_eval: bool = False, **kwargs: Any)
Bases: tianshou.policy.base.BasePolicy
Implementation of REINFORCE algorithm.
- Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
dist_fn (Type[torch.distributions.Distribution]) – distribution class for computing the action.
discount_factor (float) – in [0, 1]. Default to 0.99.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_space.low, action_space.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
See also
Please refer to BasePolicy for more detailed explanation.
process_fn(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indice: numpy.ndarray) → tianshou.data.batch.Batch
Compute the discounted returns for each transition.
\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]
where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).
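The discounted return can be computed for every step of one finished episode with a single backward pass, using the recursion G_t = r_t + gamma * G_{t+1}. The function `discounted_returns` below is a hypothetical plain-Python sketch of that formula, not the library routine.

```python
def discounted_returns(rewards, gamma=0.99):
    """G_t = sum_{i=t}^{T} gamma^(i-t) * r_i for one finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Backward recursion: G_t = r_t + gamma * G_{t+1}.
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```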
forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, **kwargs: Any) → tianshou.data.batch.Batch
Compute action over the given batch data.
- Returns
A Batch which has 4 keys:
act: the action.
logits: the network’s raw output.
dist: the action distribution.
state: the hidden state.
See also
Please refer to forward() for more detailed explanation.
learn(batch: tianshou.data.batch.Batch, batch_size: int, repeat: int, **kwargs: Any) → Dict[str, List[float]]
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
training: bool
class tianshou.policy.NPGPolicy(actor: torch.nn.modules.module.Module, critic: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, dist_fn: Type[torch.distributions.distribution.Distribution], advantage_normalization: bool = True, optim_critic_iters: int = 5, actor_step_size: float = 0.5, **kwargs: Any)
Bases: tianshou.policy.modelfree.a2c.A2CPolicy
Implementation of Natural Policy Gradient.
https://proceedings.neurips.cc/paper/2001/file/4b86abe48d358ecf194c56c69108433e-Paper.pdf
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critic (torch.nn.Module) – the critic network. (s -> V(s))
optim (torch.optim.Optimizer) – the optimizer for actor and critic network.
dist_fn (Type[torch.distributions.Distribution]) – distribution class for computing the action.
advantage_normalization (bool) – whether to do per mini-batch advantage normalization. Default to True.
optim_critic_iters (int) – Number of times to optimize critic network per update. Default to 5.
gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation. Default to 0.95.
reward_normalization (bool) – normalize estimated values to have std close to 1. Default to False.
max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 256.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_space.low, action_space.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
process_fn(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indice: numpy.ndarray) → tianshou.data.batch.Batch
Compute the discounted returns for each transition.
\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]
where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).
learn(batch: tianshou.data.batch.Batch, batch_size: int, repeat: int, **kwargs: Any) → Dict[str, List[float]]
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
training: bool
class tianshou.policy.A2CPolicy(actor: torch.nn.modules.module.Module, critic: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, dist_fn: Type[torch.distributions.distribution.Distribution], vf_coef: float = 0.5, ent_coef: float = 0.01, max_grad_norm: Optional[float] = None, gae_lambda: float = 0.95, max_batchsize: int = 256, **kwargs: Any)
Bases: tianshou.policy.modelfree.pg.PGPolicy
Implementation of Synchronous Advantage Actor-Critic. arXiv:1602.01783.
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critic (torch.nn.Module) – the critic network. (s -> V(s))
optim (torch.optim.Optimizer) – the optimizer for actor and critic network.
dist_fn (Type[torch.distributions.Distribution]) – distribution class for computing the action.
discount_factor (float) – in [0, 1]. Default to 0.99.
vf_coef (float) – weight for value loss. Default to 0.5.
ent_coef (float) – weight for entropy loss. Default to 0.01.
max_grad_norm (float) – clipping gradients in back propagation. Default to None.
gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation. Default to 0.95.
reward_normalization (bool) – normalize estimated values to have std close to 1. Default to False.
max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 256.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
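The role of gae_lambda can be illustrated with a small sketch of Generalized Advantage Estimation. It is written in plain Python under stated assumptions: the function name gae_advantages and its list-based arguments are hypothetical, and batching via max_batchsize is left out.

```python
from typing import List


def gae_advantages(
    rewards: List[float],
    values: List[float],
    next_values: List[float],
    dones: List[bool],
    gamma: float = 0.99,
    gae_lambda: float = 0.95,
) -> List[float]:
    """Compute GAE advantages with a single backward pass over the transitions."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # one-step TD error; the bootstrap term is dropped at episode ends
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        # exponentially weighted sum of TD errors, reset at episode ends
        running = delta + gamma * gae_lambda * not_done * running
        advantages[t] = running
    return advantages
```

Setting gae_lambda to 0 reduces the estimate to the one-step TD error, while gae_lambda = 1 sums all TD errors until the end of the episode.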
See also
Please refer to BasePolicy for more detailed explanation.
-
process_fn
(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indice: numpy.ndarray) → tianshou.data.batch.Batch[source]¶ Compute the discounted returns for each transition.
\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]
where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).
-
learn
(batch: tianshou.data.batch.Batch, batch_size: int, repeat: int, **kwargs: Any) → Dict[str, List[float]][source]¶ Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
-
training
: bool¶
-
class
tianshou.policy.
TRPOPolicy
(actor: torch.nn.modules.module.Module, critic: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, dist_fn: Type[torch.distributions.distribution.Distribution], max_kl: float = 0.01, backtrack_coeff: float = 0.8, max_backtracks: int = 10, **kwargs: Any)[source]¶ Bases:
tianshou.policy.modelfree.npg.NPGPolicy
Implementation of Trust Region Policy Optimization. arXiv:1502.05477.
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critic (torch.nn.Module) – the critic network. (s -> V(s))
optim (torch.optim.Optimizer) – the optimizer for actor and critic network.
dist_fn (Type[torch.distributions.Distribution]) – distribution class for computing the action.
advantage_normalization (bool) – whether to do per mini-batch advantage normalization. Default to True.
optim_critic_iters (int) – Number of times to optimize critic network per update. Default to 5.
max_kl (float) – max kl-divergence used to constrain each actor network update. Default to 0.01.
backtrack_coeff (float) – Coefficient to be multiplied by step size when constraints are not met. Default to 0.8.
max_backtracks (int) – Max number of backtracking times in linesearch. Default to 10.
gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation. Default to 0.95.
reward_normalization (bool) – normalize estimated values to have std close to 1. Default to False.
max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 256.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
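The parameters max_kl, backtrack_coeff and max_backtracks together describe a backtracking line search. The following is a minimal scalar sketch, assuming two hypothetical callables kl_of and gain_of that report the KL divergence and the surrogate improvement for a candidate step; it does not reproduce the library's tensor-based code.

```python
from typing import Callable


def backtracking_step(
    full_step: float,
    kl_of: Callable[[float], float],
    gain_of: Callable[[float], float],
    max_kl: float = 0.01,
    backtrack_coeff: float = 0.8,
    max_backtracks: int = 10,
) -> float:
    """Shrink the step until the KL constraint holds and the objective improves."""
    for i in range(max_backtracks):
        # the i-th candidate is the full step multiplied by backtrack_coeff ** i
        step = full_step * backtrack_coeff ** i
        if kl_of(step) <= max_kl and gain_of(step) > 0.0:
            return step
    # no candidate satisfied the constraints: keep the old parameters
    return 0.0
```

In this sketch a failed search returns a zero step, so the actor is left unchanged for that update.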
-
learn
(batch: tianshou.data.batch.Batch, batch_size: int, repeat: int, **kwargs: Any) → Dict[str, List[float]][source]¶ Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
-
training
: bool¶
-
class
tianshou.policy.
PPOPolicy
(actor: torch.nn.modules.module.Module, critic: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, dist_fn: Type[torch.distributions.distribution.Distribution], eps_clip: float = 0.2, dual_clip: Optional[float] = None, value_clip: bool = False, advantage_normalization: bool = True, recompute_advantage: bool = False, **kwargs: Any)[source]¶ Bases:
tianshou.policy.modelfree.a2c.A2CPolicy
Implementation of Proximal Policy Optimization. arXiv:1707.06347.
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critic (torch.nn.Module) – the critic network. (s -> V(s))
optim (torch.optim.Optimizer) – the optimizer for actor and critic network.
dist_fn (Type[torch.distributions.Distribution]) – distribution class for computing the action.
discount_factor (float) – in [0, 1]. Default to 0.99.
eps_clip (float) – \(\epsilon\) in \(L_{CLIP}\) in the original paper. Default to 0.2.
dual_clip (float) – a parameter c mentioned in arXiv:1912.09729 Equ. 5, where c > 1 is a constant indicating the lower bound. Default to None (dual clipping is not used).
value_clip (bool) – a parameter mentioned in arXiv:1811.02553 Sec. 4.1. Default to False.
advantage_normalization (bool) – whether to do per mini-batch advantage normalization. Default to True.
recompute_advantage (bool) – whether to recompute advantage every update repeat according to https://arxiv.org/pdf/2006.05990.pdf Sec. 3.5. Default to False.
vf_coef (float) – weight for value loss. Default to 0.5.
ent_coef (float) – weight for entropy loss. Default to 0.01.
max_grad_norm (float) – clipping gradients in back propagation. Default to None.
gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation. Default to 0.95.
reward_normalization (bool) – normalize estimated values to have std close to 1, also normalize the advantage to Normal(0, 1). Default to False.
max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 256.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
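The effect of eps_clip and dual_clip on a single sample can be sketched as a scalar function. This is an illustrative reading of the clipped surrogate objective, assuming a hypothetical helper name ppo_clip_objective; the library operates on tensors and averages over a mini-batch.

```python
from typing import Optional


def ppo_clip_objective(
    ratio: float,
    advantage: float,
    eps_clip: float = 0.2,
    dual_clip: Optional[float] = None,
) -> float:
    """Per-sample clipped surrogate objective (to be maximized)."""
    # clip the probability ratio to [1 - eps_clip, 1 + eps_clip]
    clipped_ratio = max(1.0 - eps_clip, min(1.0 + eps_clip, ratio))
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    if dual_clip is not None and advantage < 0.0:
        # dual clip: bound the objective from below by c * A when A < 0
        surrogate = max(surrogate, dual_clip * advantage)
    return surrogate
```

Without dual_clip, a large ratio combined with a negative advantage is unbounded from below; with dual_clip = 3.0 the same sample contributes at least 3.0 times the advantage.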
See also
Please refer to BasePolicy for more detailed explanation.
-
process_fn
(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indice: numpy.ndarray) → tianshou.data.batch.Batch[source]¶ Compute the discounted returns for each transition.
\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]
where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).
-
learn
(batch: tianshou.data.batch.Batch, batch_size: int, repeat: int, **kwargs: Any) → Dict[str, List[float]][source]¶ Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
-
training
: bool¶
Off-policy¶
-
class
tianshou.policy.
DDPGPolicy
(actor: Optional[torch.nn.modules.module.Module], actor_optim: Optional[torch.optim.optimizer.Optimizer], critic: Optional[torch.nn.modules.module.Module], critic_optim: Optional[torch.optim.optimizer.Optimizer], tau: float = 0.005, gamma: float = 0.99, exploration_noise: Optional[tianshou.exploration.random.BaseNoise] = <tianshou.exploration.random.GaussianNoise object>, reward_normalization: bool = False, estimation_step: int = 1, action_scaling: bool = True, action_bound_method: str = 'clip', **kwargs: Any)[source]¶ Bases:
tianshou.policy.base.BasePolicy
Implementation of Deep Deterministic Policy Gradient. arXiv:1509.02971.
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critic (torch.nn.Module) – the critic network. (s, a -> Q(s, a))
critic_optim (torch.optim.Optimizer) – the optimizer for critic network.
tau (float) – param for soft update of the target network. Default to 0.005.
gamma (float) – discount factor, in [0, 1]. Default to 0.99.
exploration_noise (BaseNoise) – the exploration noise, added to the action. Default to GaussianNoise(sigma=0.1).
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
estimation_step (int) – the number of steps to look ahead. Default to 1.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action) or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
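The tau parameter controls a soft (Polyak) update of the target network. A minimal sketch on plain lists of weights, assuming a hypothetical helper soft_update rather than the library's in-place tensor update:

```python
from typing import List


def soft_update(target: List[float], source: List[float], tau: float = 0.005) -> List[float]:
    """Return tau * source + (1 - tau) * target, element by element."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]
```

A small tau such as 0.005 makes the target weights trail the online weights slowly; tau = 1 would copy the online weights outright.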
See also
Please refer to BasePolicy for more detailed explanation.
-
set_exp_noise
(noise: Optional[tianshou.exploration.random.BaseNoise]) → None[source]¶ Set the exploration noise.
-
train
(mode: bool = True) → tianshou.policy.modelfree.ddpg.DDPGPolicy[source]¶ Set the module in training mode, except for the target network.
-
process_fn
(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indice: numpy.ndarray) → tianshou.data.batch.Batch[source]¶ Pre-process the data from the provided replay buffer.
Used in update(). Check out policy.process_fn for more information.
-
forward
(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, model: str = 'actor', input: str = 'obs', **kwargs: Any) → tianshou.data.batch.Batch[source]¶ Compute action over the given batch data.
- Returns
A Batch which has 2 keys:
act – the action.
state – the hidden state.
See also
Please refer to forward() for more detailed explanation.
-
learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
-
exploration_noise
(act: Union[numpy.ndarray, tianshou.data.batch.Batch], batch: tianshou.data.batch.Batch) → Union[numpy.ndarray, tianshou.data.batch.Batch][source]¶ Modify the action from policy.forward with exploration noise.
- Parameters
act – a data batch or numpy.ndarray which is the action taken by policy.forward.
batch – the input batch for policy.forward, kept for advanced usage.
- Returns
action in the same form of input “act” but with added exploration noise.
-
training
: bool¶
-
class
tianshou.policy.
TD3Policy
(actor: torch.nn.modules.module.Module, actor_optim: torch.optim.optimizer.Optimizer, critic1: torch.nn.modules.module.Module, critic1_optim: torch.optim.optimizer.Optimizer, critic2: torch.nn.modules.module.Module, critic2_optim: torch.optim.optimizer.Optimizer, tau: float = 0.005, gamma: float = 0.99, exploration_noise: Optional[tianshou.exploration.random.BaseNoise] = <tianshou.exploration.random.GaussianNoise object>, policy_noise: float = 0.2, update_actor_freq: int = 2, noise_clip: float = 0.5, reward_normalization: bool = False, estimation_step: int = 1, **kwargs: Any)[source]¶ Bases:
tianshou.policy.modelfree.ddpg.DDPGPolicy
Implementation of TD3, arXiv:1802.09477.
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critic1 (torch.nn.Module) – the first critic network. (s, a -> Q(s, a))
critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.
critic2 (torch.nn.Module) – the second critic network. (s, a -> Q(s, a))
critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.
tau (float) – param for soft update of the target network. Default to 0.005.
gamma (float) – discount factor, in [0, 1]. Default to 0.99.
exploration_noise (BaseNoise) – the exploration noise, added to the action. Default to GaussianNoise(sigma=0.1).
policy_noise (float) – the noise used in updating policy network. Default to 0.2.
update_actor_freq (int) – the update frequency of actor network. Default to 2.
noise_clip (float) – the clipping range used in updating policy network. Default to 0.5.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action) or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
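The parameters policy_noise and noise_clip describe target policy smoothing: clipped Gaussian noise is added to the target action before the target Q-value is computed. A minimal sketch, assuming a hypothetical helper smoothed_target_action that works on a list of action components:

```python
import random
from typing import List, Optional


def smoothed_target_action(
    target_action: List[float],
    policy_noise: float = 0.2,
    noise_clip: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Add clipped zero-mean Gaussian noise to each component of the target action."""
    rng = rng or random.Random()
    smoothed = []
    for a in target_action:
        # noise with standard deviation policy_noise
        noise = rng.gauss(0.0, policy_noise)
        if noise_clip > 0.0:
            # keep the perturbation inside [-noise_clip, noise_clip]
            noise = max(-noise_clip, min(noise_clip, noise))
        smoothed.append(a + noise)
    return smoothed
```

The clipping keeps the smoothed action close to the target actor's output, so the critic target is averaged over a small neighbourhood of actions only.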
See also
Please refer to BasePolicy for more detailed explanation.
-
train
(mode: bool = True) → tianshou.policy.modelfree.td3.TD3Policy[source]¶ Set the module in training mode, except for the target network.
-
learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
-
training
: bool¶
-
actor
: torch.nn.modules.module.Module¶
-
actor_optim
: torch.optim.optimizer.Optimizer¶
-
critic
: torch.nn.modules.module.Module¶
-
critic_optim
: torch.optim.optimizer.Optimizer¶
-
class
tianshou.policy.
SACPolicy
(actor: torch.nn.modules.module.Module, actor_optim: torch.optim.optimizer.Optimizer, critic1: torch.nn.modules.module.Module, critic1_optim: torch.optim.optimizer.Optimizer, critic2: torch.nn.modules.module.Module, critic2_optim: torch.optim.optimizer.Optimizer, tau: float = 0.005, gamma: float = 0.99, alpha: Union[float, Tuple[float, torch.Tensor, torch.optim.optimizer.Optimizer]] = 0.2, reward_normalization: bool = False, estimation_step: int = 1, exploration_noise: Optional[tianshou.exploration.random.BaseNoise] = None, deterministic_eval: bool = True, **kwargs: Any)[source]¶ Bases:
tianshou.policy.modelfree.ddpg.DDPGPolicy
Implementation of Soft Actor-Critic. arXiv:1812.05905.
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critic1 (torch.nn.Module) – the first critic network. (s, a -> Q(s, a))
critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.
critic2 (torch.nn.Module) – the second critic network. (s, a -> Q(s, a))
critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.
tau (float) – param for soft update of the target network. Default to 0.005.
gamma (float) – discount factor, in [0, 1]. Default to 0.99.
alpha (float or (float, torch.Tensor, torch.optim.Optimizer)) – entropy regularization coefficient. Default to 0.2. If a tuple (target_entropy, log_alpha, alpha_optim) is provided, then alpha is automatically tuned.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
exploration_noise (BaseNoise) – add a noise to action for exploration. Default to None. This is useful when solving hard-exploration problems.
deterministic_eval (bool) – whether to use deterministic action (mean of Gaussian policy) instead of stochastic action sampled by the policy. Default to True.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action) or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
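When alpha is given as the tuple (target_entropy, log_alpha, alpha_optim), the coefficient is adjusted so that the policy entropy tracks target_entropy. The sketch below shows one such adjustment step in plain Python. It rests on assumptions: the loss form -mean(log_alpha * (log_prob + target_entropy)) is the commonly used one, the optimizer is replaced by plain gradient descent with a hypothetical learning rate lr, and the function name tune_alpha is made up for illustration.

```python
import math
from typing import List, Tuple


def tune_alpha(
    log_alpha: float,
    log_probs: List[float],
    target_entropy: float,
    lr: float = 3e-4,
) -> Tuple[float, float]:
    """One gradient-descent step on alpha_loss = -mean(log_alpha * (log_prob + target_entropy))."""
    # d(alpha_loss) / d(log_alpha) = -mean(log_prob + target_entropy)
    grad = -sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    log_alpha = log_alpha - lr * grad
    # alpha itself is recovered as exp(log_alpha), so it stays positive
    return log_alpha, math.exp(log_alpha)
```

If the sampled actions have lower entropy than the target (the mean of -log_prob is below target_entropy), alpha grows and the entropy bonus is weighted more strongly; in the opposite case alpha shrinks.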
See also
Please refer to BasePolicy for more detailed explanation.
-
actor
: torch.nn.modules.module.Module¶
-
actor_optim
: torch.optim.optimizer.Optimizer¶
-
train
(mode: bool = True) → tianshou.policy.modelfree.sac.SACPolicy[source]¶ Set the module in training mode, except for the target network.
-
forward
(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, input: str = 'obs', **kwargs: Any) → tianshou.data.batch.Batch[source]¶ Compute action over the given batch data.
- Returns
A Batch which has 2 keys:
act – the action.
state – the hidden state.
See also
Please refer to forward() for more detailed explanation.
-
training
: bool¶
-
critic
: torch.nn.modules.module.Module¶
-
critic_optim
: torch.optim.optimizer.Optimizer¶
-
learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
-
class
tianshou.policy.
DiscreteSACPolicy
(actor: torch.nn.modules.module.Module, actor_optim: torch.optim.optimizer.Optimizer, critic1: torch.nn.modules.module.Module, critic1_optim: torch.optim.optimizer.Optimizer, critic2: torch.nn.modules.module.Module, critic2_optim: torch.optim.optimizer.Optimizer, tau: float = 0.005, gamma: float = 0.99, alpha: Union[float, Tuple[float, torch.Tensor, torch.optim.optimizer.Optimizer]] = 0.2, reward_normalization: bool = False, estimation_step: int = 1, **kwargs: Any)[source]¶ Bases:
tianshou.policy.modelfree.sac.SACPolicy
Implementation of SAC for Discrete Action Settings. arXiv:1910.07207.
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critic1 (torch.nn.Module) – the first critic network. (s -> Q(s))
critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.
critic2 (torch.nn.Module) – the second critic network. (s -> Q(s))
critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.
tau (float) – param for soft update of the target network. Default to 0.005.
gamma (float) – discount factor, in [0, 1]. Default to 0.99.
alpha (float or (float, torch.Tensor, torch.optim.Optimizer)) – entropy regularization coefficient. Default to 0.2. If a tuple (target_entropy, log_alpha, alpha_optim) is provided, then alpha is automatically tuned.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
See also
Please refer to BasePolicy for more detailed explanation.
-
forward
(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, input: str = 'obs', **kwargs: Any) → tianshou.data.batch.Batch[source]¶ Compute action over the given batch data.
- Returns
A Batch which has 2 keys:
act – the action.
state – the hidden state.
See also
Please refer to forward() for more detailed explanation.
-
learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
-
exploration_noise
(act: Union[numpy.ndarray, tianshou.data.batch.Batch], batch: tianshou.data.batch.Batch) → Union[numpy.ndarray, tianshou.data.batch.Batch][source]¶ Modify the action from policy.forward with exploration noise.
- Parameters
act – a data batch or numpy.ndarray which is the action taken by policy.forward.
batch – the input batch for policy.forward, kept for advanced usage.
- Returns
action in the same form of input “act” but with added exploration noise.
-
training
: bool¶
-
actor
: torch.nn.modules.module.Module¶
-
actor_optim
: torch.optim.optimizer.Optimizer¶
-
critic
: torch.nn.modules.module.Module¶
-
critic_optim
: torch.optim.optimizer.Optimizer¶
Imitation¶
-
class
tianshou.policy.
ImitationPolicy
(model: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, **kwargs: Any)[source]¶ Bases:
tianshou.policy.base.BasePolicy
Implementation of vanilla imitation learning.
- Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> a)
optim (torch.optim.Optimizer) – for optimizing the model.
action_space (gym.Space) – env’s action space.
See also
Please refer to BasePolicy for more detailed explanation.
-
forward
(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, **kwargs: Any) → tianshou.data.batch.Batch[source]¶ Compute action over the given batch data.
- Returns
A Batch which MUST have the following keys:
act – a numpy.ndarray or a torch.Tensor, the action over given batch data.
state – a dict, a numpy.ndarray or a torch.Tensor, the internal state of the policy, None as default.
Other keys are user-defined. It depends on the algorithm. For example,
    # some code
    return Batch(logits=..., act=..., state=None, dist=...)
The keyword policy is reserved and the corresponding data will be stored into the replay buffer. For instance,
    # some code
    return Batch(..., policy=Batch(log_prob=dist.log_prob(act)))
    # and in the sampled data batch, you can directly use
    # batch.policy.log_prob to get your data.
Note
In continuous action space, you should do another step “map_action” to get the real action:
    act = policy(batch).act  # doesn't map to the target action range
    act = policy.map_action(act, batch)
-
learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
-
training
: bool¶
-
class
tianshou.policy.
DiscreteBCQPolicy
(model: torch.nn.modules.module.Module, imitator: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, estimation_step: int = 1, target_update_freq: int = 8000, eval_eps: float = 0.001, unlikely_action_threshold: float = 0.3, imitation_logits_penalty: float = 0.01, reward_normalization: bool = False, **kwargs: Any)[source]¶ Bases:
tianshou.policy.modelfree.dqn.DQNPolicy
Implementation of discrete BCQ algorithm. arXiv:1910.01708.
- Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> q_value)
imitator (torch.nn.Module) – a model following the rules in BasePolicy. (s -> imitation_logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency.
eval_eps (float) – the epsilon-greedy noise added in evaluation.
unlikely_action_threshold (float) – the threshold (tau) for unlikely actions, as shown in Equ. (17) in the paper. Default to 0.3.
imitation_logits_penalty (float) – regularization weight for imitation logits. Default to 1e-2.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
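The role of unlikely_action_threshold can be sketched as a filter on the greedy action: an action is discarded when its imitation probability, relative to the most likely action, falls below the threshold. The helper below is a hypothetical plain-Python illustration for a single state, not the library's batched code; treating thresholds outside (0, 1) as "no filtering" is an assumption of this sketch.

```python
import math
from typing import List


def bcq_select_action(
    q_values: List[float],
    imitation_logits: List[float],
    unlikely_action_threshold: float = 0.3,
) -> int:
    """Pick the highest-Q action among those the imitator considers likely enough."""
    t = unlikely_action_threshold
    # assumption: thresholds outside (0, 1) disable the filter
    log_tau = math.log(t) if 0.0 < t < 1.0 else -math.inf
    max_logit = max(imitation_logits)
    best_action, best_q = -1, -math.inf
    for action, (q, logit) in enumerate(zip(q_values, imitation_logits)):
        # logit - max_logit equals log(p(a) / max_a' p(a')) under a softmax
        if logit - max_logit < log_tau:
            continue  # unlikely under the behaviour data, skip
        if q > best_q:
            best_action, best_q = action, q
    return best_action
```

The most likely action always passes the filter, so the selection is never empty.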
See also
Please refer to BasePolicy for more detailed explanation.
-
train
(mode: bool = True) → tianshou.policy.imitation.discrete_bcq.DiscreteBCQPolicy[source]¶ Set the module in training mode, except for the target network.
-
forward
(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, input: str = 'obs', **kwargs: Any) → tianshou.data.batch.Batch[source]¶ Compute action over the given batch data.
If you need to mask the action, please add a “mask” into batch.obs, for example, if we have an environment that has “0/1/2” three actions:
    batch == Batch(
        obs=Batch(
            obs="original obs, with batch_size=1 for demonstration",
            mask=np.array([[False, True, False]]),
            # action 1 is available
            # action 0 and 2 are unavailable
        ),
        ...
    )
- Parameters
eps (float) – in [0, 1], for epsilon-greedy exploration method.
- Returns
A Batch which has 3 keys:
act – the action.
logits – the network’s raw output.
state – the hidden state.
See also
Please refer to forward() for more detailed explanation.
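How an action mask and eps interact can be illustrated for a single state. The sketch assumes a hypothetical helper masked_eps_greedy and a mask with at least one available action; it is a simplified reading of masked epsilon-greedy selection, not the library's implementation.

```python
import random
from typing import List, Optional


def masked_eps_greedy(
    q_values: List[float],
    mask: List[bool],
    eps: float = 0.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Choose an action among the available ones (mask[a] is True)."""
    rng = rng or random.Random()
    available = [a for a, ok in enumerate(mask) if ok]
    if rng.random() < eps:
        # explore: uniform choice restricted to available actions
        return rng.choice(available)
    # exploit: best Q-value restricted to available actions
    return max(available, key=lambda a: q_values[a])
```

With the mask [False, True, False] from the example above, action 1 is returned regardless of the Q-values or of eps.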
-
learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
-
training
: bool¶
-
class
tianshou.policy.
DiscreteCQLPolicy
(model: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, num_quantiles: int = 200, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, min_q_weight: float = 10.0, **kwargs: Any)[source]¶ Bases:
tianshou.policy.modelfree.qrdqn.QRDQNPolicy
Implementation of discrete Conservative Q-Learning algorithm. arXiv:2006.04779.
- Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
num_quantiles (int) – the number of quantile midpoints in the inverse cumulative distribution function of the value. Default to 200.
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network).
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
min_q_weight (float) – the weight for the cql loss.
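To make the role of min_q_weight concrete, here is a minimal pure-Python sketch of the conservative regularizer for discrete actions as described in the CQL paper: the log-sum-exp of the Q-values minus the Q-value of the action found in the dataset. The function names and the td_loss argument are hypothetical, and the library's actual loss may differ in detail.

```python
import math


def cql_regularizer(q_values, actions):
    """Batch mean of logsumexp_a Q(s, a) - Q(s, a_data)."""
    total = 0.0
    for q_row, act in zip(q_values, actions):
        m = max(q_row)  # subtract the max for numerical stability
        logsumexp = m + math.log(sum(math.exp(q - m) for q in q_row))
        total += logsumexp - q_row[act]
    return total / len(q_values)


def total_loss(td_loss, q_values, actions, min_q_weight=10.0):
    """Add the weighted regularizer to a (hypothetical) TD/quantile loss."""
    return td_loss + min_q_weight * cql_regularizer(q_values, actions)


q_values = [[0.0, 0.0], [1.0, 1.0]]  # made-up Q(s, *) for two states
actions = [0, 1]                     # actions stored in the dataset
penalty = cql_regularizer(q_values, actions)
loss = total_loss(1.0, q_values, actions)
```

A larger min_q_weight pushes down the Q-values of actions that do not appear in the dataset more strongly.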
See also
Please refer to QRDQNPolicy for more detailed explanation.
-
learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
To distinguish the collecting, updating and testing states, you can check the policy state via self.training and self.updating. Please refer to States for policy for a more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: a Categorical distribution gives a “[batch_size]” shape while a Normal distribution gives a “[batch_size, 1]” shape. The auto-broadcasting of numerical operations on torch tensors will amplify this error.
-
training
: bool¶
-
class
tianshou.policy.
DiscreteCRRPolicy
(actor: torch.nn.modules.module.Module, critic: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, policy_improvement_mode: str = 'exp', ratio_upper_bound: float = 20.0, beta: float = 1.0, min_q_weight: float = 10.0, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)[source]¶ Bases:
tianshou.policy.modelfree.pg.PGPolicy
Implementation of discrete Critic Regularized Regression. arXiv:2006.15134.
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critic (torch.nn.Module) – the action-value critic (i.e., Q function) network. (s -> Q(s, *))
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1]. Default to 0.99.
policy_improvement_mode (str) – type of the weight function f. Possible values: “binary”/“exp”/“all”. Default to “exp”.
ratio_upper_bound (float) – when policy_improvement_mode is “exp”, the value of the exp function is upper-bounded by this parameter. Default to 20.
beta (float) – when policy_improvement_mode is “exp”, this is the denominator of the exp function. Default to 1.
min_q_weight (float) – weight for CQL loss/regularizer. Default to 10.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network). Default to 0.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
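The three policy_improvement_mode values can be read as three weight functions applied to the advantage. The sketch below follows the parameter descriptions above and the CRR paper; crr_weight is a hypothetical helper, not a Tianshou function, and it works on a single scalar advantage for clarity.

```python
import math


def crr_weight(advantage, mode="exp", beta=1.0, ratio_upper_bound=20.0):
    """Weight f(advantage) applied to the actor's log-likelihood term."""
    if mode == "binary":
        # Only imitate actions with a positive advantage.
        return 1.0 if advantage > 0 else 0.0
    if mode == "exp":
        scaled = advantage / beta
        # Clip before exponentiating so large advantages cannot overflow.
        if scaled >= math.log(ratio_upper_bound):
            return ratio_upper_bound
        return math.exp(scaled)
    if mode == "all":
        # Plain behavior cloning: every action gets the same weight.
        return 1.0
    raise ValueError(f"unknown policy_improvement_mode: {mode!r}")


weights = [crr_weight(a, mode="exp") for a in (-1.0, 0.0, 100.0)]
```

A smaller beta sharpens the "exp" weighting, and ratio_upper_bound keeps a few very large advantages from dominating the update.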
See also
Please refer to PGPolicy for more detailed explanation.
-
training
: bool¶
-
learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
To distinguish the collecting, updating and testing states, you can check the policy state via self.training and self.updating. Please refer to States for policy for a more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: a Categorical distribution gives a “[batch_size]” shape while a Normal distribution gives a “[batch_size, 1]” shape. The auto-broadcasting of numerical operations on torch tensors will amplify this error.
Model-based¶
-
class
tianshou.policy.
PSRLPolicy
(trans_count_prior: numpy.ndarray, rew_mean_prior: numpy.ndarray, rew_std_prior: numpy.ndarray, discount_factor: float = 0.99, epsilon: float = 0.01, add_done_loop: bool = False, **kwargs: Any)[source]¶ Bases:
tianshou.policy.base.BasePolicy
Implementation of Posterior Sampling Reinforcement Learning.
Reference: Strens, M. A Bayesian framework for reinforcement learning. ICML, 2000: 943-950.
- Parameters
trans_count_prior (np.ndarray) – Dirichlet prior (alphas), with shape (n_state, n_action, n_state).
rew_mean_prior (np.ndarray) – means of the normal priors of rewards, with shape (n_state, n_action).
rew_std_prior (np.ndarray) – standard deviations of the normal priors of rewards, with shape (n_state, n_action).
discount_factor (float) – in [0, 1].
epsilon (float) – for precision control in value iteration.
add_done_loop (bool) – whether to add an extra self-loop for the terminal state in MDP. Default to False.
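How these parameters fit together can be sketched in pure Python: one MDP is drawn from the priors, then solved by value iteration until the largest value change falls below epsilon. This is an illustration under stated assumptions, not the library's code: sample_mdp and value_iteration are hypothetical helpers, nested lists stand in for the np.ndarray arguments, and posterior updates from observed data are omitted.

```python
import random


def sample_mdp(trans_count_prior, rew_mean_prior, rew_std_prior, rng):
    """Draw one MDP: Dirichlet transitions and normal rewards."""
    trans, rew = [], []
    for s, per_action in enumerate(trans_count_prior):
        trans.append([])
        rew.append([])
        for a, alphas in enumerate(per_action):
            # A Dirichlet sample is a set of normalized Gamma draws.
            gammas = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
            total = sum(gammas)
            trans[s].append([g / total for g in gammas])
            rew[s].append(rng.gauss(rew_mean_prior[s][a], rew_std_prior[s][a]))
    return trans, rew


def value_iteration(trans, rew, discount_factor=0.99, epsilon=0.01):
    """Repeat Bellman backups until the largest change is below epsilon."""
    value = [0.0] * len(trans)
    while True:
        q = [
            [
                rew[s][a]
                + discount_factor * sum(p * v for p, v in zip(trans[s][a], value))
                for a in range(len(trans[s]))
            ]
            for s in range(len(trans))
        ]
        new_value = [max(row) for row in q]
        if max(abs(n - o) for n, o in zip(new_value, value)) < epsilon:
            return new_value, [row.index(max(row)) for row in q]
        value = new_value


rng = random.Random(0)
counts = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]  # 2 states, 2 actions
means = [[0.0, 0.0], [0.0, 0.0]]
stds = [[1.0, 1.0], [1.0, 1.0]]
trans, rew = sample_mdp(counts, means, stds, rng)
values, greedy_policy = value_iteration(trans, rew, discount_factor=0.9)
```

In this sketch a smaller epsilon gives a more precise value function at the cost of more backups, and a discount_factor below 1 is needed for the loop to terminate in general.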
See also
Please refer to BasePolicy for more detailed explanation.
-
forward
(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, **kwargs: Any) → tianshou.data.batch.Batch[source]¶ Compute action over the given batch data with PSRL model.
- Returns
A
Batch
with “act” key containing the action.
See also
Please refer to
forward()
for more detailed explanation.
-
learn
(batch: tianshou.data.batch.Batch, *args: Any, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
To distinguish the collecting, updating and testing states, you can check the policy state via self.training and self.updating. Please refer to States for policy for a more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: a Categorical distribution gives a “[batch_size]” shape while a Normal distribution gives a “[batch_size, 1]” shape. The auto-broadcasting of numerical operations on torch tensors will amplify this error.
-
training
: bool¶
Multi-agent¶
-
class
tianshou.policy.
MultiAgentPolicyManager
(policies: List[tianshou.policy.base.BasePolicy], **kwargs: Any)[source]¶ Bases:
tianshou.policy.base.BasePolicy
Multi-agent policy manager for MARL.
This multi-agent policy manager accepts a list of BasePolicy instances. It dispatches the batch data to each of these policies when “forward” is called. The same holds for “process_fn” and “learn”: it splits the data and feeds each part to the corresponding policy. A figure in Multi-Agent Reinforcement Learning can help you better understand this procedure.
-
replace_policy
(policy: tianshou.policy.base.BasePolicy, agent_id: int) → None[source]¶ Replace the “agent_id”th policy in this manager.
-
process_fn
(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indice: numpy.ndarray) → tianshou.data.batch.Batch[source]¶ Dispatch batch data from obs.agent_id to every policy’s process_fn.
Save original multi-dimensional rew in “save_rew”, set rew to the reward of each agent during their “process_fn”, and restore the original reward afterwards.
-
exploration_noise
(act: Union[numpy.ndarray, tianshou.data.batch.Batch], batch: tianshou.data.batch.Batch) → Union[numpy.ndarray, tianshou.data.batch.Batch][source]¶ Add exploration noise from sub-policy onto act.
-
forward
(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch]] = None, **kwargs: Any) → tianshou.data.batch.Batch[source]¶ Dispatch batch data from obs.agent_id to every policy’s forward.
- Parameters
state – if None, it means all agents have no state. If not None, it should contain keys of “agent_1”, “agent_2”, …
- Returns
a Batch with the following contents:
{
    "act": actions corresponding to the input
    "state": {
        "agent_1": output state of agent_1's policy for the state
        "agent_2": xxx
        ...
        "agent_n": xxx}
    "out": {
        "agent_1": output of agent_1's policy for the input
        "agent_2": xxx
        ...
        "agent_n": xxx}
}
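The dispatch-and-merge idea behind this return value can be sketched without any Tianshou types. In the sketch below, plain lists replace Batch, each policy is an ordinary callable, and dispatch_forward is a hypothetical name; hidden-state handling is left out.

```python
def dispatch_forward(obs_list, agent_ids, policies):
    """Route each observation to its agent's policy and merge the results.

    ``policies`` maps an agent id to a callable; all names are illustrative.
    """
    acts = [None] * len(obs_list)
    out = {}
    for agent_id, policy in policies.items():
        # Positions in the batch that belong to this agent.
        indices = [i for i, aid in enumerate(agent_ids) if aid == agent_id]
        if not indices:
            out[f"agent_{agent_id}"] = []
            continue
        agent_acts = policy([obs_list[i] for i in indices])
        out[f"agent_{agent_id}"] = agent_acts
        # Write the actions back at their original batch positions.
        for i, act in zip(indices, agent_acts):
            acts[i] = act
    return {"act": acts, "out": out}


policies = {
    1: lambda obs: [o + 1 for o in obs],   # toy policy for agent 1
    2: lambda obs: [o * 10 for o in obs],  # toy policy for agent 2
}
result = dispatch_forward([1, 2, 3], agent_ids=[1, 2, 1], policies=policies)
```

The merged "act" entry keeps the original batch order, while "out" keeps each agent's results separate.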
-
learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, Union[float, List[float]]][source]¶ Dispatch the data to all policies for learning.
- Returns
a dict with the following contents:
{
    "agent_1/item1": item 1 of agent_1's policy.learn output
    "agent_1/item2": item 2 of agent_1's policy.learn output
    "agent_2/xxx": xxx
    ...
    "agent_n/xxx": xxx
}
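The key naming in this dict amounts to prefixing each sub-policy's result keys with the agent name. A minimal sketch, with merge_learn_results as a hypothetical helper and made-up loss values:

```python
def merge_learn_results(per_agent_results):
    """Flatten {agent_id: {key: value}} into {"agent_<id>/<key>": value}."""
    merged = {}
    for agent_id, result in per_agent_results.items():
        for key, value in result.items():
            merged[f"agent_{agent_id}/{key}"] = value
    return merged


merged = merge_learn_results({
    1: {"loss": 0.5},
    2: {"loss": 0.25, "loss/actor": 0.1},
})
```

Prefixing keeps entries from different agents distinct even when their sub-policies log the same key, such as "loss".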
-
training
: bool¶
-