tianshou.policy¶
Base¶
- class tianshou.policy.BasePolicy(observation_space: Optional[Space] = None, action_space: Optional[Space] = None, action_scaling: bool = False, action_bound_method: str = '', lr_scheduler: Optional[Union[LambdaLR, MultipleLRSchedulers]] = None)[source]¶
Bases: ABC, Module
The base class for any RL policy.
Tianshou aims to modularize RL algorithms, so it provides several classes of policies. All policy classes must inherit from BasePolicy.
A policy class typically has the following parts:
__init__(): initialize the policy, including copying the target network and so on;
forward(): compute action with given observation;
process_fn(): pre-process data from the replay buffer (this function can interact with the replay buffer);
learn(): update policy with a given batch of data;
post_process_fn(): update the replay buffer from the learning process (e.g., a prioritized replay buffer needs to update the weight);
update(): the main interface for training, i.e., process_fn -> learn -> post_process_fn.
Most policies need a neural network to predict the action and an optimizer to optimize the policy. The rules for self-defined networks are:
Input: observation “obs” (may be a numpy.ndarray, a torch.Tensor, a dict, or any other type), hidden state “state” (for RNN usage), and other information “info” provided by the environment.
Output: some “logits”, the next hidden state “state”, and the intermediate result of the policy forwarding procedure “policy”. The “logits” could be a tuple instead of a torch.Tensor, depending on how the policy processes the network output. For example, in PPO, the return of the network might be (mu, sigma), state for a Gaussian policy. The “policy” can be a Batch of torch.Tensor or other things, which will be stored in the replay buffer and can be accessed in the policy update process (e.g., in “policy.learn()”, “batch.policy” is what you need).
Since BasePolicy inherits torch.nn.Module, you can use BasePolicy almost the same way as torch.nn.Module, for instance, loading and saving the model:
    torch.save(policy.state_dict(), "policy.pth")
    policy.load_state_dict(torch.load("policy.pth"))
- exploration_noise(act: Union[ndarray, Batch], batch: Batch) Union[ndarray, Batch] [source]¶
Modify the action from policy.forward with exploration noise.
- Parameters
act – a data batch or numpy.ndarray which is the action taken by policy.forward.
batch – the input batch for policy.forward, kept for advanced usage.
- Returns
action in the same form of input “act” but with added exploration noise.
- soft_update(tgt: Module, src: Module, tau: float) None [source]¶
Softly update the parameters of target module towards the parameters of source module.
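As a rough illustration of what a soft update does, the sketch below applies tgt <- tau * src + (1 - tau) * tgt to plain Python lists standing in for parameter tensors. The helper name soft_update_sketch and the list-based parameters are assumptions made for this example; the real method operates on torch.nn.Module parameters.

```python
from typing import List


def soft_update_sketch(tgt: List[float], src: List[float], tau: float) -> List[float]:
    """Return target parameters moved a fraction `tau` toward the source.

    Hypothetical stand-in: plain floats replace torch parameters.
    """
    if len(tgt) != len(src):
        raise ValueError("tgt and src must have the same number of parameters")
    return [tau * s + (1.0 - tau) * t for t, s in zip(tgt, src)]


target = [0.0, 10.0]
source = [1.0, 0.0]
updated = soft_update_sketch(target, source, tau=0.25)
```

With tau = 1 the target becomes a copy of the source, and with tau = 0 it stays unchanged; small values of tau make the target network track the source slowly.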
- abstract forward(batch: Batch, state: Optional[Union[dict, Batch, ndarray]] = None, **kwargs: Any) Batch [source]¶
Compute action over the given batch data.
- Returns
A Batch which MUST have the following keys:
act – a numpy.ndarray or a torch.Tensor, the action over given batch data.
state – a dict, a numpy.ndarray or a torch.Tensor, the internal state of the policy, None as default.
Other keys are user-defined, depending on the algorithm. For example,
    # some code
    return Batch(logits=..., act=..., state=None, dist=...)
The keyword policy is reserved and the corresponding data will be stored into the replay buffer. For instance,
    # some code
    return Batch(..., policy=Batch(log_prob=dist.log_prob(act)))
    # and in the sampled data batch, you can directly use
    # batch.policy.log_prob to get your data.
Note
In continuous action space, you should do another step “map_action” to get the real action:
    act = policy(batch).act  # doesn't map to the target action range
    act = policy.map_action(act)
- map_action(act: Union[Batch, ndarray]) Union[Batch, ndarray] [source]¶
Map raw network output to action range in gym’s env.action_space.
This function is called in collect() and only affects the action sent to the env. The remapped action will not be stored in the buffer and thus can be viewed as a part of the env (a black box action transformation).
Action mapping includes 2 standard procedures: bounding and scaling. The bounding procedure expects the original action range to be (-inf, inf) and maps it to [-1, 1], while the scaling procedure expects the original action range to be (-1, 1) and maps it to [action_space.low, action_space.high]. The bounding procedure is applied first.
- Parameters
act – a data batch or numpy.ndarray which is the action taken by policy.forward.
- Returns
action in the same form as the input “act” but remapped to the target action space.
- map_action_inverse(act: Union[Batch, List, ndarray]) Union[Batch, List, ndarray] [source]¶
Inverse operation to map_action().
This function is called in collect() for random initial steps. It scales [action_space.low, action_space.high] to the value ranges of policy.forward.
- Parameters
act – a data batch, list or numpy.ndarray which is the action taken by gym.spaces.Box.sample().
- Returns
action remapped.
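The bounding and scaling steps of map_action(), and their inverse, can be sketched for a single scalar action as follows. This is a minimal sketch under stated assumptions: the function names, the scalar interface, and the explicit low/high arguments are hypothetical, and the real methods work on arrays or Batch objects using the policy's action_space.

```python
import math


def map_action_sketch(act: float, low: float, high: float, bound_method: str = "clip") -> float:
    """Bound a raw action to [-1, 1], then scale it to [low, high]."""
    if bound_method == "clip":
        act = max(-1.0, min(1.0, act))
    elif bound_method == "tanh":
        act = math.tanh(act)
    # Scaling: -1 maps to low, +1 maps to high.
    return low + (high - low) * (act + 1.0) / 2.0


def map_action_inverse_sketch(act: float, low: float, high: float, bound_method: str = "clip") -> float:
    """Undo the scaling, then undo tanh squashing if it was used."""
    act = (act - low) * 2.0 / (high - low) - 1.0
    if bound_method == "tanh":
        act = math.atanh(act)
    return act


env_action = map_action_sketch(0.5, low=0.0, high=4.0)   # already in [-1, 1], only scaled
clipped = map_action_sketch(3.0, low=0.0, high=4.0)      # clipped to 1 first, then scaled
raw_again = map_action_inverse_sketch(env_action, low=0.0, high=4.0)
```

Clipping is not invertible outside [-1, 1], so the inverse only recovers raw actions that were not clipped; tanh squashing is invertible for any finite raw action.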
- process_fn(batch: Batch, buffer: ReplayBuffer, indices: ndarray) Batch [source]¶
Pre-process the data from the provided replay buffer.
Used in update(). Check out policy.process_fn for more information.
- abstract learn(batch: Batch, **kwargs: Any) Dict[str, Any] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- post_process_fn(batch: Batch, buffer: ReplayBuffer, indices: ndarray) None [source]¶
Post-process the data from the provided replay buffer.
Typical usage is to update the sampling weight in prioritized experience replay. Used in update().
- update(sample_size: int, buffer: Optional[ReplayBuffer], **kwargs: Any) Dict[str, Any] [source]¶
Update the policy network and replay buffer.
It includes 3 function steps: process_fn, learn, and post_process_fn. In addition, this function will change the value of self.updating: it will be False before this function and will be True when executing update(). Please refer to States for policy for more detailed explanation.
- Parameters
sample_size (int) – 0 means it will extract all the data from the buffer, otherwise it will sample a batch with given sample_size.
buffer (ReplayBuffer) – the corresponding replay buffer.
- Returns
A dict, including the data needed to be logged (e.g., loss) from policy.learn().
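The order of calls inside update(), and the way self.updating is toggled around them, can be illustrated with a toy class. ToyPolicy, its list-based buffer, and the mean used as a "loss" are all hypothetical; the sketch only mirrors the control flow described above and is not the library's implementation.

```python
import random
from typing import Dict, List


class ToyPolicy:
    """Hypothetical stand-in that only mirrors the order of calls in update()."""

    def __init__(self) -> None:
        self.updating = False
        self.calls: List[str] = []

    def process_fn(self, batch: List[float]) -> List[float]:
        self.calls.append("process_fn")
        return batch

    def learn(self, batch: List[float]) -> Dict[str, float]:
        # self.updating is True while learn() runs inside update().
        self.calls.append(f"learn(updating={self.updating})")
        return {"loss": sum(batch) / len(batch)}

    def post_process_fn(self, batch: List[float]) -> None:
        self.calls.append("post_process_fn")

    def update(self, sample_size: int, buffer: List[float]) -> Dict[str, float]:
        if not buffer:
            return {}
        # sample_size == 0 means "use all the data in the buffer".
        batch = list(buffer) if sample_size == 0 else random.sample(buffer, sample_size)
        self.updating = True
        batch = self.process_fn(batch)
        result = self.learn(batch)
        self.post_process_fn(batch)
        self.updating = False
        return result


policy = ToyPolicy()
stats = policy.update(0, [1.0, 2.0, 3.0])
```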
- static value_mask(buffer: ReplayBuffer, indices: ndarray) ndarray [source]¶
Value mask determines whether the obs_next of buffer[indices] is valid.
For instance, “obs_next” after a “done” flag is usually considered invalid: its q/advantage value can provide meaningless (even misleading) information and should be set to 0 by hand. But if the “done” flag is generated because of the time limit on game length (info[“TimeLimit.truncated”] is set to True in gym’s settings), “obs_next” will instead be valid. The value mask is typically used to assist in calculating the correct q/advantage value.
- Parameters
buffer (ReplayBuffer) – the corresponding replay buffer.
indices (numpy.ndarray) – indices of replay buffer whose “obs_next” will be judged.
- Returns
A bool type numpy.ndarray with the same shape as indices. “True” means “obs_next” of that buffer[indices] is valid.
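The rule described above can be sketched on plain Python lists. This is an illustrative assumption about the logic only: the helper name and the separate done/truncated lists are hypothetical, whereas the real method reads these flags from the replay buffer.

```python
from typing import List


def value_mask_sketch(done: List[bool], truncated: List[bool]) -> List[bool]:
    """True where obs_next is still a valid state to bootstrap from."""
    return [(not d) or t for d, t in zip(done, truncated)]


done = [False, True, True]
truncated = [False, False, True]  # the last episode ended only because of the time limit
mask = value_mask_sketch(done, truncated)
```

Multiplying the estimated value of “obs_next” by this mask (as 0/1) zeroes it out exactly at true terminal states.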
- static compute_episodic_return(batch: Batch, buffer: ReplayBuffer, indices: ndarray, v_s_: Optional[Union[ndarray, Tensor]] = None, v_s: Optional[Union[ndarray, Tensor]] = None, gamma: float = 0.99, gae_lambda: float = 0.95) Tuple[ndarray, ndarray] [source]¶
Compute returns over given batch.
Use the Generalized Advantage Estimator (arXiv:1506.02438) to calculate the q/advantage value of the given batch.
- Parameters
batch (Batch) – a data batch which contains several episodes of data in sequential order. Note that the end of each finished episode in the batch should be marked by the done flag; unfinished (or collecting) episodes will be recognized by buffer.unfinished_index().
indices (numpy.ndarray) – the batch’s location in the buffer; batch is equal to buffer[indices].
v_s_ (np.ndarray) – the value function of all next states \(V(s')\).
gamma (float) – the discount factor, should be in [0, 1]. Default to 0.99.
gae_lambda (float) – the parameter for Generalized Advantage Estimation, should be in [0, 1]. Default to 0.95.
- Returns
two numpy arrays (returns, advantage), each with shape (bsz, ).
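For readers unfamiliar with GAE, the backward recursion can be sketched in plain Python. This is a simplified illustration, not the library's code: the function name gae_sketch is hypothetical, values are passed as lists, and unfinished or time-limit-truncated episodes are not handled.

```python
from typing import List, Tuple


def gae_sketch(
    rew: List[float],
    done: List[bool],
    v_s: List[float],
    v_s_: List[float],
    gamma: float = 0.99,
    gae_lambda: float = 0.95,
) -> Tuple[List[float], List[float]]:
    """Return (returns, advantages) for transitions stored in time order."""
    n = len(rew)
    advantages = [0.0] * n
    gae = 0.0
    for t in reversed(range(n)):
        not_done = 0.0 if done[t] else 1.0
        # One-step TD error; no bootstrapping past the end of an episode.
        delta = rew[t] + gamma * v_s_[t] * not_done - v_s[t]
        # Accumulation restarts at episode boundaries.
        gae = delta + gamma * gae_lambda * not_done * gae
        advantages[t] = gae
    returns = [a + v for a, v in zip(advantages, v_s)]
    return returns, advantages


returns, advantages = gae_sketch(
    rew=[1.0, 1.0, 1.0],
    done=[False, False, True],
    v_s=[0.0, 0.0, 0.0],
    v_s_=[0.0, 0.0, 0.0],
    gamma=0.5,
    gae_lambda=1.0,
)
```

With gae_lambda = 1 and zero value estimates, the advantages reduce to plain discounted returns, which makes the example easy to check by hand.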
- static compute_nstep_return(batch: Batch, buffer: ReplayBuffer, indice: ndarray, target_q_fn: Callable[[ReplayBuffer, ndarray], Tensor], gamma: float = 0.99, n_step: int = 1, rew_norm: bool = False) Batch [source]¶
Compute n-step return for Q-learning targets.
\[G_t = \sum_{i = t}^{t + n - 1} \gamma^{i - t}(1 - d_i)r_i + \gamma^n (1 - d_{t + n}) Q_{\mathrm{target}}(s_{t + n})\]
where \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\), and \(d_t\) is the done flag of step \(t\).
- Parameters
batch (Batch) – a data batch, which is equal to buffer[indice].
buffer (ReplayBuffer) – the data buffer.
target_q_fn (function) – a function which computes the target Q value of “obs_next” given the data buffer and the wanted indices.
gamma (float) – the discount factor, should be in [0, 1]. Default to 0.99.
n_step (int) – the number of estimation steps, should be an int greater than 0. Default to 1.
rew_norm (bool) – normalize the reward to Normal(0, 1). Default to False.
- Returns
a Batch. The result will be stored in batch.returns as a torch.Tensor with the same shape as target_q_fn’s return tensor.
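A minimal sketch of an n-step target for one starting step is given below. It rests on assumptions that should be read carefully: the function name and list-based inputs are hypothetical, q_next[i] is taken to be the target Q-value of “obs_next” at step i, rewards are accumulated up to and including a terminal step, bootstrapping is dropped when the episode ends inside the window, and reward normalization is omitted. It is one reading of the formula, not the library's implementation.

```python
from typing import List


def nstep_return_sketch(
    rew: List[float],
    done: List[bool],
    q_next: List[float],
    t: int,
    gamma: float = 0.99,
    n_step: int = 1,
) -> float:
    """n-step target for step t; q_next[i] is the target Q-value of obs_next at step i."""
    ret = 0.0
    last = min(t + n_step, len(rew)) - 1
    for k, i in enumerate(range(t, last + 1)):
        ret += (gamma ** k) * rew[i]
        if done[i]:
            # Episode ended inside the window: stop and do not bootstrap.
            return ret
    steps = last - t + 1
    return ret + (gamma ** steps) * q_next[last]


rew = [1.0, 1.0, 1.0]
q_next = [0.0, 8.0, 0.0]
two_step = nstep_return_sketch(rew, [False, False, False], q_next, t=0, gamma=0.5, n_step=2)
terminal = nstep_return_sketch(rew, [True, False, False], q_next, t=0, gamma=0.5, n_step=2)
```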
- training: bool¶
- class tianshou.policy.RandomPolicy(observation_space: Optional[Space] = None, action_space: Optional[Space] = None, action_scaling: bool = False, action_bound_method: str = '', lr_scheduler: Optional[Union[LambdaLR, MultipleLRSchedulers]] = None)[source]¶
Bases:
BasePolicy
A random agent used in multi-agent learning.
It randomly chooses an action from the legal actions.
- forward(batch: Batch, state: Optional[Union[dict, Batch, ndarray]] = None, **kwargs: Any) Batch [source]¶
Compute the random action over the given batch data.
The input should contain a mask in batch.obs, with “True” to be available and “False” to be unavailable. For example, batch.obs.mask == np.array([[False, True, False]]) means with batch size 1, action “1” is available but action “0” and “2” are unavailable.
- Returns
A Batch with “act” key, containing the random action.
See also
Please refer to forward() for more detailed explanation.
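Choosing a random legal action from such a mask can be sketched as follows. The helper name random_legal_actions and the nested-list mask are assumptions for illustration; the actual policy works on numpy arrays inside a Batch.

```python
import random
from typing import List


def random_legal_actions(mask: List[List[bool]], rng: random.Random) -> List[int]:
    """Pick one uniformly random legal action index for each row of the mask."""
    actions = []
    for row in mask:
        legal = [i for i, ok in enumerate(row) if ok]
        if not legal:
            raise ValueError("each row of the mask needs at least one legal action")
        actions.append(rng.choice(legal))
    return actions


rng = random.Random(0)
single = random_legal_actions([[False, True, False]], rng)
several = random_legal_actions([[True, False, True], [False, False, True]], rng)
```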
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Since a random agent learns nothing, it returns an empty dict.
- training: bool¶
Model-free¶
DQN Family¶
- class tianshou.policy.DQNPolicy(model: Module, optim: Optimizer, discount_factor: float = 0.99, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, is_double: bool = True, clip_loss_grad: bool = False, **kwargs: Any)[source]¶
Bases:
BasePolicy
Implementation of Deep Q Network. arXiv:1312.5602.
Implementation of Double Q-Learning. arXiv:1509.06461.
Implementation of Dueling DQN. arXiv:1511.06581 (the dueling DQN is implemented in the network side, not here).
- Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network). Default to 0.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
is_double (bool) – use double dqn. Default to True.
clip_loss_grad (bool) – clip the gradient of the loss in accordance with nature14236; this amounts to using the Huber loss instead of the MSE loss. Default to False.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
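To make the effect of is_double concrete, the sketch below computes a one-step TD target for a single transition both ways. The function name, the list-based Q-values, and the restriction to estimation_step = 1 are assumptions made for illustration only.

```python
from typing import List


def dqn_target_sketch(
    rew: float,
    done: bool,
    q_online_next: List[float],
    q_target_next: List[float],
    gamma: float = 0.99,
    is_double: bool = True,
) -> float:
    """One-step TD target for a single transition."""
    if done:
        return rew
    if is_double:
        # Double DQN: the online network picks the action, the target network rates it.
        best = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
        bootstrap = q_target_next[best]
    else:
        # Vanilla DQN: the target network both picks and rates the action.
        bootstrap = max(q_target_next)
    return rew + gamma * bootstrap


double_target = dqn_target_sketch(1.0, False, [0.0, 2.0], [4.0, 1.0], gamma=0.5)
vanilla_target = dqn_target_sketch(1.0, False, [0.0, 2.0], [4.0, 1.0], gamma=0.5, is_double=False)
```

In this example the vanilla target is larger because the target network's own maximum is used, which is the overestimation that Double Q-learning is meant to reduce.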
See also
Please refer to BasePolicy for more detailed explanation.
- train(mode: bool = True) DQNPolicy [source]¶
Set the module in training mode, except for the target network.
- process_fn(batch: Batch, buffer: ReplayBuffer, indices: ndarray) Batch [source]¶
Compute the n-step return for Q-learning targets.
More details can be found at compute_nstep_return().
- compute_q_value(logits: Tensor, mask: Optional[ndarray]) Tensor [source]¶
Compute the q value based on the network’s raw output and action mask.
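One possible way to combine raw Q-values with an action mask is to shift the unavailable entries below every available one, so that a greedy argmax never selects them. The sketch below shows this idea on plain lists; the helper name and the exact penalty are assumptions, and the library's own computation may differ.

```python
from typing import List, Optional


def masked_q_values(q: List[float], mask: Optional[List[bool]]) -> List[float]:
    """Push unavailable actions below every available one so argmax skips them."""
    if mask is None:
        return list(q)
    # Shifting by (min - max - 1) guarantees masked entries fall below the minimum.
    penalty = min(q) - max(q) - 1.0
    return [v if ok else v + penalty for v, ok in zip(q, mask)]


q = [5.0, 1.0, 3.0]
masked = masked_q_values(q, [False, True, False])
greedy_action = max(range(len(masked)), key=lambda a: masked[a])
```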
- forward(batch: Batch, state: Optional[Union[dict, Batch, ndarray]] = None, model: str = 'model', input: str = 'obs', **kwargs: Any) Batch [source]¶
Compute action over the given batch data.
If you need to mask the action, please add a “mask” into batch.obs, for example, if we have an environment that has “0/1/2” three actions:
    batch == Batch(
        obs=Batch(
            obs="original obs, with batch_size=1 for demonstration",
            mask=np.array([[False, True, False]]),
            # action 1 is available
            # action 0 and 2 are unavailable
        ),
        ...
    )
- Returns
A Batch which has 3 keys:
act – the action.
logits – the network’s raw output.
state – the hidden state.
See also
Please refer to forward() for more detailed explanation.
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- exploration_noise(act: Union[ndarray, Batch], batch: Batch) Union[ndarray, Batch] [source]¶
Modify the action from policy.forward with exploration noise.
- Parameters
act – a data batch or numpy.ndarray which is the action taken by policy.forward.
batch – the input batch for policy.forward, kept for advanced usage.
- Returns
action in the same form of input “act” but with added exploration noise.
- training: bool¶
- class tianshou.policy.BranchingDQNPolicy(model: BranchingNet, optim: Optimizer, discount_factor: float = 0.99, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, is_double: bool = True, **kwargs: Any)[source]¶
Bases:
DQNPolicy
Implementation of the Branching Dueling Q-Network (BDQ). arXiv:1711.08946.
- Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network). Default to 0.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
is_double (bool) – use double network. Default to True.
See also
Please refer to BasePolicy for more detailed explanation.
- process_fn(batch: Batch, buffer: ReplayBuffer, indices: ndarray) Batch [source]¶
Compute the 1-step return for BDQ targets.
- forward(batch: Batch, state: Optional[Union[Dict, Batch, ndarray]] = None, model: str = 'model', input: str = 'obs', **kwargs: Any) Batch [source]¶
Compute action over the given batch data.
If you need to mask the action, please add a “mask” into batch.obs, for example, if we have an environment that has “0/1/2” three actions:
    batch == Batch(
        obs=Batch(
            obs="original obs, with batch_size=1 for demonstration",
            mask=np.array([[False, True, False]]),
            # action 1 is available
            # action 0 and 2 are unavailable
        ),
        ...
    )
- Returns
A Batch which has 3 keys:
act – the action.
logits – the network’s raw output.
state – the hidden state.
See also
Please refer to forward() for more detailed explanation.
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- exploration_noise(act: Union[ndarray, Batch], batch: Batch) Union[ndarray, Batch] [source]¶
Modify the action from policy.forward with exploration noise.
- Parameters
act – a data batch or numpy.ndarray which is the action taken by policy.forward.
batch – the input batch for policy.forward, kept for advanced usage.
- Returns
action in the same form of input “act” but with added exploration noise.
- training: bool¶
- class tianshou.policy.C51Policy(model: Module, optim: Optimizer, discount_factor: float = 0.99, num_atoms: int = 51, v_min: float = -10.0, v_max: float = 10.0, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)[source]¶
Bases:
DQNPolicy
Implementation of Categorical Deep Q-Network. arXiv:1707.06887.
- Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
num_atoms (int) – the number of atoms in the support set of the value distribution. Default to 51.
v_min (float) – the value of the smallest atom in the support set. Default to -10.0.
v_max (float) – the value of the largest atom in the support set. Default to 10.0.
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network). Default to 0.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
See also
Please refer to DQNPolicy for more detailed explanation.
- compute_q_value(logits: Tensor, mask: Optional[ndarray]) Tensor [source]¶
Compute the q value based on the network’s raw output and action mask.
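For a categorical value distribution, the Q-value of an action is the expectation over the atoms of the support set. The sketch below shows this for one action using plain lists; the function name is hypothetical, and the evenly spaced support between v_min and v_max is assumed from the parameter descriptions above.

```python
from typing import List


def c51_q_value(probs: List[float], v_min: float = -10.0, v_max: float = 10.0) -> float:
    """Expected return under a categorical distribution over evenly spaced atoms."""
    num_atoms = len(probs)
    if num_atoms < 2:
        raise ValueError("need at least two atoms")
    step = (v_max - v_min) / (num_atoms - 1)
    support = [v_min + i * step for i in range(num_atoms)]
    return sum(p * z for p, z in zip(probs, support))


q_certain = c51_q_value([0.0, 0.0, 1.0], v_min=-1.0, v_max=1.0)
q_symmetric = c51_q_value([0.25, 0.5, 0.25], v_min=-1.0, v_max=1.0)
```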
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- training: bool¶
- class tianshou.policy.RainbowPolicy(model: Module, optim: Optimizer, discount_factor: float = 0.99, num_atoms: int = 51, v_min: float = -10.0, v_max: float = 10.0, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)[source]¶
Bases:
C51Policy
Implementation of Rainbow DQN. arXiv:1710.02298.
- Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
num_atoms (int) – the number of atoms in the support set of the value distribution. Default to 51.
v_min (float) – the value of the smallest atom in the support set. Default to -10.0.
v_max (float) – the value of the largest atom in the support set. Default to 10.0.
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network). Default to 0.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
See also
Please refer to C51Policy for more detailed explanation.
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- training: bool¶
- class tianshou.policy.QRDQNPolicy(model: Module, optim: Optimizer, discount_factor: float = 0.99, num_quantiles: int = 200, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)[source]¶
Bases:
DQNPolicy
Implementation of Quantile Regression Deep Q-Network. arXiv:1710.10044.
- Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
num_quantiles (int) – the number of quantile midpoints in the inverse cumulative distribution function of the value. Default to 200.
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network).
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
See also
Please refer to DQNPolicy for more detailed explanation.
- compute_q_value(logits: Tensor, mask: Optional[ndarray]) Tensor [source]¶
Compute the q value based on the network’s raw output and action mask.
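In quantile regression, each action's value distribution is represented by return estimates at N quantile midpoints, and a Q-value can be obtained by averaging them. The sketch below illustrates both pieces on plain lists; the function names are hypothetical and this is an assumption about the general technique rather than a copy of the library's code.

```python
from typing import List


def quantile_midpoints(num_quantiles: int) -> List[float]:
    """Midpoints (2i + 1) / (2N) of N equal-width probability bins."""
    return [(2 * i + 1) / (2 * num_quantiles) for i in range(num_quantiles)]


def qr_q_value(quantile_values: List[float]) -> float:
    """Average the per-quantile return estimates of one action."""
    return sum(quantile_values) / len(quantile_values)


taus = quantile_midpoints(4)
q_value = qr_q_value([0.0, 1.0, 2.0, 5.0])
```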
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- training: bool¶
- class tianshou.policy.IQNPolicy(model: Module, optim: Optimizer, discount_factor: float = 0.99, sample_size: int = 32, online_sample_size: int = 8, target_sample_size: int = 8, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)[source]¶
Bases:
QRDQNPolicy
Implementation of Implicit Quantile Network. arXiv:1806.06923.
- Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
sample_size (int) – the number of samples for policy evaluation. Default to 32.
online_sample_size (int) – the number of samples for online model in training. Default to 8.
target_sample_size (int) – the number of samples for target model in training. Default to 8.
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network).
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
See also
Please refer to QRDQNPolicy for more detailed explanation.
- forward(batch: Batch, state: Optional[Union[dict, Batch, ndarray]] = None, model: str = 'model', input: str = 'obs', **kwargs: Any) Batch [source]¶
Compute action over the given batch data.
If you need to mask the action, please add a “mask” into batch.obs, for example, if we have an environment that has “0/1/2” three actions:
    batch == Batch(
        obs=Batch(
            obs="original obs, with batch_size=1 for demonstration",
            mask=np.array([[False, True, False]]),
            # action 1 is available
            # action 0 and 2 are unavailable
        ),
        ...
    )
- Returns
A Batch which has 3 keys:
act – the action.
logits – the network’s raw output.
state – the hidden state.
See also
Please refer to forward() for more detailed explanation.
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- training: bool¶
- class tianshou.policy.FQFPolicy(model: FullQuantileFunction, optim: Optimizer, fraction_model: FractionProposalNetwork, fraction_optim: Optimizer, discount_factor: float = 0.99, num_fractions: int = 32, ent_coef: float = 0.0, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)[source]¶
Bases:
QRDQNPolicy
Implementation of Fully-parameterized Quantile Function. arXiv:1911.02140.
- Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
fraction_model (FractionProposalNetwork) – a FractionProposalNetwork for proposing fractions/quantiles given state.
fraction_optim (torch.optim.Optimizer) – a torch.optim for optimizing the fraction model above.
discount_factor (float) – in [0, 1].
num_fractions (int) – the number of fractions to use. Default to 32.
ent_coef (float) – the coefficient for entropy loss. Default to 0.
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network).
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
See also
Please refer to QRDQNPolicy for more detailed explanation.
- forward(batch: Batch, state: Optional[Union[dict, Batch, ndarray]] = None, model: str = 'model', input: str = 'obs', fractions: Optional[Batch] = None, **kwargs: Any) Batch [source]¶
Compute action over the given batch data.
If you need to mask the action, please add a “mask” into batch.obs, for example, if we have an environment that has “0/1/2” three actions:
    batch == Batch(
        obs=Batch(
            obs="original obs, with batch_size=1 for demonstration",
            mask=np.array([[False, True, False]]),
            # action 1 is available
            # action 0 and 2 are unavailable
        ),
        ...
    )
- Returns
A Batch which has 3 keys:
act – the action.
logits – the network’s raw output.
state – the hidden state.
See also
Please refer to forward() for more detailed explanation.
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- training: bool¶
On-policy¶
- class tianshou.policy.PGPolicy(model: Module, optim: Optimizer, dist_fn: Type[Distribution], discount_factor: float = 0.99, reward_normalization: bool = False, action_scaling: bool = True, action_bound_method: str = 'clip', deterministic_eval: bool = False, **kwargs: Any)[source]¶
Bases:
BasePolicy
Implementation of REINFORCE algorithm.
- Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
dist_fn (Type[torch.distributions.Distribution]) – distribution class for computing the action.
discount_factor (float) – in [0, 1]. Default to 0.99.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_space.low, action_space.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
See also
Please refer to BasePolicy for more detailed explanation.
- process_fn(batch: Batch, buffer: ReplayBuffer, indices: ndarray) Batch [source]¶
Compute the discounted returns for each transition.
\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]
where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).
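The discounted return for every step of one finished episode can be computed with a single backward pass, using G_t = r_t + gamma * G_{t+1}. The sketch below is a plain-Python illustration; the function name is hypothetical and episode boundaries inside a larger batch are not handled.

```python
from typing import List


def discounted_returns(rew: List[float], gamma: float = 0.99) -> List[float]:
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one finished episode."""
    returns = [0.0] * len(rew)
    running = 0.0
    for t in reversed(range(len(rew))):
        running = rew[t] + gamma * running
        returns[t] = running
    return returns


episode_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```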
- forward(batch: Batch, state: Optional[Union[dict, Batch, ndarray]] = None, **kwargs: Any) Batch [source]¶
Compute action over the given batch data.
- Returns
A Batch which has 4 keys:
act – the action.
logits – the network’s raw output.
dist – the action distribution.
state – the hidden state.
See also
Please refer to forward() for more detailed explanation.
- learn(batch: Batch, batch_size: int, repeat: int, **kwargs: Any) Dict[str, List[float]] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
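The shape pitfall in the warning above can be reproduced without any distribution object. The sketch below uses NumPy arrays as stand-ins for torch tensors (torch documents its broadcasting as following NumPy's rules); the variable names and the batch size of 4 are illustrative assumptions.

```python
import numpy as np

batch_size = 4
advantage = np.ones(batch_size)               # shape (4,)
log_prob_cat = np.zeros(batch_size)           # shape (4,), like a Categorical log_prob
log_prob_normal = np.zeros((batch_size, 1))   # shape (4, 1), like the Normal case above

ok = log_prob_cat * advantage                 # element-wise, shape (4,)
bad = log_prob_normal * advantage             # silently broadcasts to shape (4, 4)
fixed = log_prob_normal.reshape(batch_size) * advantage  # shape (4,) again

print(ok.shape, bad.shape, fixed.shape)  # (4,) (4, 4) (4,)
```

Taking the mean of the (4, 4) product still yields a scalar loss, which is why the mistake is easy to miss.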
- training: bool¶
- class tianshou.policy.NPGPolicy(actor: Module, critic: Module, optim: Optimizer, dist_fn: Type[Distribution], advantage_normalization: bool = True, optim_critic_iters: int = 5, actor_step_size: float = 0.5, **kwargs: Any)[source]¶
Bases:
A2CPolicy
Implementation of Natural Policy Gradient.
https://proceedings.neurips.cc/paper/2001/file/4b86abe48d358ecf194c56c69108433e-Paper.pdf
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critic (torch.nn.Module) – the critic network. (s -> V(s))
optim (torch.optim.Optimizer) – the optimizer for actor and critic network.
dist_fn (Type[torch.distributions.Distribution]) – distribution class for computing the action.
advantage_normalization (bool) – whether to do per mini-batch advantage normalization. Default to True.
optim_critic_iters (int) – Number of times to optimize critic network per update. Default to 5.
gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation. Default to 0.95.
reward_normalization (bool) – normalize estimated values to have std close to 1. Default to False.
max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 256.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
- process_fn(batch: Batch, buffer: ReplayBuffer, indices: ndarray) Batch [source]¶
Compute the discounted returns for each transition.
\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]
where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).
- learn(batch: Batch, batch_size: int, repeat: int, **kwargs: Any) Dict[str, List[float]] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- training: bool¶
- class tianshou.policy.A2CPolicy(actor: Module, critic: Module, optim: Optimizer, dist_fn: Type[Distribution], vf_coef: float = 0.5, ent_coef: float = 0.01, max_grad_norm: Optional[float] = None, gae_lambda: float = 0.95, max_batchsize: int = 256, **kwargs: Any)[source]¶
Bases:
PGPolicy
Implementation of Synchronous Advantage Actor-Critic. arXiv:1602.01783.
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critic (torch.nn.Module) – the critic network. (s -> V(s))
optim (torch.optim.Optimizer) – the optimizer for actor and critic network.
dist_fn (Type[torch.distributions.Distribution]) – distribution class for computing the action.
discount_factor (float) – in [0, 1]. Default to 0.99.
vf_coef (float) – weight for value loss. Default to 0.5.
ent_coef (float) – weight for entropy loss. Default to 0.01.
max_grad_norm (float) – clipping gradients in back propagation. Default to None.
gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation. Default to 0.95.
reward_normalization (bool) – normalize estimated values to have std close to 1. Default to False.
max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 256.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
See also
Please refer to BasePolicy for more detailed explanation.
- process_fn(batch: Batch, buffer: ReplayBuffer, indices: ndarray) Batch [source]¶
Compute the discounted returns for each transition.
\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]
where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).
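The gae_lambda parameter of A2CPolicy refers to Generalized Advantage Estimation. Below is a minimal single-trajectory sketch in plain Python, assuming no episode boundary inside the trajectory. The function name gae_advantages and the last_value bootstrap argument are illustrative and not part of Tianshou.

```python
from typing import List


def gae_advantages(
    rewards: List[float],
    values: List[float],
    last_value: float = 0.0,
    gamma: float = 0.99,
    gae_lambda: float = 0.95,
) -> List[float]:
    """Generalized Advantage Estimation for a single trajectory."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
        delta = rewards[t] + gamma * next_value - values[t]
        # Recursive form: A_t = delta_t + gamma * lambda * A_{t+1}.
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With gae_lambda=1 and a zero value function this reduces to the discounted return; with gae_lambda=0 it reduces to the one-step TD error.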
- learn(batch: Batch, batch_size: int, repeat: int, **kwargs: Any) Dict[str, List[float]] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- training: bool¶
- class tianshou.policy.TRPOPolicy(actor: Module, critic: Module, optim: Optimizer, dist_fn: Type[Distribution], max_kl: float = 0.01, backtrack_coeff: float = 0.8, max_backtracks: int = 10, **kwargs: Any)[source]¶
Bases:
NPGPolicy
Implementation of Trust Region Policy Optimization. arXiv:1502.05477.
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critic (torch.nn.Module) – the critic network. (s -> V(s))
optim (torch.optim.Optimizer) – the optimizer for actor and critic network.
dist_fn (Type[torch.distributions.Distribution]) – distribution class for computing the action.
advantage_normalization (bool) – whether to do per mini-batch advantage normalization. Default to True.
optim_critic_iters (int) – Number of times to optimize critic network per update. Default to 5.
max_kl (float) – max kl-divergence used to constrain each actor network update. Default to 0.01.
backtrack_coeff (float) – Coefficient to be multiplied by step size when constraints are not met. Default to 0.8.
max_backtracks (int) – Max number of backtracking times in linesearch. Default to 10.
gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation. Default to 0.95.
reward_normalization (bool) – normalize estimated values to have std close to 1. Default to False.
max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 256.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
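The roles of max_kl, backtrack_coeff and max_backtracks can be illustrated with a generic backtracking line search. This is a sketch under assumptions, not Tianshou's implementation: the callables kl_at and improvement_at are hypothetical stand-ins for evaluating the KL divergence and the surrogate improvement at a candidate step size, and the initial step size of 1.0 is also an assumption.

```python
from typing import Callable


def backtracking_step_size(
    kl_at: Callable[[float], float],
    improvement_at: Callable[[float], float],
    max_kl: float = 0.01,
    backtrack_coeff: float = 0.8,
    max_backtracks: int = 10,
) -> float:
    """Return the first step size that satisfies both constraints, or 0.0."""
    step_size = 1.0
    for _ in range(max_backtracks):
        # Accept the step only if the KL constraint holds and the surrogate improves.
        if kl_at(step_size) <= max_kl and improvement_at(step_size) > 0.0:
            return step_size
        # Otherwise shrink the step and try again.
        step_size *= backtrack_coeff
    return 0.0  # no acceptable step found; keep the old parameters
```

For example, with kl_at = lambda s: 0.02 * s * s and improvement_at = lambda s: s, the first two candidates (1.0 and 0.8) violate the KL bound and the third (about 0.64) is accepted.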
- learn(batch: Batch, batch_size: int, repeat: int, **kwargs: Any) Dict[str, List[float]] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- training: bool¶
- class tianshou.policy.PPOPolicy(actor: Module, critic: Module, optim: Optimizer, dist_fn: Type[Distribution], eps_clip: float = 0.2, dual_clip: Optional[float] = None, value_clip: bool = False, advantage_normalization: bool = True, recompute_advantage: bool = False, **kwargs: Any)[source]¶
Bases:
A2CPolicy
Implementation of Proximal Policy Optimization. arXiv:1707.06347.
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critic (torch.nn.Module) – the critic network. (s -> V(s))
optim (torch.optim.Optimizer) – the optimizer for actor and critic network.
dist_fn (Type[torch.distributions.Distribution]) – distribution class for computing the action.
discount_factor (float) – in [0, 1]. Default to 0.99.
eps_clip (float) – \(\epsilon\) in \(L_{CLIP}\) in the original paper. Default to 0.2.
dual_clip (float) – a parameter c mentioned in arXiv:1912.09729 Equ. 5, where c > 1 is a constant indicating the lower bound. Default to None (dual clipping is not used), as in the signature above; 5.0 is a suggested value when enabling it.
value_clip (bool) – a parameter mentioned in arXiv:1811.02553v3 Sec. 4.1. Default to False, as in the signature above.
advantage_normalization (bool) – whether to do per mini-batch advantage normalization. Default to True.
recompute_advantage (bool) – whether to recompute advantage every update repeat according to https://arxiv.org/pdf/2006.05990.pdf Sec. 3.5. Default to False.
vf_coef (float) – weight for value loss. Default to 0.5.
ent_coef (float) – weight for entropy loss. Default to 0.01.
max_grad_norm (float) – clipping gradients in back propagation. Default to None.
gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation. Default to 0.95.
reward_normalization (bool) – normalize estimated values to have std close to 1, also normalize the advantage to Normal(0, 1). Default to False.
max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 256.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
See also
Please refer to BasePolicy for more detailed explanation.
- process_fn(batch: Batch, buffer: ReplayBuffer, indices: ndarray) Batch [source]¶
Compute the discounted returns for each transition.
\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]
where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).
- learn(batch: Batch, batch_size: int, repeat: int, **kwargs: Any) Dict[str, List[float]] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- training: bool¶
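The eps_clip and dual_clip parameters of PPOPolicy above can be illustrated per sample. The sketch below works on scalars and follows the clipped surrogate from the PPO paper plus the dual-clip lower bound from arXiv:1912.09729. It is not Tianshou's code, and ppo_clip_objective is an illustrative name.

```python
from typing import Optional


def ppo_clip_objective(
    ratio: float,
    advantage: float,
    eps_clip: float = 0.2,
    dual_clip: Optional[float] = None,
) -> float:
    """Per-sample clipped surrogate objective (to be maximized)."""
    clipped_ratio = max(1.0 - eps_clip, min(ratio, 1.0 + eps_clip))
    objective = min(ratio * advantage, clipped_ratio * advantage)
    # Dual clip only affects samples with a negative advantage: it bounds the
    # objective from below by dual_clip * advantage.
    if dual_clip is not None and advantage < 0.0:
        objective = max(objective, dual_clip * advantage)
    return objective
```

With ratio=10 and advantage=-1 the plain objective is -10, while dual_clip=5.0 limits it to -5, which keeps a single off-policy sample from dominating the gradient.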
Off-policy¶
- class tianshou.policy.DDPGPolicy(actor: Optional[Module], actor_optim: Optional[Optimizer], critic: Optional[Module], critic_optim: Optional[Optimizer], tau: float = 0.005, gamma: float = 0.99, exploration_noise: Optional[BaseNoise] = GaussianNoise(sigma=0.1), reward_normalization: bool = False, estimation_step: int = 1, action_scaling: bool = True, action_bound_method: str = 'clip', **kwargs: Any)[source]¶
Bases:
BasePolicy
Implementation of Deep Deterministic Policy Gradient. arXiv:1509.02971.
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critic (torch.nn.Module) – the critic network. (s, a -> Q(s, a))
critic_optim (torch.optim.Optimizer) – the optimizer for critic network.
tau (float) – param for soft update of the target network. Default to 0.005.
gamma (float) – discount factor, in [0, 1]. Default to 0.99.
exploration_noise (BaseNoise) – the exploration noise, added to the action. Default to GaussianNoise(sigma=0.1).
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
estimation_step (int) – the number of steps to look ahead. Default to 1.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action) or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
See also
Please refer to BasePolicy for more detailed explanation.
- train(mode: bool = True) DDPGPolicy [source]¶
Set the module in training mode, except for the target network.
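The tau parameter above controls the soft update of the target network. The sketch below shows the usual Polyak-averaging form on plain lists of floats instead of network parameters. soft_update is an illustrative name, not a Tianshou function.

```python
from typing import List


def soft_update(target: List[float], source: List[float], tau: float = 0.005) -> List[float]:
    """Polyak averaging: target <- tau * source + (1 - tau) * target."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]


print(soft_update([0.0, 1.0], [1.0, 1.0], tau=0.5))  # [0.5, 1.0]
```

A small tau such as 0.005 makes the target network trail the online network slowly, which stabilizes the bootstrapped Q-targets.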
- process_fn(batch: Batch, buffer: ReplayBuffer, indices: ndarray) Batch [source]¶
Pre-process the data from the provided replay buffer.
Used in update(). Check out policy.process_fn for more information.
- forward(batch: Batch, state: Optional[Union[dict, Batch, ndarray]] = None, model: str = 'actor', input: str = 'obs', **kwargs: Any) Batch [source]¶
Compute action over the given batch data.
- Returns
A Batch which has 2 keys:
act: the action.
state: the hidden state.
See also
Please refer to forward() for more detailed explanation.
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- exploration_noise(act: Union[ndarray, Batch], batch: Batch) Union[ndarray, Batch] [source]¶
Modify the action from policy.forward with exploration noise.
- Parameters
act – a data batch or numpy.ndarray which is the action taken by policy.forward.
batch – the input batch for policy.forward, kept for advanced usage.
- Returns
action in the same form of input “act” but with added exploration noise.
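As a rough illustration of what adding exploration noise means, the sketch below perturbs a plain list of action values with zero-mean Gaussian noise. It is standalone Python rather than Tianshou's implementation: add_gaussian_noise and its arguments are hypothetical names, sigma=0.1 mirrors the GaussianNoise(sigma=0.1) default mentioned above, and bounding the result is assumed to be left to the separate action-bounding step.

```python
import random
from typing import List


def add_gaussian_noise(act: List[float], sigma: float = 0.1) -> List[float]:
    """Add independent zero-mean Gaussian noise to each action dimension."""
    return [a + random.gauss(0.0, sigma) for a in act]


random.seed(0)
noisy_act = add_gaussian_noise([0.0, 0.5])
print(len(noisy_act))  # 2
```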
- training: bool¶
- class tianshou.policy.TD3Policy(actor: Module, actor_optim: Optimizer, critic1: Module, critic1_optim: Optimizer, critic2: Module, critic2_optim: Optimizer, tau: float = 0.005, gamma: float = 0.99, exploration_noise: Optional[BaseNoise] = GaussianNoise(sigma=0.1), policy_noise: float = 0.2, update_actor_freq: int = 2, noise_clip: float = 0.5, reward_normalization: bool = False, estimation_step: int = 1, **kwargs: Any)[source]¶
Bases:
DDPGPolicy
Implementation of TD3, arXiv:1802.09477.
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critic1 (torch.nn.Module) – the first critic network. (s, a -> Q(s, a))
critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.
critic2 (torch.nn.Module) – the second critic network. (s, a -> Q(s, a))
critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.
tau (float) – param for soft update of the target network. Default to 0.005.
gamma (float) – discount factor, in [0, 1]. Default to 0.99.
exploration_noise (BaseNoise) – the exploration noise, added to the action. Default to GaussianNoise(sigma=0.1).
policy_noise (float) – the noise used in updating policy network. Default to 0.2.
update_actor_freq (int) – the update frequency of actor network. Default to 2.
noise_clip (float) – the clipping range used in updating policy network. Default to 0.5.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action) or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
See also
Please refer to BasePolicy for more detailed explanation.
- train(mode: bool = True) TD3Policy [source]¶
Set the module in training mode, except for the target network.
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- training: bool¶
- actor: torch.nn.Module¶
- actor_optim: torch.optim.Optimizer¶
- critic: torch.nn.Module¶
- critic_optim: torch.optim.Optimizer¶
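The TD3Policy parameters policy_noise, noise_clip and update_actor_freq correspond to target policy smoothing and delayed policy updates in the TD3 paper. The scalar sketch below rests on that assumption. Both function names are illustrative and not Tianshou APIs, and treating policy_noise as the standard deviation of the Gaussian noise is also an assumption.

```python
import random


def smoothed_target_action(
    target_act: float,
    policy_noise: float = 0.2,
    noise_clip: float = 0.5,
) -> float:
    """Perturb the target actor's action with clipped Gaussian noise."""
    noise = random.gauss(0.0, policy_noise)
    noise = max(-noise_clip, min(noise_clip, noise))
    return target_act + noise


def should_update_actor(update_count: int, update_actor_freq: int = 2) -> bool:
    """Delayed policy update: train the actor once every update_actor_freq critic updates."""
    return update_count % update_actor_freq == 0
```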
- class tianshou.policy.SACPolicy(actor: Module, actor_optim: Optimizer, critic1: Module, critic1_optim: Optimizer, critic2: Module, critic2_optim: Optimizer, tau: float = 0.005, gamma: float = 0.99, alpha: Union[float, Tuple[float, Tensor, Optimizer]] = 0.2, reward_normalization: bool = False, estimation_step: int = 1, exploration_noise: Optional[BaseNoise] = None, deterministic_eval: bool = True, **kwargs: Any)[source]¶
Bases:
DDPGPolicy
Implementation of Soft Actor-Critic. arXiv:1812.05905.
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critic1 (torch.nn.Module) – the first critic network. (s, a -> Q(s, a))
critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.
critic2 (torch.nn.Module) – the second critic network. (s, a -> Q(s, a))
critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.
tau (float) – param for soft update of the target network. Default to 0.005.
gamma (float) – discount factor, in [0, 1]. Default to 0.99.
alpha ((float, torch.Tensor, torch.optim.Optimizer) or float) – entropy regularization coefficient. Default to 0.2. If a tuple (target_entropy, log_alpha, alpha_optim) is provided, then alpha is automatically tuned.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
exploration_noise (BaseNoise) – add a noise to action for exploration. Default to None. This is useful when solving hard-exploration problem.
deterministic_eval (bool) – whether to use deterministic action (mean of Gaussian policy) instead of stochastic action sampled by the policy. Default to True.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action) or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
See also
Please refer to BasePolicy for more detailed explanation.
- actor: torch.nn.Module¶
- actor_optim: torch.optim.Optimizer¶
- train(mode: bool = True) SACPolicy [source]¶
Set the module in training mode, except for the target network.
- forward(batch: Batch, state: Optional[Union[dict, Batch, ndarray]] = None, input: str = 'obs', **kwargs: Any) Batch [source]¶
Compute action over the given batch data.
- Returns
A Batch which has 2 keys:
act: the action.
state: the hidden state.
See also
Please refer to forward() for more detailed explanation.
- training: bool¶
- critic: torch.nn.Module¶
- critic_optim: torch.optim.Optimizer¶
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- class tianshou.policy.REDQPolicy(actor: Module, actor_optim: Optimizer, critics: Module, critics_optim: Optimizer, ensemble_size: int = 10, subset_size: int = 2, tau: float = 0.005, gamma: float = 0.99, alpha: Union[float, Tuple[float, Tensor, Optimizer]] = 0.2, reward_normalization: bool = False, estimation_step: int = 1, actor_delay: int = 20, exploration_noise: Optional[BaseNoise] = None, deterministic_eval: bool = True, target_mode: str = 'min', **kwargs: Any)[source]¶
Bases:
DDPGPolicy
Implementation of REDQ. arXiv:2101.05982.
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critics (torch.nn.Module) – critic ensemble networks.
critics_optim (torch.optim.Optimizer) – the optimizer for the critic networks.
ensemble_size (int) – Number of sub-networks in the critic ensemble. Default to 10.
subset_size (int) – Number of networks in the subset. Default to 2.
tau (float) – param for soft update of the target network. Default to 0.005.
gamma (float) – discount factor, in [0, 1]. Default to 0.99.
alpha ((float, torch.Tensor, torch.optim.Optimizer) or float) – entropy regularization coefficient. Default to 0.2. If a tuple (target_entropy, log_alpha, alpha_optim) is provided, then alpha is automatically tuned.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
actor_delay (int) – Number of critic updates before an actor update. Default to 20.
exploration_noise (BaseNoise) – add a noise to action for exploration. Default to None. This is useful when solving hard-exploration problem.
deterministic_eval (bool) – whether to use deterministic action (mean of Gaussian policy) instead of stochastic action sampled by the policy. Default to True.
target_mode (str) – methods to integrate critic values in the subset, currently support minimum and average. Default to min.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action) or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
See also
Please refer to BasePolicy for more detailed explanation.
- actor: torch.nn.Module¶
- actor_optim: torch.optim.Optimizer¶
- train(mode: bool = True) REDQPolicy [source]¶
Set the module in training mode, except for the target network.
- forward(batch: Batch, state: Optional[Union[dict, Batch, ndarray]] = None, input: str = 'obs', **kwargs: Any) Batch [source]¶
Compute action over the given batch data.
- Returns
A Batch which has 2 keys:
act: the action.
state: the hidden state.
See also
Please refer to forward() for more detailed explanation.
- training: bool¶
- critic: torch.nn.Module¶
- critic_optim: torch.optim.Optimizer¶
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
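The REDQPolicy parameters ensemble_size, subset_size and target_mode describe how critic values are combined into one target. The sketch below follows the parameter descriptions above, using one scalar Q-value per ensemble member. It is not Tianshou's implementation: redq_target_value is an illustrative name, and the string "mean" for the averaging mode is an assumption.

```python
import random
from typing import List


def redq_target_value(
    q_values: List[float],
    subset_size: int = 2,
    target_mode: str = "min",
) -> float:
    """Combine a random subset of the ensemble's target Q-values."""
    # q_values holds one estimate per ensemble member (ensemble_size entries).
    subset = random.sample(q_values, subset_size)
    if target_mode == "min":
        return min(subset)
    if target_mode == "mean":
        return sum(subset) / len(subset)
    raise ValueError(f"unsupported target_mode: {target_mode}")
```

Taking the minimum over a small random subset gives a pessimistic target that counteracts Q-value overestimation.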
- class tianshou.policy.DiscreteSACPolicy(actor: Module, actor_optim: Optimizer, critic1: Module, critic1_optim: Optimizer, critic2: Module, critic2_optim: Optimizer, tau: float = 0.005, gamma: float = 0.99, alpha: Union[float, Tuple[float, Tensor, Optimizer]] = 0.2, reward_normalization: bool = False, estimation_step: int = 1, **kwargs: Any)[source]¶
Bases:
SACPolicy
Implementation of SAC for Discrete Action Settings. arXiv:1910.07207.
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critic1 (torch.nn.Module) – the first critic network. (s -> Q(s))
critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.
critic2 (torch.nn.Module) – the second critic network. (s -> Q(s))
critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.
tau (float) – param for soft update of the target network. Default to 0.005.
gamma (float) – discount factor, in [0, 1]. Default to 0.99.
alpha ((float, torch.Tensor, torch.optim.Optimizer) or float) – entropy regularization coefficient. Default to 0.2. If a tuple (target_entropy, log_alpha, alpha_optim) is provided, the alpha is automatically tuned.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
See also
Please refer to BasePolicy for more detailed explanation.
- forward(batch: Batch, state: Optional[Union[dict, Batch, ndarray]] = None, input: str = 'obs', **kwargs: Any) Batch [source]¶
Compute action over the given batch data.
- Returns
A Batch which has 2 keys:
act: the action.
state: the hidden state.
See also
Please refer to forward() for more detailed explanation.
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- exploration_noise(act: Union[ndarray, Batch], batch: Batch) Union[ndarray, Batch] [source]¶
Modify the action from policy.forward with exploration noise.
- Parameters
act – a data batch or numpy.ndarray which is the action taken by policy.forward.
batch – the input batch for policy.forward, kept for advanced usage.
- Returns
action in the same form of input “act” but with added exploration noise.
- training: bool¶
- actor: torch.nn.Module¶
- actor_optim: torch.optim.Optimizer¶
- critic: torch.nn.Module¶
- critic_optim: torch.optim.Optimizer¶
Imitation¶
- class tianshou.policy.ImitationPolicy(model: Module, optim: Optimizer, **kwargs: Any)[source]¶
Bases:
BasePolicy
Implementation of vanilla imitation learning.
- Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> a)
optim (torch.optim.Optimizer) – for optimizing the model.
action_space (gym.Space) – env’s action space.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
See also
Please refer to BasePolicy for more detailed explanation.
- forward(batch: Batch, state: Optional[Union[dict, Batch, ndarray]] = None, **kwargs: Any) Batch [source]¶
Compute action over the given batch data.
- Returns
A Batch which MUST have the following keys:
act – a numpy.ndarray or a torch.Tensor, the action over given batch data.
state – a dict, a numpy.ndarray or a torch.Tensor, the internal state of the policy, None as default.
Other keys are user-defined. It depends on the algorithm. For example,
    # some code
    return Batch(logits=..., act=..., state=None, dist=...)
The keyword policy is reserved and the corresponding data will be stored into the replay buffer. For instance,
    # some code
    return Batch(..., policy=Batch(log_prob=dist.log_prob(act)))
    # and in the sampled data batch, you can directly use
    # batch.policy.log_prob to get your data.
Note
In continuous action space, you should do another step “map_action” to get the real action:
    act = policy(batch).act  # doesn't map to the target action range
    act = policy.map_action(act, batch)
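As a rough illustration of what mapping to the target action range involves, the sketch below clips a raw action to [-1, 1] and rescales it linearly to an environment's bounds. The function `scale_action` and the bounds are hypothetical; this shows "clip" bounding followed by linear scaling on plain lists, and is not Tianshou's `map_action` itself.

```python
def scale_action(act, low, high):
    """Clip each raw action component to [-1, 1], then map it to [low, high]."""
    scaled = []
    for a, lo, hi in zip(act, low, high):
        a = max(-1.0, min(1.0, a))                       # "clip" bounding
        scaled.append(lo + (hi - lo) * (a + 1.0) / 2.0)  # linear rescaling
    return scaled


raw_action = [-1.0, 0.0, 2.5]  # the last component lies outside [-1, 1]
print(scale_action(raw_action, low=[0.0, 0.0, 0.0], high=[10.0, 10.0, 10.0]))
# → [0.0, 5.0, 10.0]
```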
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- training: bool¶
- class tianshou.policy.BCQPolicy(actor: Module, actor_optim: Optimizer, critic1: Module, critic1_optim: Optimizer, critic2: Module, critic2_optim: Optimizer, vae: VAE, vae_optim: Optimizer, device: Union[str, device] = 'cpu', gamma: float = 0.99, tau: float = 0.005, lmbda: float = 0.75, forward_sampled_times: int = 100, num_sampled_action: int = 10, **kwargs: Any)[source]¶
Bases:
BasePolicy
Implementation of BCQ algorithm. arXiv:1812.02900.
- Parameters
actor (Perturbation) – the actor perturbation. (s, a -> perturbed a)
actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critic1 (torch.nn.Module) – the first critic network. (s, a -> Q(s, a))
critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.
critic2 (torch.nn.Module) – the second critic network. (s, a -> Q(s, a))
critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.
vae (VAE) – the VAE network, generating actions similar to those in batch. (s, a -> generated a)
vae_optim (torch.optim.Optimizer) – the optimizer for the VAE network.
device (Union[str, torch.device]) – which device to create this model on. Default to “cpu”.
gamma (float) – discount factor, in [0, 1]. Default to 0.99.
tau (float) – param for soft update of the target network. Default to 0.005.
lmbda (float) – param for Clipped Double Q-learning. Default to 0.75.
forward_sampled_times (int) – the number of sampled actions in forward function. The policy samples many actions and takes the action with the max value. Default to 100.
num_sampled_action (int) – the number of sampled actions in calculating target Q. The algorithm samples several actions using VAE, and perturbs each action to get the target Q. Default to 10.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
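The soft update controlled by tau is commonly implemented as Polyak averaging of the target network's parameters. A minimal sketch on plain lists of floats, assuming that convention (the helper `soft_update` is hypothetical, and real parameters would be tensors):

```python
def soft_update(target, source, tau):
    """Move each target parameter a fraction `tau` toward its source value."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]


target_params = [0.0, 1.0]
online_params = [1.0, 3.0]

# With tau = 0.005 the target network trails the online network slowly.
for _ in range(3):
    target_params = soft_update(target_params, online_params, tau=0.005)
print(target_params)
```

A small tau keeps the target values stable between updates, while tau = 1 would copy the online parameters outright.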
See also
Please refer to BasePolicy for more detailed explanation.
- train(mode: bool = True) BCQPolicy [source]¶
Set the module in training mode, except for the target network.
- forward(batch: Batch, state: Optional[Union[dict, Batch, ndarray]] = None, **kwargs: Any) Batch [source]¶
Compute action over the given batch data.
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- training: bool¶
- class tianshou.policy.CQLPolicy(actor: ActorProb, actor_optim: Optimizer, critic1: Module, critic1_optim: Optimizer, critic2: Module, critic2_optim: Optimizer, cql_alpha_lr: float = 0.0001, cql_weight: float = 1.0, tau: float = 0.005, gamma: float = 0.99, alpha: Union[float, Tuple[float, Tensor, Optimizer]] = 0.2, temperature: float = 1.0, with_lagrange: bool = True, lagrange_threshold: float = 10.0, min_action: float = -1.0, max_action: float = 1.0, num_repeat_actions: int = 10, alpha_min: float = 0.0, alpha_max: float = 1000000.0, clip_grad: float = 1.0, device: Union[str, device] = 'cpu', **kwargs: Any)[source]¶
Bases:
SACPolicy
Implementation of CQL algorithm. arXiv:2006.04779.
- Parameters
actor (ActorProb) – the actor network following the rules in BasePolicy. (s -> a)
actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critic1 (torch.nn.Module) – the first critic network. (s, a -> Q(s, a))
critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.
critic2 (torch.nn.Module) – the second critic network. (s, a -> Q(s, a))
critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.
cql_alpha_lr (float) – the learning rate of cql_log_alpha. Default to 1e-4.
cql_weight (float) – the weight of the CQL regularizer (alpha in the CQL paper). Default to 1.0.
tau (float) – param for soft update of the target network. Default to 0.005.
gamma (float) – discount factor, in [0, 1]. Default to 0.99.
alpha ((float, torch.Tensor, torch.optim.Optimizer) or float) – entropy regularization coefficient. Default to 0.2. If a tuple (target_entropy, log_alpha, alpha_optim) is provided, then alpha is automatically tuned.
temperature (float) – the value of temperature. Default to 1.0.
with_lagrange (bool) – whether to use Lagrange. Default to True.
lagrange_threshold (float) – the value of tau in CQL(Lagrange). Default to 10.0.
min_action (float) – The minimum value of each dimension of action. Default to -1.0.
max_action (float) – The maximum value of each dimension of action. Default to 1.0.
num_repeat_actions (int) – The number of times the action is repeated when calculating log-sum-exp. Default to 10.
alpha_min (float) – lower bound for clipping cql_alpha. Default to 0.0.
alpha_max (float) – upper bound for clipping cql_alpha. Default to 1e6.
clip_grad (float) – clip_grad for updating critic network. Default to 1.0.
device (Union[str, torch.device]) – which device to create this model on. Default to “cpu”.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
See also
Please refer to BasePolicy for more detailed explanation.
- train(mode: bool = True) CQLPolicy [source]¶
Set the module in training mode, except for the target network.
- process_fn(batch: Batch, buffer: ReplayBuffer, indices: ndarray) Batch [source]¶
Pre-process the data from the provided replay buffer.
Used in update(). Check out policy.process_fn for more information.
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- training: bool¶
- actor: torch.nn.Module¶
- actor_optim: torch.optim.Optimizer¶
- critic: torch.nn.Module¶
- critic_optim: torch.optim.Optimizer¶
- class tianshou.policy.TD3BCPolicy(actor: ~torch.nn.modules.module.Module, actor_optim: ~torch.optim.optimizer.Optimizer, critic1: ~torch.nn.modules.module.Module, critic1_optim: ~torch.optim.optimizer.Optimizer, critic2: ~torch.nn.modules.module.Module, critic2_optim: ~torch.optim.optimizer.Optimizer, tau: float = 0.005, gamma: float = 0.99, exploration_noise: ~typing.Optional[~tianshou.exploration.random.BaseNoise] = <tianshou.exploration.random.GaussianNoise object>, policy_noise: float = 0.2, update_actor_freq: int = 2, noise_clip: float = 0.5, alpha: float = 2.5, reward_normalization: bool = False, estimation_step: int = 1, **kwargs: ~typing.Any)[source]¶
Bases:
TD3Policy
Implementation of TD3+BC. arXiv:2106.06860.
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critic1 (torch.nn.Module) – the first critic network. (s, a -> Q(s, a))
critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.
critic2 (torch.nn.Module) – the second critic network. (s, a -> Q(s, a))
critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.
tau (float) – param for soft update of the target network. Default to 0.005.
gamma (float) – discount factor, in [0, 1]. Default to 0.99.
exploration_noise (BaseNoise) – the exploration noise added to the action. Default to GaussianNoise(sigma=0.1).
policy_noise (float) – the noise used in updating policy network. Default to 0.2.
update_actor_freq (int) – the update frequency of actor network. Default to 2.
noise_clip (float) – the clipping range used in updating policy network. Default to 0.5.
alpha (float) – the value of alpha, which controls the weight for TD3 learning relative to behavior cloning. Default to 2.5.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action) or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
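To make the role of alpha concrete, the sketch below follows the actor objective described in the TD3+BC paper (arXiv:2106.06860): the Q term is scaled by lambda = alpha / mean(|Q|) and a mean-squared behavior cloning term is added. The function `td3bc_actor_loss` is hypothetical and works on scalar actions in plain Python; it is a reading of the paper, not a copy of this class's learn().

```python
def td3bc_actor_loss(q_values, policy_actions, dataset_actions, alpha=2.5):
    """Scalar TD3+BC actor loss for a batch of one-dimensional actions."""
    n = len(q_values)
    # lambda normalizes the Q term by the average Q magnitude, so alpha
    # alone sets the balance between RL and imitation. In a real
    # implementation lambda is treated as a constant (no gradient).
    lmbda = alpha / (sum(abs(q) for q in q_values) / n)
    q_term = -lmbda * sum(q_values) / n
    bc_term = sum((p - d) ** 2 for p, d in zip(policy_actions, dataset_actions)) / n
    return q_term + bc_term


loss = td3bc_actor_loss(
    q_values=[2.0, 2.0],
    policy_actions=[0.5, 0.5],
    dataset_actions=[0.5, 0.5],
)
print(loss)  # → -2.5
```

A larger alpha puts more weight on maximizing Q; a smaller alpha keeps the policy closer to the actions in the dataset.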
See also
Please refer to BasePolicy for more detailed explanation.
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- training: bool¶
- actor: torch.nn.Module¶
- actor_optim: torch.optim.Optimizer¶
- critic: torch.nn.Module¶
- critic_optim: torch.optim.Optimizer¶
- class tianshou.policy.DiscreteBCQPolicy(model: Module, imitator: Module, optim: Optimizer, discount_factor: float = 0.99, estimation_step: int = 1, target_update_freq: int = 8000, eval_eps: float = 0.001, unlikely_action_threshold: float = 0.3, imitation_logits_penalty: float = 0.01, reward_normalization: bool = False, **kwargs: Any)[source]¶
Bases:
DQNPolicy
Implementation of discrete BCQ algorithm. arXiv:1910.01708.
- Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> q_value)
imitator (torch.nn.Module) – a model following the rules in BasePolicy. (s -> imitation_logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1]. Default to 0.99.
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency. Default to 8000.
eval_eps (float) – the epsilon-greedy noise added in evaluation. Default to 1e-3.
unlikely_action_threshold (float) – the threshold (tau) for unlikely actions, as shown in Equ. (17) in the paper. Default to 0.3.
imitation_logits_penalty (float) – regularization weight for imitation logits. Default to 1e-2.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
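The effect of unlikely_action_threshold can be sketched as a filter on the imitator's action probabilities followed by a greedy choice over the remaining Q-values, as in the discrete BCQ paper. The function `bcq_select_action` is hypothetical and takes probabilities rather than logits for readability.

```python
def bcq_select_action(q_values, imitation_probs, threshold=0.3):
    """Pick the highest-Q action among those the imitator finds likely enough."""
    best_prob = max(imitation_probs)
    # Keep actions whose probability, relative to the most likely action,
    # exceeds the threshold; the most likely action always survives
    # when threshold < 1.
    allowed = [i for i, p in enumerate(imitation_probs) if p / best_prob > threshold]
    return max(allowed, key=lambda i: q_values[i])


q_values = [5.0, 1.0, 3.0]
imitation_probs = [0.05, 0.60, 0.35]

# Action 0 has the largest Q-value but is too unlikely under the imitator.
print(bcq_select_action(q_values, imitation_probs))  # → 2
```

With the threshold at 0 every action with non-zero probability stays eligible and the choice reduces to plain greedy Q-learning; as it approaches 1 the choice approaches pure imitation.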
See also
Please refer to BasePolicy for more detailed explanation.
- train(mode: bool = True) DiscreteBCQPolicy [source]¶
Set the module in training mode, except for the target network.
- forward(batch: Batch, state: Optional[Union[dict, Batch, ndarray]] = None, input: str = 'obs', **kwargs: Any) Batch [source]¶
Compute action over the given batch data.
If you need to mask the action, please add a “mask” into batch.obs, for example, if we have an environment that has “0/1/2” three actions:
    batch == Batch(
        obs=Batch(
            obs="original obs, with batch_size=1 for demonstration",
            mask=np.array([[False, True, False]]),
            # action 1 is available
            # action 0 and 2 are unavailable
        ),
        ...
    )
- Returns
A Batch which has 3 keys:
act – the action.
logits – the network’s raw output.
state – the hidden state.
See also
Please refer to forward() for more detailed explanation.
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- training: bool¶
- class tianshou.policy.DiscreteCQLPolicy(model: Module, optim: Optimizer, discount_factor: float = 0.99, num_quantiles: int = 200, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, min_q_weight: float = 10.0, **kwargs: Any)[source]¶
Bases:
QRDQNPolicy
Implementation of discrete Conservative Q-Learning algorithm. arXiv:2006.04779.
- Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
num_quantiles (int) – the number of quantile midpoints in the inverse cumulative distribution function of the value. Default to 200.
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network).
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
min_q_weight (float) – the weight for the CQL loss. Default to 10.0.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
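For discrete actions, the conservative term from the CQL paper (arXiv:2006.04779) is the log-sum-exp of the Q-values minus the Q-value of the action in the dataset, averaged over the batch. The sketch below computes it in plain Python and adds it to a base loss, assuming it is weighted by min_q_weight. The names `cql_regularizer` and `total_loss` are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_regularizer(q_batch, actions):
    """Mean of logsumexp_a Q(s, a) - Q(s, a_data) over the batch."""
    n = len(q_batch)
    pushed_down = sum(logsumexp(q) for q in q_batch) / n        # all actions
    pushed_up = sum(q[a] for q, a in zip(q_batch, actions)) / n  # dataset actions
    return pushed_down - pushed_up


def total_loss(base_loss, q_batch, actions, min_q_weight=10.0):
    """Base (e.g. TD or quantile) loss plus the weighted conservative term."""
    return base_loss + min_q_weight * cql_regularizer(q_batch, actions)


q_batch = [[0.0, 0.0], [1.0, 3.0]]  # Q-values for two states, two actions each
actions = [0, 1]                    # actions recorded in the dataset
print(total_loss(0.5, q_batch, actions))
```

The term is never negative, and it shrinks as the dataset action's Q-value dominates the others, which is how the penalty discourages overestimating unseen actions.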
See also
Please refer to QRDQNPolicy for more detailed explanation.
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- training: bool¶
- class tianshou.policy.DiscreteCRRPolicy(actor: Module, critic: Module, optim: Optimizer, discount_factor: float = 0.99, policy_improvement_mode: str = 'exp', ratio_upper_bound: float = 20.0, beta: float = 1.0, min_q_weight: float = 10.0, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)[source]¶
Bases:
PGPolicy
Implementation of discrete Critic Regularized Regression. arXiv:2006.15134.
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critic (torch.nn.Module) – the action-value critic (i.e., Q function) network. (s -> Q(s, *))
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1]. Default to 0.99.
policy_improvement_mode (str) – type of the weight function f. Possible values: “binary”/”exp”/”all”. Default to “exp”.
ratio_upper_bound (float) – when policy_improvement_mode is “exp”, the value of the exp function is upper-bounded by this parameter. Default to 20.
beta (float) – when policy_improvement_mode is “exp”, this is the denominator of the exp function. Default to 1.
min_q_weight (float) – weight for CQL loss/regularizer. Default to 10.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network). Default to 0.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
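The three policy_improvement_mode values correspond to different weight functions f applied to the advantage. A minimal scalar sketch under the descriptions above (the function `crr_weight` is hypothetical, and rejecting unknown modes is a choice made for this sketch):

```python
import math


def crr_weight(advantage, mode="exp", beta=1.0, ratio_upper_bound=20.0):
    """Weight applied to the log-likelihood of a dataset action."""
    if mode == "binary":
        # Imitate only actions that look better than the policy's average.
        return 1.0 if advantage > 0 else 0.0
    if mode == "exp":
        exponent = advantage / beta
        # Compare in log space first so math.exp cannot overflow.
        if exponent >= math.log(ratio_upper_bound):
            return ratio_upper_bound
        return math.exp(exponent)
    if mode == "all":
        # Plain behavior cloning: every action gets the same weight.
        return 1.0
    raise ValueError(f"unknown policy_improvement_mode: {mode!r}")


for adv in (-1.0, 0.0, 5.0):
    print(adv, crr_weight(adv, "binary"), crr_weight(adv, "exp"), crr_weight(adv, "all"))
```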
See also
Please refer to PGPolicy for more detailed explanation.
- training: bool¶
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- class tianshou.policy.GAILPolicy(actor: Module, critic: Module, optim: Optimizer, dist_fn: Type[Distribution], expert_buffer: ReplayBuffer, disc_net: Module, disc_optim: Optimizer, disc_update_num: int = 4, eps_clip: float = 0.2, dual_clip: Optional[float] = None, value_clip: bool = False, advantage_normalization: bool = True, recompute_advantage: bool = False, **kwargs: Any)[source]¶
Bases:
PPOPolicy
Implementation of Generative Adversarial Imitation Learning. arXiv:1606.03476.
- Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critic (torch.nn.Module) – the critic network. (s -> V(s))
optim (torch.optim.Optimizer) – the optimizer for actor and critic network.
dist_fn (Type[torch.distributions.Distribution]) – distribution class for computing the action.
expert_buffer (ReplayBuffer) – the replay buffer that contains expert experience.
disc_net (torch.nn.Module) – the discriminator network, whose input dimension equals the state dimension plus the action dimension and whose output dimension is 1.
disc_optim (torch.optim.Optimizer) – the optimizer for the discriminator network.
disc_update_num (int) – the number of discriminator grad steps per model grad step. Default to 4.
discount_factor (float) – in [0, 1]. Default to 0.99.
eps_clip (float) – \(\epsilon\) in \(L_{CLIP}\) in the original paper. Default to 0.2.
dual_clip (float) – a parameter c mentioned in arXiv:1912.09729 Equ. 5, where c > 1 is a constant indicating the lower bound. Default to None (dual clipping is not used).
value_clip (bool) – a parameter mentioned in arXiv:1811.02553 Sec. 4.1. Default to False.
advantage_normalization (bool) – whether to do per mini-batch advantage normalization. Default to True.
recompute_advantage (bool) – whether to recompute advantage every update repeat according to https://arxiv.org/pdf/2006.05990.pdf Sec. 3.5. Default to False.
vf_coef (float) – weight for value loss. Default to 0.5.
ent_coef (float) – weight for entropy loss. Default to 0.01.
max_grad_norm (float) – clipping gradients in back propagation. Default to None.
gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation. Default to 0.95.
reward_normalization (bool) – normalize estimated values to have std close to 1, also normalize the advantage to Normal(0, 1). Default to False.
max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 256.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
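The interaction of eps_clip and dual_clip is easiest to see on a single sample. The sketch below evaluates the clipped surrogate objective for one probability ratio and one advantage, with the optional dual-clip lower bound applied only when the advantage is negative. The function `ppo_clip_objective` is hypothetical and scalar; an actual update averages this quantity over a mini-batch and minimizes its negative.

```python
def ppo_clip_objective(ratio, advantage, eps_clip=0.2, dual_clip=None):
    """Clipped surrogate objective for one (ratio, advantage) pair."""
    clipped_ratio = max(1.0 - eps_clip, min(1.0 + eps_clip, ratio))
    # Standard PPO: take the more pessimistic of the two surrogates.
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    if dual_clip is not None and advantage < 0:
        # Dual clip: bound the objective from below by c * advantage.
        surrogate = max(surrogate, dual_clip * advantage)
    return surrogate


print(ppo_clip_objective(ratio=1.5, advantage=1.0))                   # limited by 1 + eps_clip
print(ppo_clip_objective(ratio=10.0, advantage=-1.0))                 # unbounded without dual clip
print(ppo_clip_objective(ratio=10.0, advantage=-1.0, dual_clip=5.0))  # bounded at c * advantage
```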
See also
Please refer to PPOPolicy for more detailed explanation.
- process_fn(batch: Batch, buffer: ReplayBuffer, indices: ndarray) Batch [source]¶
Pre-process the data from the provided replay buffer.
Used in update(). Check out policy.process_fn for more information.
- learn(batch: Batch, batch_size: int, repeat: int, **kwargs: Any) Dict[str, List[float]] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- training: bool¶
Model-based¶
- class tianshou.policy.PSRLPolicy(trans_count_prior: ndarray, rew_mean_prior: ndarray, rew_std_prior: ndarray, discount_factor: float = 0.99, epsilon: float = 0.01, add_done_loop: bool = False, **kwargs: Any)[source]¶
Bases:
BasePolicy
Implementation of Posterior Sampling Reinforcement Learning.
Reference: Strens M. A Bayesian framework for reinforcement learning [C] //ICML. 2000, 2000: 943-950.
- Parameters
trans_count_prior (np.ndarray) – Dirichlet prior (alphas), with shape (n_state, n_action, n_state).
rew_mean_prior (np.ndarray) – means of the normal priors of rewards, with shape (n_state, n_action).
rew_std_prior (np.ndarray) – standard deviations of the normal priors of rewards, with shape (n_state, n_action).
discount_factor (float) – in [0, 1].
epsilon (float) – for precision control in value iteration.
add_done_loop (bool) – whether to add an extra self-loop for the terminal state in MDP. Default to False.
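To show where discount_factor and epsilon enter, here is a generic tabular value iteration in plain Python that stops once the largest value change falls below epsilon. It uses the same array layouts as the priors above, (n_state, n_action, n_state) for transitions and (n_state, n_action) for rewards, but the function `value_iteration` is a hypothetical sketch, not this class's internal routine.

```python
def value_iteration(trans_prob, rew, discount_factor=0.99, epsilon=0.01):
    """Return state values for a known tabular MDP.

    Assumes discount_factor < 1 so that the iteration converges.
    """
    n_state, n_action = len(rew), len(rew[0])
    value = [0.0] * n_state
    while True:
        new_value = []
        for s in range(n_state):
            # Bellman optimality backup: best one-step lookahead over actions.
            q = [
                rew[s][a] + discount_factor * sum(
                    trans_prob[s][a][s2] * value[s2] for s2 in range(n_state)
                )
                for a in range(n_action)
            ]
            new_value.append(max(q))
        delta = max(abs(n - o) for n, o in zip(new_value, value))
        value = new_value
        if delta < epsilon:  # precision control
            return value


# Two states, one action: state 0 moves to state 1, state 1 loops with reward 1.
trans_prob = [[[0.0, 1.0]], [[0.0, 1.0]]]
rew = [[0.0], [1.0]]
print(value_iteration(trans_prob, rew, discount_factor=0.5, epsilon=1e-6))
```

A smaller epsilon gives more accurate values at the cost of more sweeps over the state space.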
See also
Please refer to BasePolicy for more detailed explanation.
- forward(batch: Batch, state: Optional[Union[dict, Batch, ndarray]] = None, **kwargs: Any) Batch [source]¶
Compute action over the given batch data with PSRL model.
- Returns
A Batch with “act” key containing the action.
See also
Please refer to forward() for more detailed explanation.
- learn(batch: Batch, *args: Any, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- training: bool¶
- class tianshou.policy.ICMPolicy(policy: BasePolicy, model: IntrinsicCuriosityModule, optim: Optimizer, lr_scale: float, reward_scale: float, forward_loss_weight: float, **kwargs: Any)[source]¶
Bases:
BasePolicy
Implementation of Intrinsic Curiosity Module. arXiv:1705.05363.
- Parameters
policy (BasePolicy) – a base policy to add ICM to.
model (IntrinsicCuriosityModule) – the ICM model.
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
lr_scale (float) – the scaling factor for ICM learning.
reward_scale (float) – the scaling factor for the intrinsic (curiosity) reward.
forward_loss_weight (float) – the weight for forward model loss.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
See also
Please refer to BasePolicy for more detailed explanation.
- forward(batch: Batch, state: Optional[Union[dict, Batch, ndarray]] = None, **kwargs: Any) Batch [source]¶
Compute action over the given batch data by inner policy.
See also
Please refer to forward() for more detailed explanation.
- exploration_noise(act: Union[ndarray, Batch], batch: Batch) Union[ndarray, Batch] [source]¶
Modify the action from policy.forward with exploration noise.
- Parameters
act – a data batch or numpy.ndarray which is the action taken by policy.forward.
batch – the input batch for policy.forward, kept for advanced usage.
- Returns
action in the same form of input “act” but with added exploration noise.
- process_fn(batch: Batch, buffer: ReplayBuffer, indices: ndarray) Batch [source]¶
Pre-process the data from the provided replay buffer.
Used in update(). Check out policy.process_fn for more information.
- post_process_fn(batch: Batch, buffer: ReplayBuffer, indices: ndarray) None [source]¶
Post-process the data from the provided replay buffer.
Typical usage is to update the sampling weight in prioritized experience replay. Used in update().
- learn(batch: Batch, **kwargs: Any) Dict[str, float] [source]¶
Update policy with a given batch of data.
- Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.
- training: bool¶
Multi-agent¶
- class tianshou.policy.MultiAgentPolicyManager(policies: List[BasePolicy], env: PettingZooEnv, **kwargs: Any)[source]¶
Bases:
BasePolicy
Multi-agent policy manager for MARL.
This multi-agent policy manager accepts a list of BasePolicy. It dispatches the batch data to each of these policies when “forward” is called. The same applies to “process_fn” and “learn”: it splits the data and feeds each part to the corresponding policy. A figure in Multi-Agent Reinforcement Learning can help you better understand this procedure.
- replace_policy(policy: BasePolicy, agent_id: int) None [source]¶
Replace the “agent_id”th policy in this manager.
- process_fn(batch: Batch, buffer: ReplayBuffer, indice: ndarray) Batch [source]¶
Dispatch batch data from obs.agent_id to every policy’s process_fn.
Save original multi-dimensional rew in “save_rew”, set rew to the reward of each agent during their “process_fn”, and restore the original reward afterwards.
- exploration_noise(act: Union[ndarray, Batch], batch: Batch) Union[ndarray, Batch] [source]¶
Add exploration noise from sub-policy onto act.
- forward(batch: Batch, state: Optional[Union[dict, Batch]] = None, **kwargs: Any) Batch [source]¶
Dispatch batch data from obs.agent_id to every policy’s forward.
- Parameters
state – if None, it means all agents have no state. If not None, it should contain keys of “agent_1”, “agent_2”, …
- Returns
a Batch with the following contents:
    {
        "act": actions corresponding to the input
        "state": {
            "agent_1": output state of agent_1's policy for the state
            "agent_2": xxx
            ...
            "agent_n": xxx}
        "out": {
            "agent_1": output of agent_1's policy for the input
            "agent_2": xxx
            ...
            "agent_n": xxx}
    }
- learn(batch: Batch, **kwargs: Any) Dict[str, Union[float, List[float]]] [source]¶
Dispatch the data to all policies for learning.
- Returns
a dict with the following contents:
    {
        "agent_1/item1": item 1 of agent_1's policy.learn output
        "agent_1/item2": item 2 of agent_1's policy.learn output
        "agent_2/xxx": xxx
        ...
        "agent_n/xxx": xxx
    }
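The key layout of the returned dict can be reproduced by prefixing every entry of each agent's learn output with that agent's id. A minimal sketch with plain dicts (the helper `merge_learn_results` and the sample values are hypothetical):

```python
def merge_learn_results(per_agent_results):
    """Flatten {agent_id: {item: value}} into {"agent_id/item": value}."""
    merged = {}
    for agent_id, result in per_agent_results.items():
        for key, value in result.items():
            merged[f"{agent_id}/{key}"] = value
    return merged


results = {
    "agent_1": {"loss": 0.5},
    "agent_2": {"loss": 0.25, "loss/actor": 0.1},
}
print(merge_learn_results(results))
# → {'agent_1/loss': 0.5, 'agent_2/loss': 0.25, 'agent_2/loss/actor': 0.1}
```

Because each key carries the agent id as a prefix, a logger can group the metrics per agent without extra bookkeeping.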
- training: bool¶