discrete_sac
Source code: tianshou/policy/modelfree/discrete_sac.py
- class DiscreteSACPolicy(*, actor: Module, actor_optim: Optimizer, critic: Module, critic_optim: Optimizer, action_space: Discrete, critic2: Module | None = None, critic2_optim: Optimizer | None = None, tau: float = 0.005, gamma: float = 0.99, alpha: float | tuple[float, Tensor, Optimizer] = 0.2, estimation_step: int = 1, observation_space: Space | None = None, lr_scheduler: LRScheduler | MultipleLRSchedulers | None = None)
Implementation of SAC for Discrete Action Settings. arXiv:1910.07207.
- Parameters:
actor – the actor network following the rules in BasePolicy. (s -> logits)
actor_optim – the optimizer for actor network.
critic – the first critic network. (s, a -> Q(s, a))
critic_optim – the optimizer for the first critic network.
action_space – Env’s action space. Should be gym.spaces.Discrete.
critic2 – the second critic network. (s, a -> Q(s, a)). If None, use the same network as critic (via deepcopy).
critic2_optim – the optimizer for the second critic network. If None, clone critic_optim to use for critic2.parameters().
tau – param for soft update of the target network.
gamma – discount factor, in [0, 1].
alpha – entropy regularization coefficient. If a tuple (target_entropy, log_alpha, alpha_optim) is provided, then alpha is automatically tuned.
estimation_step – the number of steps to look ahead for calculating the n-step return.
observation_space – Env’s observation space.
lr_scheduler – a learning rate scheduler that adjusts the optimizer's learning rate on each policy.update() call.
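The tuple form of alpha and the role of tau can be illustrated with plain PyTorch. This is a minimal sketch, not the library's implementation: `n_actions`, the `0.98 * log(n_actions)` entropy target, the learning rate, the `soft_update` helper, and the `Linear` stand-in networks are all assumptions made for illustration.

```python
import math

import torch

n_actions = 4  # hypothetical size of the Discrete action space

# Tuple form of alpha: (target_entropy, log_alpha, alpha_optim).
# The 0.98 * log(n_actions) target is a common heuristic, assumed here.
target_entropy = 0.98 * math.log(n_actions)
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optim = torch.optim.Adam([log_alpha], lr=3e-4)
alpha = (target_entropy, log_alpha, alpha_optim)


def soft_update(target: torch.nn.Module, source: torch.nn.Module, tau: float) -> None:
    """Hypothetical helper: target <- tau * source + (1 - tau) * target."""
    with torch.no_grad():
        for t, s in zip(target.parameters(), source.parameters()):
            t.copy_(tau * s + (1.0 - tau) * t)


# Stand-in networks; a real critic would be a user-defined Module.
critic = torch.nn.Linear(3, n_actions)
critic_target = torch.nn.Linear(3, n_actions)
before = critic_target.weight.detach().clone()
soft_update(critic_target, critic, tau=0.005)
```

With tau = 0.005 the target network moves only 0.5% of the way toward the online critic per update, which keeps the bootstrapped targets slowly varying.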
See also
Please refer to BasePolicy for more detailed explanation.
- exploration_noise(act: ndarray | BatchProtocol, batch: RolloutBatchProtocol) ndarray | BatchProtocol
Modify the action from policy.forward with exploration noise.
NOTE: currently does not add any noise! Needs to be overridden by subclasses to actually do something.
- Parameters:
act – a data batch or numpy.ndarray which is the action taken by policy.forward.
batch – the input batch for policy.forward, kept for advanced usage.
- Returns:
The action, in the same form as the input “act”, but with added exploration noise.
- forward(batch: ObsBatchProtocol, state: dict | Batch | ndarray | None = None, **kwargs: Any) Batch
Compute action over the given batch data.
- Returns:
A Batch which has 2 keys:
act – the action.
state – the hidden state.
See also
Please refer to forward() for more detailed explanation.
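Since the actor maps s -> logits, the act entry can be pictured as a draw from a categorical distribution over those logits. The sketch below assumes that behavior, which is the usual choice for discrete SAC; the actual forward() may differ in detail. The random `logits` tensor stands in for a real actor's output.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)

batch_size, n_actions = 5, 3
# Stand-in for the actor's output (s -> logits); values are random here.
logits = torch.randn(batch_size, n_actions)

dist = Categorical(logits=logits)
sampled_act = dist.sample()          # stochastic action, shape [batch_size]
greedy_act = logits.argmax(dim=-1)   # deterministic alternative, shape [batch_size]
```

Sampling preserves the entropy-driven exploration that SAC relies on, while the argmax variant is a natural choice for evaluation.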
- learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) TDiscreteSACTrainingStats
Update policy with a given batch of data.
- Returns:
A dataclass object, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. Automatic broadcasting in numerical operations on torch tensors will amplify this error.
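The shape mismatch can be reproduced in a few lines. This is a minimal sketch with made-up inputs (zero logits, a standard normal, a batch of 4) that shows how adding the two log-probabilities silently broadcasts to a square matrix, and one way to avoid it.

```python
import torch
from torch.distributions import Categorical, Normal

batch_size = 4

# Categorical over 3 actions: log_prob has shape [batch_size].
cat = Categorical(logits=torch.zeros(batch_size, 3))
cat_logp = cat.log_prob(torch.zeros(batch_size, dtype=torch.long))

# Normal with a trailing action dimension of 1: log_prob has shape [batch_size, 1].
normal = Normal(torch.zeros(batch_size, 1), torch.ones(batch_size, 1))
normal_logp = normal.log_prob(torch.zeros(batch_size, 1))

bad = cat_logp + normal_logp               # broadcasts to [batch_size, batch_size]
good = cat_logp + normal_logp.squeeze(-1)  # stays [batch_size]

print(cat_logp.shape, normal_logp.shape, bad.shape, good.shape)
```

No error is raised for `bad`, so a later `.mean()` would average over 16 values instead of 4; an explicit shape assertion before combining such terms is a cheap safeguard.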