discrete_sac#


class DiscreteSACPolicy(*, actor: Module, actor_optim: Optimizer, critic: Module, critic_optim: Optimizer, action_space: Discrete, critic2: Module | None = None, critic2_optim: Optimizer | None = None, tau: float = 0.005, gamma: float = 0.99, alpha: float | tuple[float, Tensor, Optimizer] = 0.2, estimation_step: int = 1, observation_space: Space | None = None, lr_scheduler: LRScheduler | MultipleLRSchedulers | None = None)[source]#

Implementation of SAC for Discrete Action Settings. arXiv:1910.07207.

Parameters:
  • actor – the actor network following the rules in BasePolicy. (s -> logits)

  • actor_optim – the optimizer for actor network.

  • critic – the first critic network. (s -> Q(s, ·), i.e. one Q-value per discrete action)

  • critic_optim – the optimizer for the first critic network.

  • action_space – Env’s action space. Should be an instance of gym.spaces.Discrete.

  • critic2 – the second critic network. (s -> Q(s, ·)). If None, use the same network as critic (via deepcopy).

  • critic2_optim – the optimizer for the second critic network. If None, clone critic_optim to use for critic2.parameters().

  • tau – coefficient for the soft update of the target networks.

  • gamma – discount factor, in [0, 1].

  • alpha – entropy regularization coefficient. If a tuple (target_entropy, log_alpha, alpha_optim) is provided, alpha is tuned automatically during training (see the construction sketch below).

  • estimation_step – the number of steps to look ahead when computing the n-step return.

  • observation_space – Env’s observation space.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate of the optimizers at each policy.update().

See also

Please refer to BasePolicy for a more detailed explanation.
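
For orientation, here is a minimal construction sketch. It assumes a gymnasium CartPole-style environment and the keyword-only constructor shown above; the Actor and Critic classes below are hand-rolled stand-ins (any torch.nn.Module with matching input/output shapes should work), and the target-entropy heuristic 0.98 * log(|A|) is only a common choice, not a requirement:

    import gymnasium as gym
    import numpy as np
    import torch
    from torch import nn

    from tianshou.policy import DiscreteSACPolicy

    env = gym.make("CartPole-v1")
    obs_dim = int(np.prod(env.observation_space.shape))
    n_actions = int(env.action_space.n)


    class Actor(nn.Module):
        """Map observations to action logits; return (logits, state) as Tianshou actors do."""

        def __init__(self) -> None:
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

        def forward(self, obs, state=None, info=None):
            return self.net(torch.as_tensor(obs, dtype=torch.float32)), state


    class Critic(nn.Module):
        """Map observations to one Q-value per discrete action."""

        def __init__(self) -> None:
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

        def forward(self, obs, **kwargs):
            return self.net(torch.as_tensor(obs, dtype=torch.float32))


    actor, critic1, critic2 = Actor(), Critic(), Critic()

    # Optional automatic entropy tuning: pass alpha as (target_entropy, log_alpha, alpha_optim).
    target_entropy = float(0.98 * np.log(n_actions))
    log_alpha = torch.zeros(1, requires_grad=True)
    alpha_optim = torch.optim.Adam([log_alpha], lr=3e-4)

    policy = DiscreteSACPolicy(
        actor=actor,
        actor_optim=torch.optim.Adam(actor.parameters(), lr=3e-4),
        critic=critic1,
        critic_optim=torch.optim.Adam(critic1.parameters(), lr=3e-4),
        critic2=critic2,
        critic2_optim=torch.optim.Adam(critic2.parameters(), lr=3e-4),
        action_space=env.action_space,
        tau=0.005,
        gamma=0.99,
        alpha=(target_entropy, log_alpha, alpha_optim),
    )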

exploration_noise(act: ndarray | BatchProtocol, batch: RolloutBatchProtocol) → ndarray | BatchProtocol[source]#

Modify the action from policy.forward with exploration noise.

NOTE: this base implementation returns the action unchanged (no noise is added); override it in a subclass to implement actual exploration noise.

Parameters:
  • act – a data batch or numpy.ndarray which is the action taken by policy.forward.

  • batch – the input batch for policy.forward, kept for advanced usage.

Returns:

The action, in the same form as the input act, with exploration noise added.
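
Since the base method is a no-op, adding exploration noise requires a subclass. The following is a hypothetical sketch using epsilon-greedy resampling over the discrete actions; eps is an assumed hyperparameter, not part of the API:

    import numpy as np

    from tianshou.policy import DiscreteSACPolicy


    class EpsilonNoisyDiscreteSAC(DiscreteSACPolicy):
        """Replace each sampled action with a uniformly random one with probability eps."""

        eps: float = 0.05  # assumed exploration rate

        def exploration_noise(self, act, batch):
            if isinstance(act, np.ndarray):
                mask = np.random.rand(len(act)) < self.eps
                # self.action_space is stored by the base class from the constructor argument
                random_act = np.random.randint(self.action_space.n, size=len(act))
                return np.where(mask, random_act, act)
            return act  # leave Batch-typed actions untouched in this sketch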

forward(batch: ObsBatchProtocol, state: dict | Batch | ndarray | None = None, **kwargs: Any) → Batch[source]#

Compute action over the given batch data.

Returns:

A Batch which has 2 keys:

  • act – the action.

  • state – the hidden state.

See also

Please refer to forward() for a more detailed explanation.
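
A minimal inference sketch, continuing from the construction example above (the two-observation array is arbitrary dummy data):

    import numpy as np

    from tianshou.data import Batch

    obs = np.zeros((2, 4), dtype=np.float32)   # two dummy CartPole observations
    result = policy(Batch(obs=obs, info={}))   # __call__ dispatches to forward()
    print(result.act)  # sampled discrete actions, one per observation
    # result.state holds the hidden state (None for the stateless MLP actor above)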

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) → TDiscreteSACTrainingStats[source]#

Update policy with a given batch of data.

Returns:

A dataclass object containing the data that should be logged (e.g., losses).

Note

To distinguish between the collecting, updating, and testing states, you can check the policy state via self.training and self.updating. Please refer to States for policy for a more detailed explanation.

Warning

If you use torch.distributions.Normal or torch.distributions.Categorical to compute the log_prob, be careful about the shape: the Categorical distribution yields a log_prob of shape [batch_size], while the Normal distribution yields [batch_size, 1]. Automatic broadcasting in torch tensor operations can silently amplify this mismatch.
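
The shape difference can be checked directly:

    import torch
    from torch.distributions import Categorical, Normal

    actions = torch.zeros(4, dtype=torch.long)
    cat = Categorical(logits=torch.zeros(4, 3))
    print(cat.log_prob(actions).shape)                 # torch.Size([4])

    normal = Normal(torch.zeros(4, 1), torch.ones(4, 1))
    print(normal.log_prob(torch.zeros(4, 1)).shape)    # torch.Size([4, 1])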

class DiscreteSACTrainingStats(*, train_time: float = 0.0, smoothed_loss: dict = <factory>, actor_loss: float, critic1_loss: float, critic2_loss: float, alpha: float | None = None, alpha_loss: float | None = None)[source]#