sac


Source code: tianshou/policy/modelfree/sac.py

class SACPolicy(*, actor: Module, actor_optim: Optimizer, critic: Module, critic_optim: Optimizer, action_space: Space, critic2: Module | None = None, critic2_optim: Optimizer | None = None, tau: float = 0.005, gamma: float = 0.99, alpha: float | tuple[float, Tensor, Optimizer] = 0.2, estimation_step: int = 1, exploration_noise: BaseNoise | Literal['default'] | None = None, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: Literal['clip'] | None = 'clip', observation_space: Space | None = None, lr_scheduler: LRScheduler | MultipleLRSchedulers | None = None)

Implementation of Soft Actor-Critic. arXiv:1812.05905.

Parameters:

actor – the actor network following the rules in BasePolicy. (s -> logits)
actor_optim – the optimizer for the actor network.
critic – the first critic network. (s, a -> Q(s, a))
critic_optim – the optimizer for the first critic network.
action_space – Env’s action space. Should be gym.spaces.Box.
critic2 – the second critic network. (s, a -> Q(s, a)). If None, a deep copy of critic is used.
critic2_optim – the optimizer for the second critic network. If None, clone critic_optim to use for critic2.parameters().
tau – the coefficient for the soft update of the target network.
gamma – discount factor, in [0, 1].
alpha – entropy regularization coefficient. If a tuple (target_entropy, log_alpha, alpha_optim) is provided, then alpha is automatically tuned.
estimation_step – the number of steps to look ahead (n-step return).
exploration_noise – add noise to action for exploration. This is useful when solving “hard exploration” problems. “default” is equivalent to GaussianNoise(sigma=0.1).
deterministic_eval – whether to use deterministic action (mode of Gaussian policy) in evaluation mode instead of stochastic action sampled by the policy. Does not affect training.
action_scaling – whether to map actions from range [-1, 1] to range [action_space.low, action_space.high].
action_bound_method – method to bound action to range [-1, 1]; can be either “clip” (for simply clipping the action) or None for no bounding. Only used if the action_space is continuous.
observation_space – Env’s observation space.
lr_scheduler – a learning rate scheduler that adjusts the optimizer's learning rate on each policy.update() call.
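
To make the tau, action_bound_method, and action_scaling parameters more concrete, here is a minimal sketch in plain Python. It is not Tianshou's implementation: the helper names soft_update and bound_and_scale are hypothetical, and weights and actions are modeled as plain lists of floats rather than tensors or arrays. It assumes the usual Polyak-averaging form of a soft update and a linear map from [-1, 1] onto [low, high].

```python
def soft_update(target, source, tau):
    """Polyak averaging: move each target weight a fraction tau toward the source.

    Hypothetical helper for illustration; weights are plain floats here.
    """
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]


def bound_and_scale(action, low, high):
    """Clip each action component to [-1, 1], then map it linearly onto [low, high].

    Hypothetical helper mirroring action_bound_method="clip" followed by
    action_scaling=True.
    """
    scaled = []
    for a, lo, hi in zip(action, low, high):
        a = max(-1.0, min(1.0, a))  # clip to [-1, 1]
        scaled.append(lo + (hi - lo) * (a + 1.0) / 2.0)  # rescale to [lo, hi]
    return scaled


# With tau = 0.005 the target weights move only 0.5% of the way per update.
target = soft_update(target=[0.0, 1.0], source=[1.0, 1.0], tau=0.005)

# An out-of-range component (2.5) is clipped to 1 before being rescaled.
env_action = bound_and_scale(
    [-1.0, 0.0, 2.5], low=[0.0, 0.0, 0.0], high=[10.0, 10.0, 10.0]
)
print(env_action)  # [0.0, 5.0, 10.0]
```

A small tau keeps the target network changing slowly, which is the usual motivation for soft updates; clipping before scaling keeps the action sent to the environment inside the bounds of the action space.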
