a2c

Source code: tianshou/policy/modelfree/a2c.py

class A2CPolicy(*, actor: Module | ActorProb | Actor, critic: Module | Critic | Critic, optim: Optimizer, dist_fn: Callable[[tuple[Tensor, Tensor]], Distribution] | Callable[[Tensor], Categorical], action_space: Space, vf_coef: float = 0.5, ent_coef: float = 0.01, max_grad_norm: float | None = None, gae_lambda: float = 0.95, max_batchsize: int = 256, discount_factor: float = 0.99, reward_normalization: bool = False, deterministic_eval: bool = False, observation_space: Space | None = None, action_scaling: bool = True, action_bound_method: Literal['clip', 'tanh'] | None = 'clip', lr_scheduler: LRScheduler | MultipleLRSchedulers | None = None)

Implementation of Synchronous Advantage Actor-Critic. arXiv:1602.01783.

Parameters:

actor – the actor network, following these rules: If self.action_type == "discrete": (s_B -> action_values_BA). If self.action_type == "continuous": (s_B -> dist_input_BD).
critic – the critic network. (s -> V(s))
optim – the optimizer for the actor and critic networks.
dist_fn – distribution class for computing the action.
action_space – the environment's action space.
vf_coef – weight for value loss.
ent_coef – weight for entropy loss.
max_grad_norm – maximum gradient norm used to clip gradients during backpropagation. If None (the default), gradients are not clipped.
gae_lambda – in [0, 1], param for Generalized Advantage Estimation.
max_batchsize – the maximum size of the batch when computing GAE.
discount_factor – the discount factor for future rewards, in [0, 1].
reward_normalization – normalize estimated values to have std close to 1.
deterministic_eval – if True, use deterministic evaluation.
observation_space – the space of the observation.
action_scaling – if True, scale the action from [-1, 1] to the range of action_space. Only used if the action_space is continuous.
action_bound_method – method used to bound the action to the range [-1, 1]: 'clip', 'tanh', or None for no bounding. Only used if the action_space is continuous.
lr_scheduler – if not None, will be called in policy.update().
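
To make the roles of gae_lambda, discount_factor, vf_coef, ent_coef, action_scaling and action_bound_method more concrete, the following is a minimal plain-Python sketch of the textbook formulas these parameters usually control. It is not the library's code: the function names gae_advantages, a2c_loss and map_action, their argument layout, and the use of scalar floats instead of tensors are all assumptions made for illustration.

```python
import math


def gae_advantages(rewards, values, next_values, dones,
                   discount_factor=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over one trajectory (illustrative sketch).

    All inputs are equal-length lists; dones[t] is True if step t ended the episode.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the accumulated future advantage.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; the bootstrap value is dropped at episode end.
        delta = rewards[t] + discount_factor * next_values[t] * not_done - values[t]
        running = delta + discount_factor * gae_lambda * not_done * running
        advantages[t] = running
    return advantages


def a2c_loss(actor_loss, value_loss, entropy, vf_coef=0.5, ent_coef=0.01):
    """Combine the three A2C terms; the entropy bonus is subtracted (assumed convention)."""
    return actor_loss + vf_coef * value_loss - ent_coef * entropy


def map_action(act, low, high, action_scaling=True, action_bound_method="clip"):
    """Bound a scalar action to [-1, 1], then optionally rescale it to [low, high]."""
    if action_bound_method == "clip":
        act = max(-1.0, min(1.0, act))
    elif action_bound_method == "tanh":
        act = math.tanh(act)
    if action_scaling:
        act = low + (high - low) * (act + 1.0) / 2.0
    return act


if __name__ == "__main__":
    adv = gae_advantages(
        rewards=[1.0, 1.0, 1.0],
        values=[0.0, 0.0, 0.0],
        next_values=[0.0, 0.0, 0.0],
        dones=[False, False, True],
        discount_factor=1.0,
        gae_lambda=1.0,
    )
    print(adv)
    print(a2c_loss(actor_loss=1.0, value_loss=2.0, entropy=0.5))
    print(map_action(2.0, low=0.0, high=10.0))
```

With gae_lambda = 1 the estimate reduces to the discounted return minus the value baseline, while gae_lambda = 0 gives the one-step TD error; intermediate values trade variance against bias.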
