td3_bc


class TD3BCPolicy(*, actor: torch.nn.modules.module.Module, actor_optim: torch.optim.optimizer.Optimizer, critic: torch.nn.modules.module.Module, critic_optim: torch.optim.optimizer.Optimizer, action_space: gymnasium.spaces.space.Space, critic2: torch.nn.modules.module.Module | None = None, critic2_optim: torch.optim.optimizer.Optimizer | None = None, tau: float = 0.005, gamma: float = 0.99, exploration_noise: tianshou.exploration.random.BaseNoise | None = <tianshou.exploration.random.GaussianNoise object>, policy_noise: float = 0.2, update_actor_freq: int = 2, noise_clip: float = 0.5, alpha: float = 2.5, estimation_step: int = 1, observation_space: gymnasium.spaces.space.Space | None = None, action_scaling: bool = True, action_bound_method: typing.Literal['clip'] | None = 'clip', lr_scheduler: torch.optim.lr_scheduler.LRScheduler | tianshou.utils.lr_scheduler.MultipleLRSchedulers | None = None)

Implementation of TD3+BC. arXiv:2106.06860.

Parameters:
  • actor – the actor network following the rules in BasePolicy. (s -> logits)

  • actor_optim – the optimizer for actor network.

  • critic – the first critic network. (s, a -> Q(s, a))

  • critic_optim – the optimizer for the first critic network.

  • action_space – the environment’s action space. Should be a gymnasium.spaces.Box.

  • critic2 – the second critic network. (s, a -> Q(s, a)). If None, a deep copy of critic is used.

  • critic2_optim – the optimizer for the second critic network. If None, clone critic_optim to use for critic2.parameters().

  • tau – the coefficient for the soft update of the target networks.

  • gamma – discount factor, in [0, 1].

  • exploration_noise – noise added to the action for exploration. This is useful when solving “hard exploration” problems. The default is equivalent to GaussianNoise(sigma=0.1).

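  • policy_noise – the scale of the Gaussian noise added to the target policy’s action when computing the critic target (target policy smoothing).

  • update_actor_freq – the actor is updated once every update_actor_freq critic updates.

  • noise_clip – the clipping range for that noise: it is clamped to [-noise_clip, noise_clip].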

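  • alpha – the weight of the TD3 (Q-value) term relative to the behavior cloning term in the actor loss. Larger values put more weight on TD3 learning.

  • estimation_step – the number of steps to look ahead when computing n-step returns.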

  • observation_space – Env’s observation space.

  • action_scaling – if True, scale the action from [-1, 1] to the range of action_space. Only used if the action_space is continuous.

  • action_bound_method – method to bound action to range [-1, 1]. Only used if the action_space is continuous.

  • lr_scheduler – a learning rate scheduler that adjusts the optimizer’s learning rate on each call to policy.update().
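How tau, policy_noise, noise_clip and alpha typically enter a TD3+BC update can be sketched in plain PyTorch. This is a minimal illustration of the technique described in arXiv:2106.06860 and the TD3 paper, not a copy of this class’s internals. The function names (soft_update, smoothed_target_action, td3_bc_actor_loss) are hypothetical and exist only for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def soft_update(target: nn.Module, source: nn.Module, tau: float = 0.005) -> None:
    """Move each target parameter a fraction tau toward the online parameter."""
    with torch.no_grad():
        for t, s in zip(target.parameters(), source.parameters()):
            # t <- (1 - tau) * t + tau * s
            t.mul_(1.0 - tau).add_(s, alpha=tau)


def smoothed_target_action(
    next_act: torch.Tensor, policy_noise: float = 0.2, noise_clip: float = 0.5
) -> torch.Tensor:
    """Target policy smoothing: add clipped Gaussian noise to the target action."""
    noise = torch.randn_like(next_act) * policy_noise
    if noise_clip > 0.0:
        noise = noise.clamp(-noise_clip, noise_clip)
    return next_act + noise


def td3_bc_actor_loss(
    q_value: torch.Tensor,
    policy_act: torch.Tensor,
    dataset_act: torch.Tensor,
    alpha: float = 2.5,
) -> torch.Tensor:
    """Actor loss: -lambda * mean(Q(s, pi(s))) + MSE(pi(s), a)."""
    # lambda rescales the Q term by its average magnitude; it is detached so
    # that no gradient flows through the normalizer.
    lmbda = alpha / q_value.abs().mean().detach()
    return -lmbda * q_value.mean() + F.mse_loss(policy_act, dataset_act)
```

With alpha = 2.5 and Q-values averaging 3 in magnitude, the Q term is scaled by 2.5 / 3, so the behavior cloning term keeps a comparable weight regardless of the reward scale.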

See also

Please refer to BasePolicy for a more detailed explanation.

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) -> TTD3BCTrainingStats

Update policy with a given batch of data.

Returns:

A dataclass object, including the data needed to be logged (e.g., loss).

Note

To distinguish between the collecting, updating and testing states, check the policy state via self.training and self.updating. Please refer to States for policy for a more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate log_prob, be careful about the shape: a Categorical distribution gives a “[batch_size]” shape, while a Normal distribution gives a “[batch_size, 1]” shape. Automatic broadcasting in numerical operations on torch tensors will amplify this error instead of raising one.
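The shape mismatch described in this warning can be reproduced with a few lines of PyTorch. The sketch below assumes a batch of 4, three discrete actions for the Categorical case and a one-dimensional action for the Normal case. The variable names are illustrative only.

```python
import torch
from torch.distributions import Categorical, Normal

batch_size = 4

# Categorical over 3 discrete actions: log_prob has shape [batch_size].
cat = Categorical(logits=torch.zeros(batch_size, 3))
cat_logp = cat.log_prob(torch.zeros(batch_size, dtype=torch.long))

# Normal over a 1-D continuous action: log_prob has shape [batch_size, 1].
normal = Normal(torch.zeros(batch_size, 1), torch.ones(batch_size, 1))
normal_logp = normal.log_prob(torch.zeros(batch_size, 1))

# A per-sample weight (for example a return) of shape [batch_size].
weight = torch.ones(batch_size)

# [batch_size, 1] * [batch_size] silently broadcasts to [batch_size, batch_size].
bad = normal_logp * weight

# Dropping the trailing dimension first keeps the result at [batch_size].
good = normal_logp.squeeze(-1) * weight

print(cat_logp.shape, normal_logp.shape, bad.shape, good.shape)
```

Asserting the expected shape before reducing a loss is a cheap way to catch this, because no error is raised otherwise.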

class TD3BCTrainingStats(*, train_time: float = 0.0, smoothed_loss: dict = <factory>, actor_loss: float, critic1_loss: float, critic2_loss: float)