discrete_bcq

Source code: tianshou/policy/imitation/discrete_bcq.py

class DiscreteBCQPolicy(*, model: Module, imitator: Module, optim: Optimizer, action_space: Discrete, discount_factor: float = 0.99, estimation_step: int = 1, target_update_freq: int = 8000, eval_eps: float = 0.001, unlikely_action_threshold: float = 0.3, imitation_logits_penalty: float = 0.01, reward_normalization: bool = False, is_double: bool = True, clip_loss_grad: bool = False, observation_space: Space | None = None, lr_scheduler: LRScheduler | MultipleLRSchedulers | None = None)

Implementation of the discrete BCQ algorithm (arXiv:1910.01708).

Parameters:

model – a model that maps a batch of observations to action values (s_B -> action_values_BA).
imitator – a model following the rules in BasePolicy that maps observations to imitation logits (s -> imitation_logits).
optim – a torch.optim optimizer for the model.
action_space – Env’s action space (Discrete).
discount_factor – the discount factor, in [0, 1].
estimation_step – the number of steps to look ahead.
target_update_freq – the target network update frequency (0 if you do not use the target network).
eval_eps – the epsilon-greedy noise added in evaluation.
unlikely_action_threshold – the threshold (tau) for unlikely actions, as shown in Eq. (17) of the paper.
imitation_logits_penalty – the regularization weight for the imitation logits.
reward_normalization – whether to normalize the returns to Normal(0, 1).
is_double – whether to use double DQN.
clip_loss_grad – clip the gradient of the loss in accordance with nature14236; this amounts to using the Huber loss instead of the MSE loss.
observation_space – Env’s observation space.
lr_scheduler – a learning rate scheduler; if not None, it is called in policy.update().
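
To illustrate how unlikely_action_threshold (tau) might interact with the two networks, here is a minimal sketch in plain Python. It is not Tianshou's internal code: the function name bcq_select_action is hypothetical, and the rule it implements (keep an action only if its imitator probability, relative to the most likely action, is at least tau, then act greedily on the action values among the kept actions) is an assumption based on the thresholding idea of Eq. (17) in the paper.

```python
import math


def bcq_select_action(q_values, imitation_logits, threshold=0.3):
    """Hypothetical helper: greedy action among those the imitator finds likely enough.

    Assumes 0 < threshold <= 1 and that both lists have one entry per action.
    """
    # Log-softmax of the imitator's logits (shifted by the max for stability).
    max_logit = max(imitation_logits)
    log_z = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in imitation_logits)
    )
    log_probs = [logit - log_z for logit in imitation_logits]

    # Keep an action if p(a) / max_a' p(a') >= threshold (compared in log space).
    max_log_prob = max(log_probs)
    log_tau = math.log(threshold)
    allowed = [lp - max_log_prob >= log_tau for lp in log_probs]

    # Greedy choice on the action values, restricted to the allowed actions.
    # The most likely action is always allowed, so the candidate set is non-empty.
    best = max(
        (a for a, ok in enumerate(allowed) if ok),
        key=lambda a: q_values[a],
    )
    return best, allowed


# Action 1 has the highest action value, but the imitator considers it unlikely.
q_values = [1.0, 5.0, 2.0]
imitation_logits = [2.0, -3.0, 1.5]
action, allowed = bcq_select_action(q_values, imitation_logits, threshold=0.3)
print(action, allowed)  # 2 [True, False, True]
```

With a much smaller threshold (for example 0.001) every action in this example passes the filter, and the choice reduces to the ordinary greedy action over the action values.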
