fqf

Source code: tianshou/policy/modelfree/fqf.py

class FQFPolicy(*, model: FullQuantileFunction, optim: Optimizer, fraction_model: FractionProposalNetwork, fraction_optim: Optimizer, action_space: Discrete, discount_factor: float = 0.99, num_fractions: int = 32, ent_coef: float = 0.0, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, is_double: bool = True, clip_loss_grad: bool = False, observation_space: Space | None = None, lr_scheduler: LRScheduler | MultipleLRSchedulers | None = None)

Implementation of the Fully parameterized Quantile Function (FQF) algorithm. arXiv:1911.02140.

Parameters:

model – a model following the rules in BasePolicy (s -> logits).
optim – a torch.optim optimizer for the model.
fraction_model – a FractionProposalNetwork that proposes fractions/quantiles given a state.
fraction_optim – a torch.optim optimizer for the fraction model above.
action_space – the environment's action space.
discount_factor – the discount factor, in [0, 1].
num_fractions – the number of fractions to use.
ent_coef – the coefficient for the entropy loss.
estimation_step – the number of steps to look ahead.
target_update_freq – the target network update frequency (0 if you do not use a target network).
reward_normalization – whether to normalize the returns to Normal(0, 1). TODO: rename to return_normalization?
is_double – whether to use double DQN.
clip_loss_grad – whether to clip the gradient of the loss in accordance with nature14236; this amounts to using the Huber loss instead of the MSE loss.
observation_space – the environment's observation space.
lr_scheduler – if not None, will be called in policy.update().
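
To make the roles of fraction_model, num_fractions and ent_coef more concrete, here is a minimal pure-Python sketch of how a fraction proposal step can work, following the description in arXiv:1911.02140. It is not tianshou's implementation: the function name propose_fractions and its list-based interface are hypothetical, and the real FractionProposalNetwork operates on batched torch tensors.

```python
import math


def propose_fractions(logits):
    """Turn N raw scores into fraction boundaries, midpoints, and an entropy.

    Hypothetical helper for illustration only; not part of tianshou.
    """
    # Softmax over the scores gives a probability for each of the N intervals.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Cumulative sums give the boundaries tau_0 = 0 <= tau_1 <= ... <= tau_N = 1.
    taus = [0.0]
    for p in probs:
        taus.append(taus[-1] + p)

    # Midpoints tau_hat_i = (tau_i + tau_{i+1}) / 2, where quantile values are evaluated.
    tau_hats = [(lo + hi) / 2 for lo, hi in zip(taus[:-1], taus[1:])]

    # Entropy of the interval probabilities; an entropy coefficient would weight this term.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return taus, tau_hats, entropy


# Four equal scores stand in for num_fractions = 4 and yield evenly spaced fractions.
taus, tau_hats, entropy = propose_fractions([0.0, 0.0, 0.0, 0.0])
```

In this sketch the number of scores plays the role of num_fractions. A nonzero ent_coef presumably weights an entropy term like the one returned here, which discourages the proposed fractions from collapsing onto a few narrow intervals.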

See also

Please refer to QRDQNPolicy for a more detailed explanation.

forward(batch: ObsBatchProtocol, state: dict | Batch | ndarray | None = None, model: Literal['model', 'model_old'] = 'model', fractions: Batch | None = None, **kwargs: Any) → FQFBatchProtocol

Compute action over the given batch data.

To mask actions, add a “mask” entry to batch.obs. For example, for an environment with three actions (0, 1, and 2):

batch == Batch(
    obs=Batch(
        obs="original obs, with batch_size=1 for demonstration",
        mask=np.array([[False, True, False]]),
        # action 1 is available
        # action 0 and 2 are unavailable
    ),
    ...
)
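
The effect of such a mask on greedy action selection can be sketched in plain Python. This is only an illustration of the idea, not necessarily how the library applies the mask internally; the helper masked_greedy_action and the example values are hypothetical.

```python
def masked_greedy_action(q_values, mask):
    """Return the index of the highest-valued action among those the mask allows.

    Hypothetical helper for illustration only; not part of tianshou.
    """
    # Unavailable actions get -inf so they cannot win the argmax.
    masked = [q if ok else float("-inf") for q, ok in zip(q_values, mask)]
    return max(range(len(masked)), key=masked.__getitem__)


# Action 0 has the highest value but is masked out, so action 1 is chosen.
q_values = [2.5, 0.3, 1.7]
mask = [False, True, False]
action = masked_greedy_action(q_values, mask)
```

With the mask from the example above, only action 1 is eligible, so it is selected even though its value is the lowest of the three.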

Returns:

A Batch with 3 keys:

act – the action.
logits – the network’s raw output.
state – the hidden state.
