psrl#


class PSRLModel(trans_count_prior: ndarray, rew_mean_prior: ndarray, rew_std_prior: ndarray, discount_factor: float, epsilon: float)[source]#

Implementation of the Posterior Sampling Reinforcement Learning (PSRL) model.

Parameters:
  • trans_count_prior – Dirichlet prior (alpha concentration parameters), with shape (n_state, n_action, n_state).

  • rew_mean_prior – means of the normal priors of rewards, with shape (n_state, n_action).

  • rew_std_prior – standard deviations of the normal priors of rewards, with shape (n_state, n_action).

  • discount_factor – in [0, 1].

  • epsilon – for precision control in value iteration.

observe(trans_count: ndarray, rew_sum: ndarray, rew_square_sum: ndarray, rew_count: ndarray) → None[source]#

Add observed data to the memory pool and update the reward and transition posteriors.

For rewards, we start from a normal prior. Once rewards have been observed for a state-action pair, the posterior mean is the empirical mean of those observations (instead of the prior mean), and the posterior standard deviation shrinks in inverse proportion to the number of corresponding observations (see the sketch after the parameter list below).

Parameters:
  • trans_count – the number of observed transitions, with shape (n_state, n_action, n_state).

  • rew_sum – the sum of observed rewards, with shape (n_state, n_action).

  • rew_square_sum – the sum of squared observed rewards, with shape (n_state, n_action).

  • rew_count – the number of observed rewards, with shape (n_state, n_action).
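
The posterior described above can be written down directly. Below is a minimal sketch (an illustrative helper, not the exact bookkeeping inside PSRLModel), assuming accumulated rew_sum and rew_count arrays as documented for this method:

```python
import numpy as np

def reward_posterior(
    rew_mean_prior: np.ndarray,  # prior means, shape (n_state, n_action)
    rew_std_prior: np.ndarray,   # prior stds, shape (n_state, n_action)
    rew_sum: np.ndarray,         # accumulated reward sums, shape (n_state, n_action)
    rew_count: np.ndarray,       # accumulated reward counts, shape (n_state, n_action)
) -> tuple[np.ndarray, np.ndarray]:
    """Posterior mean = empirical mean where observations exist (prior mean otherwise);
    posterior std shrinks in inverse proportion to the observation count."""
    observed = rew_count > 0
    safe_count = np.maximum(rew_count, 1)
    mean = np.where(observed, rew_sum / safe_count, rew_mean_prior)
    std = np.where(observed, rew_std_prior / safe_count, rew_std_prior)
    return mean, std
```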

sample_reward() → ndarray[source]#
sample_trans_prob() → ndarray[source]#
solve_policy() → None[source]#
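
As a rough illustration of what the sampling methods above amount to, the sketch below draws one MDP from the posterior, assuming trans_count holds the Dirichlet concentration parameters and rew_mean / rew_std the normal posterior parameters (names follow the constructor arguments; this is not the exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mdp(
    trans_count: np.ndarray,  # Dirichlet alphas, shape (n_state, n_action, n_state)
    rew_mean: np.ndarray,     # posterior reward means, shape (n_state, n_action)
    rew_std: np.ndarray,      # posterior reward stds, shape (n_state, n_action)
) -> tuple[np.ndarray, np.ndarray]:
    n_state, n_action, _ = trans_count.shape
    # One transition distribution per (state, action) pair.
    trans_prob = np.array([
        [rng.dirichlet(trans_count[s, a]) for a in range(n_action)]
        for s in range(n_state)
    ])
    # One sampled reward per (state, action) pair.
    rew = rng.normal(rew_mean, rew_std)
    return trans_prob, rew
```
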
static value_iteration(trans_prob: ndarray, rew: ndarray, discount_factor: float, eps: float, value: ndarray) → tuple[ndarray, ndarray][source]#

Value iteration solver for MDPs.

Parameters:
  • trans_prob – transition probabilities, with shape (n_state, n_action, n_state).

  • rew – rewards, with shape (n_state, n_action).

  • eps – for precision control.

  • discount_factor – in [0, 1].

  • value – the initial value array, with shape (n_state, ).

Returns:

the optimal policy, with shape (n_state, ), together with the updated value array, with shape (n_state, ).
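
A hedged NumPy sketch of such a solver under the shapes documented above (illustrative, not necessarily the exact implementation):

```python
import numpy as np

def value_iteration(
    trans_prob: np.ndarray,   # (n_state, n_action, n_state)
    rew: np.ndarray,          # (n_state, n_action)
    discount_factor: float,
    eps: float,
    value: np.ndarray,        # (n_state,) initial value estimate
) -> tuple[np.ndarray, np.ndarray]:
    """Iterate Q(s, a) = r(s, a) + gamma * sum_s' P(s'|s, a) * V(s')
    until V changes by less than eps, then return the greedy policy and V."""
    while True:
        q = rew + discount_factor * (trans_prob @ value)  # (n_state, n_action)
        new_value = q.max(axis=1)
        if np.abs(new_value - value).max() < eps:
            break
        value = new_value
    return q.argmax(axis=1), new_value  # policy (n_state,), value (n_state,)
```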

class PSRLPolicy(*, trans_count_prior: ndarray, rew_mean_prior: ndarray, rew_std_prior: ndarray, action_space: Discrete, discount_factor: float = 0.99, epsilon: float = 0.01, add_done_loop: bool = False, observation_space: Space | None = None, lr_scheduler: LRScheduler | MultipleLRSchedulers | None = None)[source]#

Implementation of Posterior Sampling Reinforcement Learning.

Reference: Strens, M. A Bayesian Framework for Reinforcement Learning. ICML 2000, pp. 943-950.

Parameters:
  • trans_count_prior – Dirichlet prior (alpha concentration parameters), with shape (n_state, n_action, n_state).

  • rew_mean_prior – means of the normal priors of rewards, with shape (n_state, n_action).

  • rew_std_prior – standard deviations of the normal priors of rewards, with shape (n_state, n_action).

  • action_space – Env’s action_space.

  • discount_factor – in [0, 1].

  • epsilon – for precision control in value iteration.

  • add_done_loop – whether to add an extra self-loop for the terminal state in the MDP. Defaults to False.

  • observation_space – Env’s observation space.

  • lr_scheduler – a learning rate scheduler; if not None, it will be called in policy.update().

See also

Please refer to BasePolicy for a more detailed explanation.
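
A minimal usage sketch for a tabular environment with n_state observations and n_action actions, assuming PSRLPolicy is importable from tianshou.policy and the action space comes from gymnasium (the uniform priors below are illustrative choices, not recommended defaults):

```python
import numpy as np
from gymnasium.spaces import Discrete
from tianshou.policy import PSRLPolicy

n_state, n_action = 16, 4

policy = PSRLPolicy(
    # Uniform Dirichlet prior over next states for every (state, action) pair.
    trans_count_prior=np.ones((n_state, n_action, n_state)),
    # Zero-mean, unit-std normal priors on rewards.
    rew_mean_prior=np.zeros((n_state, n_action)),
    rew_std_prior=np.ones((n_state, n_action)),
    action_space=Discrete(n_action),
    discount_factor=0.99,
    epsilon=0.01,
)
```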

forward(batch: ObsBatchProtocol, state: dict | BatchProtocol | ndarray | None = None, **kwargs: Any) → ActBatchProtocol[source]#

Compute actions over the given batch data with the PSRL model.

Returns:

A Batch with “act” key containing the action.

See also

Please refer to forward() for a more detailed explanation.
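
Conceptually, acting with a tabular PSRL policy reduces to indexing the array produced by value iteration with the integer state observations. A hedged sketch, where solved_policy stands for that (n_state,) array (an illustrative name, not part of the public API):

```python
import numpy as np

def act(solved_policy: np.ndarray, obs: np.ndarray) -> np.ndarray:
    """Look up the greedy action for each integer state observation."""
    return solved_policy[obs.astype(int)]
```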

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) → TPSRLTrainingStats[source]#

Update policy with a given batch of data.

Returns:

A dataclass object containing the data to be logged (e.g., loss).

Note

To distinguish between the collecting, updating, and testing states, you can check the policy state via self.training and self.updating. Please refer to States for policy for a more detailed explanation.

Warning

If you use torch.distributions.Normal or torch.distributions.Categorical to calculate log_prob, be careful about the shape: the Categorical distribution gives a “[batch_size]” shape, while the Normal distribution gives a “[batch_size, 1]” shape. Automatic broadcasting in torch tensor operations will amplify this error.
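
The sufficient statistics consumed by PSRLModel.observe() can be aggregated from a batch of tabular transitions roughly as below. This is a sketch assuming integer state/action indices in the batch; the actual learn() may differ in details such as terminal-state handling via add_done_loop:

```python
import numpy as np

def aggregate(
    obs: np.ndarray, act: np.ndarray, rew: np.ndarray, obs_next: np.ndarray,
    n_state: int, n_action: int,
) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    trans_count = np.zeros((n_state, n_action, n_state))
    rew_sum = np.zeros((n_state, n_action))
    rew_square_sum = np.zeros((n_state, n_action))
    rew_count = np.zeros((n_state, n_action))
    for s, a, r, s_next in zip(obs, act, rew, obs_next):
        trans_count[s, a, s_next] += 1
        rew_sum[s, a] += r
        rew_square_sum[s, a] += r ** 2
        rew_count[s, a] += 1
    # The four arrays can then be passed to PSRLModel.observe(...).
    return trans_count, rew_sum, rew_square_sum, rew_count
```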

class PSRLTrainingStats(*, train_time: float = 0.0, smoothed_loss: dict = <factory>, psrl_rew_mean: float = 0.0, psrl_rew_std: float = 0.0)[source]#
psrl_rew_mean: float = 0.0#
psrl_rew_std: float = 0.0#