tianshou.trainer
- tianshou.trainer.offpolicy_trainer(policy: tianshou.policy.base.BasePolicy, train_collector: tianshou.data.collector.Collector, test_collector: tianshou.data.collector.Collector, max_epoch: int, step_per_epoch: int, step_per_collect: int, episode_per_test: int, batch_size: int, update_per_step: Union[int, float] = 1, train_fn: Optional[Callable[[int, int], None]] = None, test_fn: Optional[Callable[[int, Optional[int]], None]] = None, stop_fn: Optional[Callable[[float], bool]] = None, save_fn: Optional[Callable[[tianshou.policy.base.BasePolicy], None]] = None, reward_metric: Optional[Callable[[numpy.ndarray], numpy.ndarray]] = None, logger: tianshou.utils.log_tools.BaseLogger = <tianshou.utils.log_tools.LazyLogger object>, verbose: bool = True, test_in_train: bool = True) → Dict[str, Union[float, str]]
A wrapper for the off-policy trainer procedure.
The “step” in trainer means an environment step (a.k.a. transition).
- Parameters
  - policy – an instance of the BasePolicy class.
  - train_collector (Collector) – the collector used for training.
  - test_collector (Collector) – the collector used for testing.
  - max_epoch (int) – the maximum number of epochs for training. The training process might finish before reaching max_epoch if stop_fn is set.
  - step_per_epoch (int) – the number of transitions collected per epoch.
  - step_per_collect (int) – the number of transitions the collector would collect before the network update, i.e., the trainer will collect "step_per_collect" transitions and then perform several policy network updates, repeatedly within each epoch.
  - episode_per_test – the number of episodes for one policy evaluation.
  - batch_size (int) – the batch size of the sample data that is fed into the policy network.
  - update_per_step (int/float) – the number of times the policy network is updated per transition after (step_per_collect) transitions are collected, e.g., if update_per_step is set to 0.3 and step_per_collect is 256, the policy will be updated round(256 * 0.3) = round(76.8) = 77 times after every 256 transitions collected by the collector. Default to 1.
  - train_fn (function) – a hook called at the beginning of training in each epoch. It can be used to perform custom additional operations, with the signature f(num_epoch: int, step_idx: int) -> None.
  - test_fn (function) – a hook called at the beginning of testing in each epoch. It can be used to perform custom additional operations, with the signature f(num_epoch: int, step_idx: int) -> None.
  - save_fn (function) – a hook called when the undiscounted average mean reward in the evaluation phase gets better, with the signature f(policy: BasePolicy) -> None.
  - stop_fn (function) – a function with signature f(mean_rewards: float) -> bool that receives the average undiscounted returns of the testing result and returns a boolean indicating whether the goal has been reached.
  - reward_metric (function) – a function with signature f(rewards: np.ndarray with shape (num_episode, agent_num)) -> np.ndarray with shape (num_episode,), used in multi-agent RL. We need to return a single scalar for each episode's result to monitor training in the multi-agent RL setting. This function specifies what the desired metric is, e.g., the reward of agent 1 or the average reward over all agents.
  - logger (BaseLogger) – a logger that logs statistics during training/testing/updating. Default to a logger that doesn't log anything.
  - verbose (bool) – whether to print information. Default to True.
  - test_in_train (bool) – whether to test in the training phase. Default to True.
- Returns
  See gather_info().
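For orientation, here is a minimal calling sketch. It assumes that policy, train_collector, and test_collector have already been constructed elsewhere (e.g., a DQN-style policy with replay-buffer-backed collectors), and every numeric value below is an illustrative placeholder rather than a recommendation.

    from tianshou.trainer import offpolicy_trainer

    # Assumed to exist already: `policy` (BasePolicy), plus `train_collector`
    # and `test_collector` (Collector) built on top of a replay buffer and envs.
    result = offpolicy_trainer(
        policy,
        train_collector,
        test_collector,
        max_epoch=10,            # train for at most 10 epochs
        step_per_epoch=10000,    # 10k environment transitions per epoch
        step_per_collect=10,     # collect 10 transitions before each update round
        episode_per_test=100,    # evaluate over 100 episodes
        batch_size=64,           # minibatch size sampled for each gradient step
        update_per_step=0.1,     # 0.1 gradient steps per collected transition
        stop_fn=lambda mean_rewards: mean_rewards >= 195,  # hypothetical reward goal
    )
    print(result)  # the dictionary described by gather_info()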
- tianshou.trainer.onpolicy_trainer(policy: tianshou.policy.base.BasePolicy, train_collector: tianshou.data.collector.Collector, test_collector: tianshou.data.collector.Collector, max_epoch: int, step_per_epoch: int, repeat_per_collect: int, episode_per_test: int, batch_size: int, step_per_collect: Optional[int] = None, episode_per_collect: Optional[int] = None, train_fn: Optional[Callable[[int, int], None]] = None, test_fn: Optional[Callable[[int, Optional[int]], None]] = None, stop_fn: Optional[Callable[[float], bool]] = None, save_fn: Optional[Callable[[tianshou.policy.base.BasePolicy], None]] = None, reward_metric: Optional[Callable[[numpy.ndarray], numpy.ndarray]] = None, logger: tianshou.utils.log_tools.BaseLogger = <tianshou.utils.log_tools.LazyLogger object>, verbose: bool = True, test_in_train: bool = True) → Dict[str, Union[float, str]]
A wrapper for the on-policy trainer procedure.
The “step” in trainer means an environment step (a.k.a. transition).
- Parameters
  - policy – an instance of the BasePolicy class.
  - train_collector (Collector) – the collector used for training.
  - test_collector (Collector) – the collector used for testing.
  - max_epoch (int) – the maximum number of epochs for training. The training process might finish before reaching max_epoch if stop_fn is set.
  - step_per_epoch (int) – the number of transitions collected per epoch.
  - repeat_per_collect (int) – the number of times the policy learns from each collected batch; for example, setting it to 2 means the policy learns each given batch of data twice.
  - episode_per_test (int) – the number of episodes for one policy evaluation.
  - batch_size (int) – the batch size of the sample data that is fed into the policy network.
  - step_per_collect (int) – the number of transitions the collector would collect before the network update, i.e., the trainer will collect "step_per_collect" transitions and then perform several policy network updates, repeatedly within each epoch.
  - episode_per_collect (int) – the number of episodes the collector would collect before the network update, i.e., the trainer will collect "episode_per_collect" episodes and then perform several policy network updates, repeatedly within each epoch.
  - train_fn (function) – a hook called at the beginning of training in each epoch. It can be used to perform custom additional operations, with the signature f(num_epoch: int, step_idx: int) -> None.
  - test_fn (function) – a hook called at the beginning of testing in each epoch. It can be used to perform custom additional operations, with the signature f(num_epoch: int, step_idx: int) -> None.
  - save_fn (function) – a hook called when the undiscounted average mean reward in the evaluation phase gets better, with the signature f(policy: BasePolicy) -> None.
  - stop_fn (function) – a function with signature f(mean_rewards: float) -> bool that receives the average undiscounted returns of the testing result and returns a boolean indicating whether the goal has been reached.
  - reward_metric (function) – a function with signature f(rewards: np.ndarray with shape (num_episode, agent_num)) -> np.ndarray with shape (num_episode,), used in multi-agent RL. We need to return a single scalar for each episode's result to monitor training in the multi-agent RL setting. This function specifies what the desired metric is, e.g., the reward of agent 1 or the average reward over all agents.
  - logger (BaseLogger) – a logger that logs statistics during training/testing/updating. Default to a logger that doesn't log anything.
  - verbose (bool) – whether to print information. Default to True.
  - test_in_train (bool) – whether to test in the training phase. Default to True.
- Returns
  See gather_info().
Note
Only one of step_per_collect and episode_per_collect may be specified.
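A minimal calling sketch for the on-policy wrapper follows; policy and both collectors are assumed to be built already (e.g., a PPO-style policy), the numbers are placeholders, and episode_per_collect is chosen here so that step_per_collect can be left unset, in line with the note above.

    from tianshou.trainer import onpolicy_trainer

    # Assumed to exist already: `policy`, `train_collector`, `test_collector`.
    result = onpolicy_trainer(
        policy,
        train_collector,
        test_collector,
        max_epoch=10,             # at most 10 epochs
        step_per_epoch=50000,     # 50k environment transitions per epoch
        repeat_per_collect=4,     # learn each collected batch 4 times
        episode_per_test=10,      # evaluate over 10 episodes
        batch_size=256,           # minibatch size for each policy update
        episode_per_collect=16,   # collect 16 episodes before each update round
        # step_per_collect is deliberately left unset: only one of
        # step_per_collect / episode_per_collect may be specified.
    )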
- tianshou.trainer.offline_trainer(policy: tianshou.policy.base.BasePolicy, buffer: tianshou.data.buffer.base.ReplayBuffer, test_collector: tianshou.data.collector.Collector, max_epoch: int, update_per_epoch: int, episode_per_test: int, batch_size: int, test_fn: Optional[Callable[[int, Optional[int]], None]] = None, stop_fn: Optional[Callable[[float], bool]] = None, save_fn: Optional[Callable[[tianshou.policy.base.BasePolicy], None]] = None, reward_metric: Optional[Callable[[numpy.ndarray], numpy.ndarray]] = None, logger: tianshou.utils.log_tools.BaseLogger = <tianshou.utils.log_tools.LazyLogger object>, verbose: bool = True) → Dict[str, Union[float, str]]
A wrapper for the offline trainer procedure.
The “step” in offline trainer means a gradient step.
- Parameters
  - policy – an instance of the BasePolicy class.
  - buffer (ReplayBuffer) – the replay buffer providing pre-collected experience for offline training.
  - test_collector (Collector) – the collector used for testing.
  - max_epoch (int) – the maximum number of epochs for training. The training process might finish before reaching max_epoch if stop_fn is set.
  - update_per_epoch (int) – the number of policy network updates, so-called gradient steps, per epoch.
  - episode_per_test – the number of episodes for one policy evaluation.
  - batch_size (int) – the batch size of the sample data that is fed into the policy network.
  - test_fn (function) – a hook called at the beginning of testing in each epoch. It can be used to perform custom additional operations, with the signature f(num_epoch: int, step_idx: int) -> None.
  - save_fn (function) – a hook called when the undiscounted average mean reward in the evaluation phase gets better, with the signature f(policy: BasePolicy) -> None.
  - stop_fn (function) – a function with signature f(mean_rewards: float) -> bool that receives the average undiscounted returns of the testing result and returns a boolean indicating whether the goal has been reached.
  - reward_metric (function) – a function with signature f(rewards: np.ndarray with shape (num_episode, agent_num)) -> np.ndarray with shape (num_episode,), used in multi-agent RL. We need to return a single scalar for each episode's result to monitor training in the multi-agent RL setting. This function specifies what the desired metric is, e.g., the reward of agent 1 or the average reward over all agents.
  - logger (BaseLogger) – a logger that logs statistics during updating/testing. Default to a logger that doesn't log anything.
  - verbose (bool) – whether to print information. Default to True.
- Returns
  See gather_info().
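A minimal calling sketch, assuming policy, a pre-filled ReplayBuffer named buffer, and test_collector already exist; the numeric values are placeholders.

    from tianshou.trainer import offline_trainer

    # Assumed to exist already: `policy`, a pre-populated `buffer` (ReplayBuffer)
    # holding logged experience, and `test_collector` for evaluation.
    result = offline_trainer(
        policy,
        buffer,
        test_collector,
        max_epoch=20,            # at most 20 epochs
        update_per_epoch=1000,   # 1000 gradient steps per epoch
        episode_per_test=10,     # evaluate over 10 episodes
        batch_size=256,          # minibatch size sampled from the buffer
    )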
- tianshou.trainer.test_episode(policy: tianshou.policy.base.BasePolicy, collector: tianshou.data.collector.Collector, test_fn: Optional[Callable[[int, Optional[int]], None]], epoch: int, n_episode: int, logger: Optional[tianshou.utils.log_tools.BaseLogger] = None, global_step: Optional[int] = None, reward_metric: Optional[Callable[[numpy.ndarray], numpy.ndarray]] = None) → Dict[str, Any]
A simple wrapper for testing the policy in a collector.
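A hedged usage sketch, assuming policy and test_collector already exist; the logging and reward-metric arguments are simply omitted here.

    from tianshou.trainer import test_episode

    # Assumed to exist already: `policy` and `test_collector`.
    stats = test_episode(policy, test_collector, test_fn=None, epoch=0, n_episode=10)
    print(stats)  # collection statistics, returned as Dict[str, Any]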
- tianshou.trainer.gather_info(start_time: float, train_c: Optional[tianshou.data.collector.Collector], test_c: tianshou.data.collector.Collector, best_reward: float, best_reward_std: float) → Dict[str, Union[float, str]]
A simple wrapper for gathering information from collectors.
- Returns
  A dictionary with the following keys:
  - train_step – the total number of steps collected by the training collector;
  - train_episode – the total number of episodes collected by the training collector;
  - train_time/collector – the time spent collecting transitions in the training collector;
  - train_time/model – the time spent training models;
  - train_speed – the speed of training (env_step per second);
  - test_step – the total number of steps collected by the test collector;
  - test_episode – the total number of episodes collected by the test collector;
  - test_time – the time spent testing;
  - test_speed – the speed of testing (env_step per second);
  - best_reward – the best reward over the test results;
  - duration – the total elapsed time.
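As an illustrative sketch, the dictionary returned by the trainer wrappers (and assembled by gather_info()) can be read using the keys listed above:

    # `result` is the dictionary returned by offpolicy_trainer / onpolicy_trainer /
    # offline_trainer, i.e. the output of gather_info().
    print("best reward:  ", result["best_reward"])
    print("total elapsed:", result["duration"])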