Benchmark

MuJoCo Benchmark

Tianshou's MuJoCo benchmark contains state-of-the-art results.

Every experiment is conducted with 10 random seeds for 1-10M steps. Please refer to https://github.com/thu-ml/tianshou/tree/master/examples/mujoco for the source code and detailed results.
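Each benchmark run is a standalone example script. As a rough sketch of how one task might be reproduced (the script name `mujoco_sac.py` and the `--task`/`--seed` flags are assumptions based on the examples/mujoco directory layout; check the repository for the exact interface), one could sweep the 10 seeds of a single task from Python:

```python
# Hypothetical sweep of one MuJoCo task over the 10 benchmark seeds.
# Script name and flags are assumed, not guaranteed by this page.
import subprocess

for seed in range(10):
    subprocess.run(
        ["python", "mujoco_sac.py", "--task", "Ant-v3", "--seed", str(seed)],
        check=True,  # abort the sweep if any run fails
    )
```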



The table below compares the performance of Tianshou against published results on OpenAI Gym MuJoCo benchmarks. We use the max average return in 1M timesteps as the reward metric. ~ means the result is approximated from the plots because quantitative results are not provided. / means results are not provided. The best-performing baseline on each task is highlighted in boldface. IPendulum and IDPendulum abbreviate InvertedPendulum and InvertedDoublePendulum. Referenced baselines include the TD3 paper, SAC paper, PPO paper, ACKTR paper, OpenAI Baselines, and Spinning Up.

| Task | Ant | HalfCheetah | Hopper | Walker2d | Swimmer | Humanoid | Reacher | IPendulum | IDPendulum |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **DDPG** | | | | | | | | | |
| Tianshou | 990.4 | **11718.7** | **2197.0** | 1400.6 | **144.1** | **177.3** | **-3.3** | **1000.0** | 8364.3 |
| TD3 Paper | **1005.3** | 3305.6 | 2020.5 | 1843.6 | / | / | -6.5 | **1000.0** | **9355.5** |
| TD3 Paper (Our) | 888.8 | 8577.3 | 1860.0 | **3098.1** | / | / | -4.0 | **1000.0** | 8370.0 |
| Spinning Up | ~840 | ~11000 | ~1800 | ~1950 | ~137 | / | / | / | / |
| **TD3** | | | | | | | | | |
| Tianshou | **5116.4** | **10201.2** | 3472.2 | 3982.4 | **104.2** | **5189.5** | **-2.7** | **1000.0** | **9349.2** |
| TD3 Paper | 4372.4 | 9637.0 | **3564.1** | **4682.8** | / | / | -3.6 | **1000.0** | 9337.5 |
| Spinning Up | ~3800 | ~9750 | ~2860 | ~4000 | ~78 | / | / | / | / |
| **SAC** | | | | | | | | | |
| Tianshou | **5850.2** | **12138.8** | **3542.2** | **5007.0** | **44.4** | **5488.5** | **-2.6** | **1000.0** | **9359.5** |
| SAC Paper | ~3720 | ~10400 | ~3370 | ~3740 | / | ~5200 | / | / | / |
| TD3 Paper | 655.4 | 2347.2 | 2996.7 | 1283.7 | / | / | -4.4 | **1000.0** | 8487.2 |
| Spinning Up | ~3980 | ~11520 | ~3150 | ~4250 | ~41.7 | / | / | / | / |
| **A2C** | | | | | | | | | |
| Tianshou | **3485.4** | **1829.9** | **1253.2** | **1091.6** | **36.6** | **1726.0** | **-6.7** | **1000.0** | **9257.7** |
| PPO Paper | / | ~1000 | ~900 | ~850 | ~31 | / | ~-24 | ~1000 | ~7100 |
| PPO Paper (TR) | / | ~930 | ~1220 | ~700 | ~36 | / | ~-27 | ~1000 | ~8100 |
| **PPO** | | | | | | | | | |
| Tianshou | **3258.4** | **5783.9** | **2609.3** | **3588.5** | 66.7 | **787.1** | **-4.1** | **1000.0** | **9231.3** |
| PPO Paper | / | ~1800 | ~2330 | ~3460 | ~108 | / | ~-7 | ~1000 | ~8000 |
| TD3 Paper | 1083.2 | 1795.4 | 2164.7 | 3317.7 | / | / | -6.2 | **1000.0** | 8977.9 |
| OpenAI Baselines | / | ~1700 | ~2400 | ~3510 | ~111 | / | ~-6 | ~940 | ~7350 |
| Spinning Up | ~650 | ~1670 | ~1850 | ~1230 | **~120** | / | / | / | / |
| **TRPO** | | | | | | | | | |
| Tianshou | **2866.7** | **4471.2** | 2046.0 | **3826.7** | 40.9 | **810.1** | -5.1 | **1000.0** | **8435.2** |
| ACKTR Paper | ~0 | ~400 | ~1400 | ~550 | ~40 | / | -8 | ~1000 | ~800 |
| PPO Paper | / | ~0 | ~2100 | ~1100 | **~121** | / | ~-115 | ~1000 | ~200 |
| TD3 Paper | -75.9 | -15.6 | **2471.3** | 2321.5 | / | / | -111.4 | 985.4 | 205.9 |
| OpenAI Baselines | / | ~1350 | ~2200 | ~2350 | ~95 | / | **~-5** | ~910 | ~7000 |
| Spinning Up (TF) | ~150 | ~850 | ~1200 | ~600 | ~85 | / | / | / | / |
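For concreteness, "max average return" can be read as: average the evaluation return over seeds at each checkpoint, then take the maximum of that averaged curve over training. A minimal sketch, assuming the per-seed evaluation curves are available as a NumPy array (the array layout here is an assumption for illustration):

```python
import numpy as np

def max_average_return(returns: np.ndarray) -> float:
    """returns has shape (n_seeds, n_checkpoints), one evaluation return per
    seed per checkpoint. Average over seeds first, then take the maximum of
    the averaged curve over the course of training."""
    return float(np.max(np.mean(returns, axis=0)))

# Toy usage: 10 seeds, 200 evaluation checkpoints over 1M timesteps.
rng = np.random.default_rng(0)
toy_curves = rng.normal(np.linspace(0.0, 3000.0, 200), 300.0, size=(10, 200))
print(max_average_return(toy_curves))
```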

Average runtime over 8 MuJoCo benchmark tasks is listed below. All results are obtained on a single Nvidia TITAN X GPU with up to 48 CPU cores (at most one CPU core per thread).

| Algorithm | # of Envs | Time (1M timesteps) | Collecting (%) | Updating (%) | Evaluating (%) | Others (%) |
| --- | --- | --- | --- | --- | --- | --- |
| DDPG | 1 | 2.9h | 12.0 | 80.2 | 2.4 | 5.4 |
| TD3 | 1 | 3.3h | 11.4 | 81.7 | 1.7 | 5.2 |
| SAC | 1 | 5.2h | 10.9 | 83.8 | 1.8 | 3.5 |
| REINFORCE | 64 | 4min | 84.9 | 1.8 | 12.5 | 0.8 |
| A2C | 16 | 7min | 62.5 | 28.0 | 6.6 | 2.9 |
| PPO | 64 | 24min | 11.4 | 85.3 | 3.2 | 0.2 |
| NPG | 16 | 7min | 65.1 | 24.9 | 9.5 | 0.6 |
| TRPO | 16 | 7min | 62.9 | 26.5 | 10.1 | 0.6 |
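The percentage columns are wall-clock shares of the phases of the training loop. A minimal sketch of how such a breakdown might be measured (the phase callables below are dummy stand-ins, not Tianshou APIs):

```python
import time
from collections import defaultdict

def profile_phases(phases: dict, n_iters: int = 100) -> dict:
    """Time each named phase over n_iters loop iterations and return each
    phase's share of total wall-clock time, in percent."""
    elapsed = defaultdict(float)
    for _ in range(n_iters):
        for name, fn in phases.items():
            start = time.perf_counter()
            fn()
            elapsed[name] += time.perf_counter() - start
    total = sum(elapsed.values())
    return {name: 100.0 * t / total for name, t in elapsed.items()}

# Dummy stand-ins for the collecting/updating/evaluating phases:
print(profile_phases({
    "Collecting": lambda: time.sleep(0.004),
    "Updating": lambda: time.sleep(0.010),
    "Evaluating": lambda: time.sleep(0.001),
}))
```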

Atari Benchmark

Tianshou also provides a reliable and reproducible Atari 10M benchmark.

Every experiment is conducted with 10 random seeds for 10M steps. Please refer to https://github.com/thu-ml/tianshou/tree/master/examples/atari for the source code, and to https://wandb.ai/tianshou/atari.benchmark/reports/Atari-Benchmark--VmlldzoxOTA1NzA5 for detailed results hosted on wandb.



The table below compares the performance of Tianshou against published results on Atari games. We use the max average return in 10M timesteps as the reward metric (to be consistent with MuJoCo). / means results are not provided. The best-performing baseline on each task is highlighted in boldface. Referenced baselines include Google Dopamine and OpenAI Baselines.

| Task | Pong | Breakout | Enduro | Qbert | MsPacman | Seaquest | SpaceInvaders |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **DQN** | | | | | | | |
| Tianshou | **20.2 ± 2.3** | **133.5 ± 44.6** | 997.9 ± 180.6 | **11620.2 ± 786.1** | 2324.8 ± 359.8 | **3213.9 ± 381.6** | 947.9 ± 155.3 |
| Dopamine | 9.8 | 92.2 | **2126.9** | 6836.7 | **2451.3** | 1406.6 | **1559.1** |
| OpenAI Baselines | 16.5 | 131.5 | 479.8 | 3254.8 | / | 1164.1 | 1129.5 ± 145.3 |
| **C51** | | | | | | | |
| Tianshou | **20.6 ± 2.4** | **412.9 ± 35.8** | **940.8 ± 133.9** | **12513.2 ± 1274.6** | 2254.9 ± 201.2 | **3305.4 ± 1524.3** | 557.3 |
| Dopamine | 17.4 | 222.4 | 665.3 | 9924.5 | **2860.4** | 1706.6 | **604.6 ± 157.5** |
| **Rainbow** | | | | | | | |
| Tianshou | **20.2 ± 3.0** | **440.4 ± 50.1** | 1496.1 ± 112.3 | 14224.8 ± 1230.1 | 2524.2 ± 338.8 | 1934.6 ± 376.4 | **1178.4** |
| Dopamine | 19.1 | 47.9 | **2185.1** | **15682.2** | **3161.7** | **3328.9** | 459.9 |
| **IQN** | | | | | | | |
| Tianshou | **20.7 ± 2.9** | **355.9 ± 22.7** | **1252.7 ± 118.1** | **14409.2 ± 808.6** | 2228.6 ± 253.1 | 5341.2 ± 670.2 | 667.8 ± 81.5 |
| Dopamine | 19.6 | 96.3 | 1227.6 | 12496.7 | **4422.7** | **16418** | **1358.2 ± 267.6** |
| **PPO** | | | | | | | |
| Tianshou | **20.3 ± 1.2** | **283.0 ± 74.3** | **1098.9 ± 110.5** | **12341.8 ± 1760.7** | **1699.4 ± 248.0** | 1035.2 ± 353.6 | 1641.3 |
| OpenAI Baselines | 13.7 | 114.3 | 350.2 | 7012.1 | / | **1218.9** | **1787.5 ± 340.8** |
| **QR-DQN** | | | | | | | |
| Tianshou | **20.7 ± 2.0** | **228.3 ± 27.3** | **951.7 ± 333.5** | **14761.5 ± 862.9** | **2259.3 ± 269.2** | **4187.6 ± 725.7** | **1114.7 ± 116.9** |
| **FQF** | | | | | | | |
| Tianshou | **20.4 ± 2.5** | **382.6 ± 29.5** | **1816.8 ± 314.3** | **15301.2 ± 684.1** | **2506.6 ± 402.5** | **8051.5 ± 3155.6** | **2558.3** |

Please note that the comparison tables for the two benchmarks should NOT be used to conclude which implementation is "better". The hyperparameters of the algorithms vary across implementations. Also, the reward metric is not strictly the same (e.g., Tianshou uses the max average return within 10M steps, whereas OpenAI Baselines only report the average return at 10M steps, which biases the comparison in Tianshou's favor). Lastly, Tianshou always uses 10 random seeds while others might use fewer. The comparison is here only to show Tianshou's reliability.