Imports and modules:
from config import (np, plt, print_qtable, Maze, QLearner, plot_q_table,
plot_greedy_policy, tf, e_trajectory,
ClassicalDDPG, trainer, plot_training_log, run_correction)
%matplotlib inline
If the progress bar by tqdm (trange) in this document does not work, run this:
!jupyter nbextension enable --py widgetsnbextension
Enabling notebook extension jupyter-js-widgets/extension...
- Validating: OK
image by GeekStyle
DeepMind, 2015 & 2017: AlphaGo & AlphaZero
OpenAI, 2019: hide-and-seek
DeepMind & EPFL, 2022: tokamak control
DeepMind, 2022: AlphaTensor
UZH & Intel Labs, 2023: Drone racing
There are various RL algorithms suitable for different types of tasks
Often the choice of algorithm depends on whether we deal with discrete or continuous state-action spaces
We will go through:
env = Maze(height=3, width=5)
env.plot(title='Initial state');
# Take some actions
env.plot(title='Initial state')
env.step(action='up')
env.plot()
env.step(action='right')
env.plot();
Exercise 1
Using the reward definitions from the previous slide, try to calculate the cumulative rewards for the trajectories shown below. Can you tell which of the paths are equally good / bad?
image by D. Silver - Lecture on RL
$S = \{\text{Class 1, Class 2, Class 3, Facebook, Pub, Pass, Sleep}\}$
Note that "Sleep" is also called a terminal state, because once in it we will never leave it.
image by D. Silver - Lecture on RL
$\Rightarrow$ Return: $G_0 = (-2) + 0.5 \cdot (-2) + 0.5^2 \cdot (-2) + 0.5^3 \cdot (+10) = -2.25$ (with $\gamma = 0.5$)
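To make the discounting explicit, here is a minimal sketch (plain Python, not part of the lecture code) that reproduces this return:
# Minimal sketch (not part of the lecture code): discounted return
# G_0 = sum_k gamma^k * r_k for the trajectory Class 1 -> Class 2 -> Class 3 -> Pass
gamma = 0.5
rewards = [-2, -2, -2, +10]
G_0 = sum(gamma**k * r for k, r in enumerate(rewards))
print(G_0)  # -2.25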
image by D. Silver - Lecture on RL
N.B.: stochastic state transitions are still allowed (if we decide to go to the Pub, anything can happen).
Today we will work with fully deterministic MDPs only.
Episodic MDP
Continuous MDP
Exercise 2
Let's get back to the maze! For now we do not care about optimal decisions. Instead, try to implement a random policy, i.e. every action $a \in \{\text{'up', 'down', 'left', 'right'}\}$ is picked with equal probability no matter what state the agent is in.
a) Initialize a Maze with height=3, width=2 and complete the all_actions list.
b) Look at every step of the output: are the movement of the agent (x) and the rewards obtained consistent with your expectations?
c) Change the random number seed, rerun and observe.
np.random.seed(123457)
env = Maze(height=3, width=2) # FILL HERE)
env.plot(title='Initial state')
all_actions = ['up', 'down', 'left', 'right'] # ... FILL HERE]
done = False
while not done:
action = np.random.choice(all_actions)
state, action, reward, new_state, done = env.step(action)
env.plot();
There are many different algorithms for finding the optimal policy $\pi^*$
They all have their pros and cons
Today: we are going to look at Q-learning. It is one of the core ideas of many RL algorithms, such as the ones shown below.
image by OpenAI - Spinning Up
image adapted from S. Levine, "Deep Reinforcement Learning" (lecture)
Using the Q-function to solve the RL problem
image by AssemblyAI
image by Berkeley AI course
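Before we run it, here is a minimal sketch of the tabular Q-learning update rule. The function below is purely illustrative: the q_table layout and the hyperparameters alpha and gamma are assumptions, not the internals of the QLearner class used next.
# Illustrative tabular Q-learning update (not the actual QLearner implementation):
# Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
def q_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # assumed layout: q_table[state][action] -> Q-value
    td_target = r + gamma * max(q_table[s_next].values())
    # (for a terminal s_next one would simply use td_target = r)
    q_table[s][a] += alpha * (td_target - q_table[s][a])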
np.random.seed(0)
# Initialize small maze environment
env = Maze(width=2, height=2, fire_positions=[[1, 0]])
_ = env.plot(add_player_position=False)
# Initialize Q-learner with Q-table
qtable_learner = QLearner(env, q_function='table')
print('Initial Q-table')
q_table = qtable_learner.q_func.get_q_table()
print_qtable(q_table)
Initial Q-table
+--------+-----+------+------+-------+
| s \ a  | up  | down | left | right |
+--------+-----+------+------+-------+
| (0, 0) | 0.0 | 0.0  | 0.0  | 0.0   |
| (0, 1) | 0.0 | 0.0  | 0.0  | 0.0   |
| (1, 0) | 0.0 | 0.0  | 0.0  | 0.0   |
| (1, 1) | 0.0 | 0.0  | 0.0  | 0.0   |
+--------+-----+------+------+-------+
qtable_learner.train(200)
print('Q-table after 200 episodes')
q_table = qtable_learner.q_func.get_q_table()
print_qtable(q_table)
Q-table after 200 episodes
+--------+------+------+------+-------+
| s \ a  |  up  | down | left | right |
+--------+------+------+------+-------+
| (0, 0) | 19.9 | 7.8  | 7.9  | 7.1   |
| (0, 1) | 14.8 | 11.0 | 14.1 | 26.6  |
| (1, 0) | 22.0 | 5.0  | 8.7  | 4.4   |
| (1, 1) | 0.0  | 0.0  | 0.0  | 0.0   |
+--------+------+------+------+-------+
qtable_learner.train(300)
print('Q-table after 500 episodes')
q_table = qtable_learner.q_func.get_q_table()
print_qtable(q_table)
Q-table after 500 episodes
+--------+------+------+------+-------+
| s \ a  |  up  | down | left | right |
+--------+------+------+------+-------+
| (0, 0) | 28.3 | 22.0 | 22.0 | 17.4  |
| (0, 1) | 23.8 | 25.8 | 24.1 | 29.9  |
| (1, 0) | 28.8 | 15.9 | 23.8 | 15.7  |
| (1, 1) | 0.0  | 0.0  | 0.0  | 0.0   |
+--------+------+------+------+-------+
qtable_learner.plot_training_evolution()
Exercise 3
a) Based on the evolution of the Q-values on the previous slide, would you consider the training to be complete after 500 episodes?
b) Play with the number of episodes in the cell below until you find convergence.
c) Observe that some of the Q-table values converge earlier than others during training. Why could that be?
np.random.seed(0)
env = Maze(width=2, height=2, fire_positions=[[1, 0]])
qtable_learner = QLearner(env, q_function='table')
qtable_learner.train(2000) # FILL HERE)
qtable_learner.plot_training_evolution()
Exercise 4
a) Initialize a bigger maze with width=4, height=3, and fire_positions=[[2, 1], [2, 2]], and use q_function='table' in the QLearner class. Then train it for 5000 episodes.
b) Once the training is finished, plot the Q-values (you can just execute the cell, it is already complete).
Next to each little arrow there is a number that denotes the Q-value of the corresponding action on that field. The red arrow indicates the action with the highest Q-value.
c) Finally, also plot the (greedy) policy by executing the third cell. Compare it to the Q-value plot to verify that we indeed always pick the action with the highest Q-value. Are there fields where two actions would be equally good (which ones)? Can you confirm that by looking at the Q-values?
# Exercise 4 a)
np.random.seed(123456)
env = Maze(width=4, # FILL HERE,
height=3, # FILL HERE,
fire_positions=[[2, 1], [2, 2]]) # FILL HERE)
qtable_learner = QLearner(env, q_function='table') # FILL HERE)
qtable_learner.train(5000) # FILL HERE)
# In case the training does not work for you for some reason
# you can reload the qtable from a trained agent from file.
# Don't forget to initialize the env as suggested in the
# exercise ...
# Note that the q evolution history of training is not saved
# and will hence not be displayed when reloading from file.
# qtable_learner.q_func.load_q_table('saved_agents/qtable_ex4.json')
# Exercise 4 b)
q_table = qtable_learner.q_func.get_q_table()
ax = env.plot(add_player_position=False, title=False)
plot_q_table(q_table, env.target_position, env.fire_positions, ax=ax)
# Exercise 4 c)
policy = qtable_learner.q_func.get_greedy_policy()
ax = env.plot(add_player_position=False, title=False)
plot_greedy_policy(policy, env.target_position, env.fire_positions, ax=ax)
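For intuition, extracting the greedy policy is nothing more than taking the argmax over the Q-values of each state. A minimal sketch, assuming the Q-table is laid out as a dict mapping state -> {action: Q-value}; this is not necessarily how get_greedy_policy is implemented:
# Illustrative greedy-policy extraction from a Q-table laid out as
# {state: {action: Q-value}} -- an assumed layout, shown for intuition only.
def greedy_policy(q_table):
    return {state: max(q_values, key=q_values.get)
            for state, q_values in q_table.items()}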
Exercise 5 (optional)
a) Using the same maze as above, reduce the punishment for going through fire by setting fire_reward=-2 (instead of -10) in the environment definition.
b) Retrain the agent. How does the policy change compared to Ex. 4? Can you explain why?
np.random.seed(123456)
# Env definition
env = Maze(width=4, height=3, fire_positions=[[2, 1], [2, 2]], fire_reward=-2) # FILL HERE)
qtable_learner = QLearner(env, q_function='table')
qtable_learner.train(500)
# If you have issues with the training, please comment out the
# qtable_learner.train(...) call above and instead reload the
# Q-table by uncommenting the following line.
# qtable_learner.q_func.load_q_table('saved_agents/qtable_ex5.json')
# Show Q-values
q_table = qtable_learner.q_func.get_q_table()
ax = env.plot(add_player_position=False, title=False)
plot_q_table(q_table, env.target_position, env.fire_positions, ax=ax)
# Show policy
policy = qtable_learner.q_func.get_greedy_policy()
ax = env.plot(add_player_position=False, title=False)
plot_greedy_policy(policy, env.target_position, env.fire_positions, ax=ax)
Exercise 6
a) Repeat the same steps as in Ex. 4 for the Q-table learner, but this time using q_function='net' as an argument in the QLearner class. Train it for 1500 episodes. This will take a couple of minutes.
b) Compare the Q-values and policy to the one obtained with Q-table learning. Do you see differences? Why could that be?
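Conceptually, q_function='net' replaces the lookup table by a small neural network that maps a state to one Q-value per action. A minimal Keras sketch of such a network is shown below; the layer sizes and activations are illustrative assumptions, not the actual architecture used by QLearner:
# Illustrative Q-network: maps a state (x, y) to one Q-value per action.
# Layer sizes/activations are assumptions, not the QLearner architecture.
q_net = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),                    # state: (x, y) position in the maze
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(4),                      # Q(s, up), Q(s, down), Q(s, left), Q(s, right)
])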
tf.keras.utils.set_random_seed(0)
env = Maze(width=4, height=3, fire_positions=[[2, 1], [2, 2]])
qnet_learner = QLearner(env, q_function='net') # FILL HERE)
qnet_learner.train(1500) # FILL HERE)
# Again, if you face any issues with model training, please use
# the saved Q-net weights of a trained agent by uncommenting
# the following line. You can then comment out the training line
# in the previous cell.
# qnet_learner.q_func.load_model('saved_agents/qnet_ex6')
q_table = qnet_learner.q_func.get_q_table()
ax = env.plot(add_player_position=False, title=False)
plot_q_table(q_table, env.target_position, env.fire_positions, ax=ax)
policy = qnet_learner.q_func.get_greedy_policy()
ax = env.plot(add_player_position=False, title=False)
plot_greedy_policy(policy, env.target_position, env.fire_positions, ax=ax)
Q-table:
Easy to understand and validate
Discrete $S$, $A$ spaces only
Relatively small $S$, $A$ spaces only
Q-network:
Big and continuous $S$ possible
No need to visit all states during training, because NNs are great interpolators
Discrete and relatively small $A$ only
Training may be unstable, and it is harder to verify whether we have reached convergence
Two NNs:
Actor: learns the policy, i.e. maps a state to an action
Critic: estimates the Q-value of a given state-action pair
N.B.: the two networks are trained simultaneously
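As a rough sketch of what these two networks look like for continuous actions (DDPG-style): the actor maps a state to a continuous action, while the critic maps a (state, action) pair to a scalar Q-value. The dimensions, layer sizes and activations below are assumptions for illustration, not the internals of ClassicalDDPG:
# Illustrative DDPG-style actor/critic pair (NOT the ClassicalDDPG internals;
# dimensions and layer sizes are assumptions for this sketch).
n_states, n_actions = 10, 10

# Actor: state -> continuous action (here squashed to [-1, 1] by tanh)
actor = tf.keras.Sequential([
    tf.keras.Input(shape=(n_states,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(n_actions, activation='tanh'),
])

# Critic: (state, action) -> scalar Q-value
state_in = tf.keras.Input(shape=(n_states,))
action_in = tf.keras.Input(shape=(n_actions,))
x = tf.keras.layers.Concatenate()([state_in, action_in])
x = tf.keras.layers.Dense(64, activation='relu')(x)
q_value = tf.keras.layers.Dense(1)(x)
critic = tf.keras.Model(inputs=[state_in, action_in], outputs=q_value)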
image by AWAKE Collaboration
Exercise 7
Let's try to train an actor-critic agent on the AWAKE environment! We are using the DDPG (Deep Deterministic Policy Gradient) algorithm. It is one of the most basic actor-critic algorithms and hence also not the most stable one. Some improvements have been implemented in TD3.
7 a) Run the following cell to initialize the AWAKE simulation environment env and a DDPG instance agent. Then reset the environment to misteer the beam, and plot the trajectory. The plot shows the beam position at the 10 beam position monitors (BPMs) installed along the electron beam line.
# Exercise 7 a)
tf.keras.utils.set_random_seed(12345)
env = e_trajectory()
agent = ClassicalDDPG(state_space=env.observation_space, action_space=env.action_space)
env.reset(init_outside_threshold=True)
env.plot_trajectory()
7 b) Run the next cell to make a correction to the beam position. Run the cell multiple times and check how the trajectories before and after correction compare. Do you think the RL agent is doing a good job? Why or why not?
# Exercise 7 b)
run_correction(env, agent)
# Exercise 7 b)
run_correction(env, agent)
7 c) Run the next cell to train the RL agent. Can you interpret the output plots showing the evolution of the agent's training? Is the length of the training appropriate, or should we train with fewer / more steps?
Hints: the output figure shows two panels. The top one displays the length of each episode over the entire training period (an episode is terminated either when the objective is reached or when the agent has not solved the task after 30 steps). The bottom one shows the rewards (negative trajectory RMS) at the beginning and at the end of each episode. A large negative reward means that the trajectory is badly steered; a reward close to zero, on the other hand, corresponds to a well-corrected beam trajectory.
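As a reminder of what these reward values mean, here is a minimal sketch of a negative-RMS reward. It only illustrates the idea described in the hint above; the actual e_trajectory implementation may additionally scale the values or convert units.
# Illustrative reward: negative RMS of the BPM readings (the actual
# e_trajectory environment may apply additional scaling / unit conversion).
def trajectory_reward(bpm_readings):
    return -float(np.sqrt(np.mean(np.square(bpm_readings))))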
# Exercise 7 c)
training_log = trainer(env=env, agent=agent, n_steps=500)
EPISODE: 0, INITIAL REWARD: -134.26, FINAL REWARD: -73.393, #STEPS: 1.
EPISODE: 50, INITIAL REWARD: -96.326, FINAL REWARD: -49.14, #STEPS: 1.
EPISODE: 100, INITIAL REWARD: -116.591, FINAL REWARD: -38.265, #STEPS: 1.
EPISODE: 150, INITIAL REWARD: -323.567, FINAL REWARD: -75.531, #STEPS: 1.
EPISODE: 200, INITIAL REWARD: -27.811, FINAL REWARD: -60.445, #STEPS: 1.
EPISODE: 250, INITIAL REWARD: -48.02, FINAL REWARD: -57.575, #STEPS: 1.
plot_training_log(env, agent, training_log)
# In case you face any issues with the training of the agent,
# avoid executing the previous cell and use instead the
# following line to load pre-trained weights.
# agent.load_actor_critic_weights('saved_agents/ddpg_ex7')
7 d) Check a few trajectories before and after correction now using the trained agent (run the cell multiple times). How does the agent perform now?
Note that we are using the DDPG algorithm here, which is one of the most basic actor-critic RL algorithms. This is also the reason why the trajectory correction will not always be perfect. There are much-improved versions that train in a shorter time and with better performance (e.g. Twin Delayed DDPG, aka TD3).
# Exercise 7 d)
run_correction(env, agent)