Interact with and run this Jupyter notebook online:
$\implies$ make sure you have installed all the required Python packages (see the README)!
Finally, you can also find this lecture rendered as HTML slides on GitHub $\nearrow$, along with the source repository $\nearrow$.
Imports and modules:
from config import (np, plt)
from scipy.constants import m_p, e, c
%matplotlib inline
image by GeekStyle
DeepMind, 2015 & 2017: AlphaGo & AlphaZero
OpenAI, 2019: hide-and-seek
OpenAI, 2022: ChatGPT
DeepMind & EPFL, 2022: tokamak control
DeepMind, 2022: AlphaTensor
There are various RL algorithms suitable for different types of tasks
Often the choice of algorithm depends on whether we deal with discrete or continuous state-action spaces
We will go through:
from qlearning.core import Maze
env = Maze(height=3, width=5)
env.plot(title='Initial state');
# Take some actions
env.plot(title='Initial state')
env.step(action='up')
env.plot()
env.step(action='right')
env.plot();
Exercise 1
Using the reward definitions from the previous slide, try to calculate the cumulative rewards for the trajectories shown below. Can you tell which of the paths are equally good / bad?
image by D. Silver - Lecture on RL
$S = \{\text{Class 1, Class 2, Class 3, Facebook, Pub, Pass, Sleep}\}$
Note that "Sleep" is also called a terminal state, because once in it we will never leave it.
image by D. Silver - Lecture on RL
$\Rightarrow$ Return: $G_0 = (-2) + 0.5 \cdot (-2) + 0.5^2 \cdot (-2) + 0.5^3 \cdot (+10) = -2.25$ (with $\gamma = 0.5$)
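We can quickly verify this return numerically; here is a minimal sketch, with the rewards and discount factor taken from the sample episode above:
# Sanity check: discounted return of the sample episode
# (rewards collected in Class 1, Class 2, Class 3, Pass)
rewards = [-2, -2, -2, 10]
gamma = 0.5
G_0 = sum(gamma**k * r for k, r in enumerate(rewards))
print(G_0)  # -2.25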
image by D. Silver - Lecture on RL
N.B.: stochastic state transitions are still allowed (if we decide to go to the Pub, anything can happen).
Today we will work with fully deterministic MDPs only.
Episodic MDP
Continuing MDP
Exercise 2
Let's get back to the maze! For now we do not care about optimal decisions. Instead, try to implement a random policy, i.e. every action $a \in \{\text{'up', 'down', 'left', 'right'}\}$ is picked with equal probability no matter what state the agent is in.
a) Initialize a Maze with height=3, width=2 and complete the all_actions list.
b) Look at every step of the output: are the movement of the agent (x) and the rewards obtained consistent with your expectations?
c) Change the random number seed, rerun and observe.
from qlearning.core import Maze
np.random.seed(123457)
env = Maze(height=3, width=2) # FILL HERE)
env.plot(title='Initial state')
all_actions = ['up', 'left', 'down', 'right'] # ... FILL HERE]
done = False
while not done:
    action = np.random.choice(all_actions)
    state, action, reward, new_state, done = env.step(action)
    env.plot();
There are many different algorithms for finding the optimal policy $\pi^*$
They all have their pros and cons
Today: we are going to look at Q-learning. It is one of the core ideas of many RL algorithms, such as
image by Open AI - Spinning Up
image adapted from S. Levine, "Deep Reinforcement Learning" (lecture)
to solve the RL problem
where $\alpha$ is the learning rate
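To make the update rule concrete, here is a minimal sketch of a single tabular Q-learning (TD) update; the function and variable names are illustrative and not part of the QLearner API:
# One temporal-difference update of the Q-table:
# Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
def td_update(Q, s, a, r, s_next, all_actions, alpha=0.1, gamma=0.9):
    td_target = r + gamma * max(Q[(s_next, a_next)] for a_next in all_actions)
    td_error = td_target - Q[(s, a)]
    Q[(s, a)] = Q[(s, a)] + alpha * td_error
    return Q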
image by AssemblyAI
image by Berkeley AI course
from qlearning.plot_utils import print_qtable
from qlearning.core import Maze, QLearner
np.random.seed(0)
# Initialize small maze environment
env = Maze(width=2, height=2, fire_positions=[[1, 0]])
_ = env.plot(add_player_position=False)
# Initialize Q-learner with Q-table
qtable_learner = QLearner(env, q_function='table')
print('Initial Q-table')
q_table = qtable_learner.q_func.get_q_table()
print_qtable(q_table)
Initial Q-table
+--------+-----+------+------+-------+
| s \ a  | up  | down | left | right |
+--------+-----+------+------+-------+
| (0, 0) | 0.0 | 0.0  | 0.0  | 0.0   |
| (0, 1) | 0.0 | 0.0  | 0.0  | 0.0   |
| (1, 0) | 0.0 | 0.0  | 0.0  | 0.0   |
| (1, 1) | 0.0 | 0.0  | 0.0  | 0.0   |
+--------+-----+------+------+-------+
qtable_learner.train(200)
print('Q-table after 200 episodes')
q_table = qtable_learner.q_func.get_q_table()
print_qtable(q_table)
Q-table after 200 episodes
+--------+------+------+------+-------+
| s \ a  |  up  | down | left | right |
+--------+------+------+------+-------+
| (0, 0) | 19.9 | 7.8  | 7.9  | 7.1   |
| (0, 1) | 14.8 | 11.0 | 14.1 | 26.6  |
| (1, 0) | 22.0 | 5.0  | 8.7  | 4.4   |
| (1, 1) | 0.0  | 0.0  | 0.0  | 0.0   |
+--------+------+------+------+-------+
qtable_learner.train(300)
print('Q-table after 500 episodes')
q_table = qtable_learner.q_func.get_q_table()
print_qtable(q_table)
Q-table after 500 episodes
+--------+------+------+------+-------+
| s \ a  |  up  | down | left | right |
+--------+------+------+------+-------+
| (0, 0) | 28.3 | 22.0 | 22.0 | 17.4  |
| (0, 1) | 23.8 | 25.8 | 24.1 | 29.9  |
| (1, 0) | 28.8 | 15.9 | 23.8 | 15.7  |
| (1, 1) | 0.0  | 0.0  | 0.0  | 0.0   |
+--------+------+------+------+-------+
qtable_learner.plot_training_evolution()
Exercise 3
a) Based on the evolution of the Q-values shown on the previous slide, would you consider the training to be complete after 500 episodes?
b) Play with the number of episodes in the cell below until you find convergence.
c) Observe that some of the Q-table values converge earlier than others during training. Why could that be?
from qlearning.plot_utils import print_qtable
from qlearning.core import Maze, QLearner
np.random.seed(0)
env = Maze(width=2, height=2, fire_positions=[[1, 0]])
qtable_learner = QLearner(env, q_function='table')
qtable_learner.train(1500) # FILL HERE)
qtable_learner.plot_training_evolution()
Exercise 4
a) Initialize a bigger maze with width=4, height=3, and fire_positions=[[2, 1], [2, 2]], and use q_function='table' in the QLearner class. Then train it for 5000 episodes.
b) Once the training is finished, plot the Q-values (you can just execute the cell, it is already complete).
Next to each little arrow there is a number that denotes the Q-value of the corresponding action on that field. The red arrow indicates the action with the highest Q-value.
c) Finally, also plot the (greedy) policy by executing the third cell. Compare it to the Q-value plot to verify that we indeed always pick the action with the highest Q-value. Are there fields where two actions would be equally good (which ones)? Can you confirm that by looking at the Q-values?
# Exercise 4 a)
from qlearning.core import Maze, QLearner
np.random.seed(123456)
env = Maze(width=4, # FILL HERE,
height=3, # FILL HERE,
fire_positions=[[2, 1], [2, 2]]) # FILL HERE)
qtable_learner = QLearner(env, q_function='table') # FILL HERE)
qtable_learner.train(5000) # FILL HERE)
# In case the training does not work for you for some reason
# you can reload the qtable from a trained agent from file.
# Don't forget to initialize the env as suggested in the
# exercise ...
# Note that the q evolution history of training is not saved
# and will hence not be displayed when reloading from file.
# qtable_learner.q_func.load_q_table('saved_agents/qtable_ex4.json')
# Exercise 4 b)
from qlearning.plot_utils import plot_q_table
q_table = qtable_learner.q_func.get_q_table()
ax = env.plot(add_player_position=False, title=False)
plot_q_table(q_table, env.target_position, env.fire_positions, ax=ax)
# Exercise 4 c)
from qlearning.plot_utils import plot_greedy_policy
policy = qtable_learner.q_func.get_greedy_policy()
ax = env.plot(add_player_position=False, title=False)
plot_greedy_policy(policy, env.target_position, env.fire_positions, ax=ax)
Exercise 5 (optional)
a) Using the same maze as above, reduce the punishment for going through fire by setting fire_reward=-2 (instead of -10) in the environment definition.
b) Retrain the agent. How does the policy change compared to Ex. 4? Can you explain why?
from qlearning.core import Maze, QLearner
from qlearning.plot_utils import plot_q_table, plot_greedy_policy
np.random.seed(123456)
# Env definition
env = Maze(width=4, height=3, fire_positions=[[2, 1], [2, 2]], fire_reward=-2) # FILL HERE)
qtable_learner = QLearner(env, q_function='table')
qtable_learner.train(500)
# If you have issues with the training, please comment out the
# qtable_learner.train(...) call above and reload instead the
# Q-table by uncommenting the following line.
# qtable_learner.q_func.load_q_table('saved_agents/qtable_ex5.json')
# Show Q-values
q_table = qtable_learner.q_func.get_q_table()
ax = env.plot(add_player_position=False, title=False)
plot_q_table(q_table, env.target_position, env.fire_positions, ax=ax)
# Show policy
policy = qtable_learner.q_func.get_greedy_policy()
ax = env.plot(add_player_position=False, title=False)
plot_greedy_policy(policy, env.target_position, env.fire_positions, ax=ax)
Main idea: replace the Q-table by a simple, feed-forward neural network (Q-net)
Developed by DeepMind in 2013 to play Atari games (DQN paper)
A neural network (NN) is a universal function approximator, i.e. a fit model that can approximate any function (in theory)
The Q-net is a mapping from state to Q-values of all possible actions
Its parameters, aka weights, are adjusted according to the TD rule, just like the Q-table entries (see the sketch below)
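For illustration, a minimal Q-net for the maze could be built with tf.keras as follows; this is only a sketch with assumed input dimensions and layer sizes, not necessarily the architecture used inside QLearner:
import tensorflow as tf

n_state_dims = 2  # e.g. (x, y) position of the agent (assumption)
n_actions = 4     # up, down, left, right

# Feed-forward Q-net: state in, one Q-value per action out
q_net = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_state_dims,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(n_actions, activation='linear'),
])
q_values = q_net(tf.constant([[0., 1.]]))  # shape: (1, n_actions)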
Exercise 6
a) Repeat the same steps as in Ex. 4 for the Q-table learner, but this time using q_function='net' as an argument in the QLearner class. Train it for 1500 episodes. This will take a couple of minutes.
b) Compare the Q-values and policy to the one obtained with Q-table learning. Do you see differences? Why could that be?
from qlearning.core import Maze, QLearner
import tensorflow as tf
tf.keras.utils.set_random_seed(0)
env = Maze(width=4, height=3, fire_positions=[[2, 1], [2, 2]])
qnet_learner = QLearner(env, q_function='net') # FILL HERE)
qnet_learner.train(1500) # FILL HERE)
# Again, if you face any issues with model training, please use
# the saved q-net weights of a trained agent by uncommenting
# the following. You can comment the line for training in the
# previous cell.
# qnet_learner.q_func.load_model('saved_agents/qnet_ex6')
from qlearning.plot_utils import plot_q_table, plot_greedy_policy
q_table = qnet_learner.q_func.get_q_table()
ax = env.plot(add_player_position=False, title=False)
plot_q_table(q_table, env.target_position, env.fire_positions, ax=ax)
policy = qnet_learner.q_func.get_greedy_policy()
ax = env.plot(add_player_position=False, title=False)
plot_greedy_policy(policy, env.target_position, env.fire_positions, ax=ax)
Q-table
Easy to understand and validate
Discrete $S$, $A$ spaces only
Relatively small $S$, $A$ spaces only
DQN
Big and continuous $S$ possible
No need to visit all states during training, because NNs are great interpolators
Discrete and relatively small $A$
Training may be unstable, and it is harder to verify whether convergence has been reached
Many real-world problems require continuous $S$ and continuous $A$ $\,\,\Rightarrow\,\,$ actor-critic methods
Two NNs
Actor
Critic
N.B.: the two networks are trained simultaneously (see the sketch below)
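Schematically, the two networks could be set up like this; a minimal tf.keras sketch with assumed dimensions and layer sizes, not the ClassicalDDPG implementation used below:
import tensorflow as tf

n_states, n_actions = 10, 10  # continuous state and action dimensions (assumption)

# Actor: state -> action (deterministic policy), outputs scaled to [-1, 1]
actor = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_states,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(n_actions, activation='tanh'),
])

# Critic: (state, action) -> scalar Q-value
state_in = tf.keras.layers.Input(shape=(n_states,))
action_in = tf.keras.layers.Input(shape=(n_actions,))
x = tf.keras.layers.Concatenate()([state_in, action_in])
x = tf.keras.layers.Dense(64, activation='relu')(x)
q_value = tf.keras.layers.Dense(1)(x)
critic = tf.keras.Model(inputs=[state_in, action_in], outputs=q_value)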
image by AWAKE Collaboration
Goal: given measured beam positions (= continuous state), find best dipole corrector settings (= continuous actions) to keep beam close to the center of vacuum pipe
State: 10-d array of beam positions measured along the line
Action: 10-d array of dipole corrector strengths along the line
Reward: negative RMS of the beam offsets w.r.t. the center (see the sketch below)
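For instance, the reward for a set of measured beam positions could be computed as follows; this is a sketch only, and the actual e_trajectory environment may scale or clip the value differently:
import numpy as np

def trajectory_reward(bpm_positions):
    # Negative RMS of the beam offsets w.r.t. the pipe center (at 0)
    return -np.sqrt(np.mean(np.asarray(bpm_positions)**2))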
Exercise 7
Let's try to train an actor-critic agent on the AWAKE environment! We are using the DDPG (Deep Deterministic Policy Gradient) algorithm. It is one of the most basic actor-critic algorithms and hence also not the most stable one. Some improvements have been implemented in TD3.
7 a) Run the following cell to initialize the AWAKE simulation environment env and a DDPG instance agent. Then reset the environment to mis-steer the beam, and plot the trajectory. The plot shows the beam position at the 10 BPMs (beam position monitors) installed along the electron beam line.
# Exercise 7 a)
from actor_critic.awake_env import e_trajectory
from actor_critic.core import ClassicalDDPG, trainer, plot_training_log, run_correction
import tensorflow as tf
tf.keras.utils.set_random_seed(12345)
env = e_trajectory()
agent = ClassicalDDPG(state_space=env.observation_space, action_space=env.action_space)
env.reset(init_outside_threshold=True)
env.plot_trajectory()
7 b) Run the next cell to make a correction to the beam position. Run the cell multiple times and check how the trajectories before and after correction compare. Do you think the RL agent is doing a good job? Why or why not?
# Exercise 7 b)
run_correction(env, agent)
7 c) Run the next cell to train the RL agent. Can you interpret the output plots showing evolution of agent training? Is the length of training appropriate or should we train with fewer / more steps?
Hints: the output figure shows two axes. The top graph displays the length of each episode over the entire training period (an episode is terminated either when the objective is reached or whenever the agent cannot solve the task after 30 steps). The bottom plot shows the rewards (negative trajectory rms) at the beginning and at the end of an episode. A high negative reward means that the trajectory is badly steered. A reward close to zero on the other hand corresponds to a well-corrected beam trajectory.
# Exercise 7 c)
training_log = trainer(env=env, agent=agent, n_steps=500)
EPISODE: 0, INITIAL REWARD: -131.397, FINAL REWARD: -109.155, #STEPS: 1.
EPISODE: 50, INITIAL REWARD: -308.231, FINAL REWARD: -74.137, #STEPS: 5.
EPISODE: 100, INITIAL REWARD: -214.77, FINAL REWARD: -47.199, #STEPS: 1.
EPISODE: 150, INITIAL REWARD: -349.32, FINAL REWARD: -8.142, #STEPS: 1.
EPISODE: 200, INITIAL REWARD: -117.818, FINAL REWARD: -60.099, #STEPS: 1.
EPISODE: 250, INITIAL REWARD: -52.63, FINAL REWARD: -5.303, #STEPS: 1.
plot_training_log(env, agent, training_log)
# In case you face any issues with the training of the agent,
# avoid executing the previous cell and use instead the
# following line to load pre-trained weights.
# agent.load_actor_critic_weights('saved_agents/ddpg_ex7')
7 d) Check a few trajectories before and after correction now using the trained agent (run the cell multiple times). How does the agent perform now?
Note that we are using the DDPG algorithm here, which is one of the most basic actor-critic RL algorithms. This is also why the trajectory correction will not always be perfect. There are much-improved versions that train in a shorter time and with better performance (e.g. Twin Delayed DDPG, aka TD3).
# Exercise 7 d)
run_correction(env, agent)
Reinforcement learning (RL) is concerned with solving decision-making problems and optimizing for best behavior in an environment
Different algorithms exist with Q-learning being one of the fundamental ones
Q-learning uses a state-action value function $Q(s,a)$ that estimates the expected return
$Q(s, a)$ is iteratively learned following the temporal difference rule
Once converged, we can read off the optimal policy by acting greedily with respect to $Q$
Q-learning
Actor-critic methods are built on top of Q-learning and use two networks: