Reinforcement Learning Hvass Tutorial Q-Learning
Reinforcement Learning: Q-Learning (Off-policy Function Approximation)
Introduction
- Atari games
- The agent learns how to play the game purely from trial and error
- The input is the screen output of the game and whether the previous action resulted in a reward or penalty
- Based on DeepMind's paper "Playing Atari with Deep Reinforcement Learning"
- And the paper "Human-level control through deep reinforcement learning"
- The basic idea is to have the agent estimate so-called Q-values from the image
- The Q-values tell the agent which action is most likely to lead to the highest cumulative reward in the future
- The Q-values are found and stored for later retrieval using a function approximator
The Problem
- You are controlling the paddle at the bottom
- The goal is to maximize the score by smashing the bricks in the wall
- You must avoid losing a life by letting the ball pass the paddle
- Training estimates the Q-values for the game-states, as described in the process below
Q-Learning
- Q-Learning (Off-policy TD Control) in Sutton's book
- The Q-values indicate which action is expected to result in the highest future reward
- We have to estimate the Q-values, since they are not known in advance
- The Q-values are initialized to zero and updated repeatedly as new information is collected from the agent
- Q-value for state and action = reward + discount * max Q-value for next state (written as a formula below)
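Written as a formula (the examples below use a discount factor of 0.97), the update above is:

$$Q(s_t, a_t) \leftarrow r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a)$$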
Simple Example
- The images below demonstrate how Q-values are updated in a backwards sweep
- The agent gets a reward of +1 in the rightmost image
- This reward is then propagated backwards to the previous game-states
- The discounting is an exponentially decreasing function
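Concretely, a reward $r$ obtained $n$ steps in the future contributes $\gamma^{n} \cdot r$ to the Q-value of the current state (with $\gamma = 0.97$ in these examples), which is why the propagated values shrink the further back they are pushed.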
Detailed Example
- The Q-value for the action NOOP in state t is estimated to be 2.9, which is the highest Q-value for that state
- So the agent doesn't do anything between state t and t+1
- In state t+1, the agent scores 4 points, but this is limited to 1 point in this implementation so as to stabilize the training
- The maximum Q-value for state t+1 is 1.83
- We update the Q-value to incorporate this new information
- The new Q-value is 2.775, which is slightly lower than the previous estimate of 2.9 (see the short calculation below)
- The idea is to have the agent play many, many games and keep updating the Q-value estimates as rewards and penalties are observed
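The calculation can be checked with a few lines of Python (the numbers 1.0, 0.97 and 1.83 are taken from the example above):

reward = 1.0          # the 4 points scored in state t+1, clipped to 1
discount = 0.97       # discount factor used in these examples
max_q_next = 1.83     # maximum Q-value estimated for state t+1

new_q = reward + discount * max_q_next
print(new_q)          # 2.7751, i.e. roughly 2.775, slightly lower than the old 2.9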
Motion Trace
- We cannot know which direction the ball is moving if we just use a single image
- The typical solution is to use multiple consecutive images to represent the state of the game-environment
- We use another approach
- The left image is from the game-environment and the right image is the processed image
- The right image shows traces of recent movements
- We can see that the ball is going downwards and has bounced off the right wall, and that the paddle has moved from the left to the right (a small sketch of such a motion-trace update is shown below)
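As an illustrative sketch (not the tutorial's exact code), a decayed motion-trace can be computed like this, using the decay parameter and np.where threshold that appear in the MotionTracer class later in these notes; the function name is ours:

import numpy as np

def update_motion_trace(img, last_img, last_trace, decay=0.75, threshold=20):
    # Pixels that changed noticeably between the two frames become white (255),
    # everything else becomes black (0).
    img_dif = img.astype(np.float32) - last_img.astype(np.float32)
    img_motion = np.where(np.abs(img_dif) > threshold, 255.0, 0.0)

    # Add a decayed copy of the previous trace so older motion leaves a fading tail.
    return np.clip(img_motion + decay * last_trace, 0.0, 255.0)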
Training Stability
- Consider the 3 images below, which show the game-environment in 3 consecutive states
- In state t+1 the agent scores +1, so the target Q-value for state t should be 0.97 (the discounted reward)
- For state t+2, the Neural Network will also estimate a Q-value near 1.0
- This is because the images are so similar
- But this is clearly wrong, because the Q-value for state t+2 should be zero, as we don't know anything about future rewards at this point
- For this reason, we use a so-called Replay Memory, so we can gather a large number of game-states and shuffle them when training the Neural Network (see the small sketch below)
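A minimal sketch of the idea (illustrative only; the tutorial's actual sampling is in ReplayMemory.random_batch(), shown later):

import numpy as np

# Instead of training on consecutive, highly correlated states,
# sample random indices from the filled part of the replay-memory.
num_used = 10000      # number of states currently stored (example value)
batch_size = 128
idx = np.random.choice(num_used, size=batch_size, replace=False)
# states[idx] and q_values[idx] would then form one shuffled training batch.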
FlowChart
- This flowchart has two main loops
- The first loop is for playing the game and recording data
- NN estimates the Q-values and stores the game-state in the Replay Memory
- The second loop is activated when the Replay Memory is sufficiently full
- First it performs a full backwards sweep through the Replay Memory to update the Q-values
- Then it optimizes the Neural Network on random batches from the Replay Memory
Neural Network Architecture
- The NN has 3 convolutional layers, all of which have filter-size 3x3
- The layers have 16,32, and 64 output channels
- The stride is 2 in the first two convolutional layers and 1 in the last one
- Following the 3 convolutional layers there are 4 fully-connected layers, each with 1024 units and ReLU activation, followed by a linear output layer with one Q-value per action (a rough sketch of this architecture follows below)
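A rough tf.keras sketch of this architecture (illustrative only; the tutorial itself builds the network with PrettyTensor, as shown in the NeuralNetwork class below):

import tensorflow as tf

def build_q_network(num_actions, state_shape=(105, 80, 2)):
    # 3 conv layers (3x3 filters; 16/32/64 channels; strides 2, 2, 1),
    # then 4 fully-connected layers of 1024 units with ReLU,
    # and a linear output layer with one Q-value per action.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, kernel_size=3, strides=2, activation='relu',
                               input_shape=state_shape),
        tf.keras.layers.Conv2D(32, kernel_size=3, strides=2, activation='relu'),
        tf.keras.layers.Conv2D(64, kernel_size=3, strides=1, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1024, activation='relu'),
        tf.keras.layers.Dense(1024, activation='relu'),
        tf.keras.layers.Dense(1024, activation='relu'),
        tf.keras.layers.Dense(1024, activation='relu'),
        tf.keras.layers.Dense(num_actions),   # linear Q-value outputs
    ])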
Code Analysis
Game Environment
env_name = 'Breakout-v0'
Download Pre-Trained Model
- You can download a Tensorflow checkpoint which holds all the pre-trained variables for the NN
- Training took roughly 150 hours on a 2.6 GHz CPU and a GTX 1070 GPU
- The TensorFlow checkpoint cannot be used with newer versions of gym and atari-py
Hyper-parameters
# Description of this program.
desc = "Reinforcement Learning (Q-learning) for Atari Games using TensorFlow."
# Create the argument parser.
parser = argparse.ArgumentParser(description=desc)
# Add arguments to the parser.
parser.add_argument("--env", required=False, default='Breakout-v0',
help="name of the game-environment in OpenAI Gym")
parser.add_argument("--training", required=False,
dest='training', action='store_true',
help="train the agent (otherwise test the agent)")
parser.add_argument("--render", required=False,
dest='render', action='store_true',
help="render game-output to screen")
parser.add_argument("--episodes", required=False, type=int, default=None,
help="number of episodes to run")
parser.add_argument("--dir", required=False, default=checkpoint_base_dir,
help="directory for the checkpoint and log-files")
# Parse the command-line arguments.
args = parser.parse_args()
# Get the arguments.
env_name = args.env
training = args.training
render = args.render
num_episodes = args.episodes
checkpoint_base_dir = args.dir
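The parsed arguments can then be used to start training from the command line, e.g. (the script filename here is only a placeholder):

python reinforcement_learning.py --env 'Breakout-v0' --training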
Create Agent
- The Agent class implements playing the game, recording data, and optimizing the Neural Network
- training=True means a Replay Memory is created to record states and Q-values
agent = rl.Agent(env_name=env_name, training=True, render=True, use_logging=False)
model = agent.model
replay_memory = agent.replay_memory
class Agent:
"""
This implements the function for running the game-environment with
an agent that uses Reinforcement Learning. This class also creates
instances of the Replay Memory and Neural Network.
"""
def __init__(self, env_name, training, render=False, use_logging=True):
"""
Create an object-instance. This also creates a new object for the
Replay Memory and the Neural Network.
Replay Memory will only be allocated if training==True.
:param env_name:
Name of the game-environment in OpenAI Gym.
Examples: 'Breakout-v0' and 'SpaceInvaders-v0'
:param training:
Boolean whether to train the agent and Neural Network (True),
or test the agent by playing a number of episodes of the game (False).
:param render:
Boolean whether to render the game-images to screen during testing.
:param use_logging:
Boolean whether to use logging to text-files during training.
"""
# Create the game-environment using OpenAI Gym.
self.env = gym.make(env_name)
# The number of possible actions that the agent may take in every step.
self.num_actions = self.env.action_space.n
- The code above:
- creates the game-environment with gym.make(env_name)
- gets the number of possible actions from the gym environment
# List of string-names for the actions in the game-environment.
self.action_names = self.env.unwrapped.get_action_meanings()
# Epsilon-greedy policy for selecting an action from the Q-values.
# During training the epsilon is decreased linearly over the given
# number of iterations. During testing the fixed epsilon is used.
self.epsilon_greedy = EpsilonGreedy(start_value=1.0,
end_value=0.1,
num_iterations=1e6,
num_actions=self.num_actions,
epsilon_testing=0.01)
- The code above:
- creates the epsilon-greedy policy for selecting an action from the Q-values
- with probability epsilon a random action is chosen
- otherwise the action with the highest Q-value is chosen (argmax)
# With probability epsilon.
if np.random.random() < epsilon:
    # Select a random action.
    action = np.random.randint(low=0, high=self.num_actions)
else:
    # Otherwise select the action that has the highest Q-value.
    action = np.argmax(q_values)
if self.training:
# The following control-signals are only used during training.
# The learning-rate for the optimizer decreases linearly.
self.learning_rate_control = LinearControlSignal(start_value=1e-3,
end_value=1e-5,
num_iterations=5e6)
# The loss-limit is used to abort the optimization whenever the
# mean batch-loss falls below this limit.
self.loss_limit_control = LinearControlSignal(start_value=0.1,
end_value=0.015,
num_iterations=5e6)
# The maximum number of epochs to perform during optimization.
# This is increased from 5 to 10 epochs, because it was found for
# the Breakout-game that too many epochs could be harmful early
# in the training, as it might cause over-fitting.
# Later in the training we would occasionally get rare events
# and would therefore have to optimize for more iterations
# because the learning-rate had been decreased.
self.max_epochs_control = LinearControlSignal(start_value=5.0,
end_value=10.0,
num_iterations=5e6)
# The fraction of the replay-memory to be used.
# Early in the training, we want to optimize more frequently
# so the Neural Network is trained faster and the Q-values
# are learned and updated more often. Later in the training,
# we need more samples in the replay-memory to have sufficient
# diversity, otherwise the Neural Network will over-fit.
self.replay_fraction = LinearControlSignal(start_value=0.1,
end_value=1.0,
num_iterations=5e6)
# We only create the replay-memory when we are training the agent,
# because it requires a lot of RAM. The image-frames from the
# game-environment are resized to 105 x 80 pixels gray-scale,
# and each state has 2 channels (one for the recent image-frame
# of the game-environment, and one for the motion-trace).
# Each pixel is 1 byte, so this replay-memory needs more than
# 3 GB RAM (105 x 80 x 2 x 200000 bytes).
# self.replay_memory = ReplayMemory(size=200000,
self.replay_memory = ReplayMemory(size=50000,
num_actions=self.num_actions)
- The code above sets the training parameters (see the sketch of a linear control signal after this list):
- self.learning_rate_control : decreases from 1e-3 to 1e-5
- self.loss_limit_control : decreases from 0.1 to 0.015
- self.max_epochs_control : increases from 5.0 to 10.0
- self.replay_fraction : increases from 0.1 to 1.0
- self.replay_memory : sized for up to 200000 states (more than 3 GB of RAM), reduced to 50000 here
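The LinearControlSignal class itself is not reproduced in these notes; as an assumption about its behaviour, based only on how it is used above (interpolating from start_value to end_value over num_iterations), a minimal sketch could look like this:

import numpy as np

class LinearControlSignalSketch:
    # Hypothetical sketch, not the tutorial's actual class.
    def __init__(self, start_value, end_value, num_iterations):
        self.start_value = start_value
        self.end_value = end_value
        self.num_iterations = num_iterations

    def get_value(self, iteration):
        # Interpolate linearly and clamp once num_iterations is reached.
        progress = np.clip(iteration / self.num_iterations, 0.0, 1.0)
        return self.start_value + progress * (self.end_value - self.start_value)

# Example: the learning-rate decreasing from 1e-3 to 1e-5 over 5e6 iterations.
lr = LinearControlSignalSketch(start_value=1e-3, end_value=1e-5, num_iterations=5e6)
print(lr.get_value(iteration=2.5e6))   # about 5.05e-4, half-way between the two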
# Create the Neural Network used for estimating Q-values.
self.model = NeuralNetwork(num_actions=self.num_actions,
replay_memory=self.replay_memory)
# Log of the rewards obtained in each episode during calls to run()
self.episode_rewards = []
- The code above creates the Neural Network used for estimating Q-values
- The relevant parts of the NeuralNetwork class:
class NeuralNetwork:
    # Placeholder variable for inputting states into the Neural Network.
    # A state is a multi-dimensional array holding image-frames from
    # the game-environment.
    self.x = tf.placeholder(dtype=tf.float32, shape=[None] + state_shape, name='x')

    # Initial weights.
    init = tf.truncated_normal_initializer(mean=0.0, stddev=2e-2)

    import prettytensor as pt

    # Wrap the input to the Neural Network in a PrettyTensor object.
    x_pretty = pt.wrap(self.x)

    # Create the convolutional Neural Network using Pretty Tensor.
    with pt.defaults_scope(activation_fn=tf.nn.relu):
        self.q_values = x_pretty. \
            conv2d(kernel=3, depth=16, stride=2, name='layer_conv1', weights=init). \
            conv2d(kernel=3, depth=32, stride=2, name='layer_conv2', weights=init). \
            conv2d(kernel=3, depth=64, stride=1, name='layer_conv3', weights=init). \
            flatten(). \
            fully_connected(size=1024, name='layer_fc1', weights=init). \
            fully_connected(size=1024, name='layer_fc2', weights=init). \
            fully_connected(size=1024, name='layer_fc3', weights=init). \
            fully_connected(size=1024, name='layer_fc4', weights=init). \
            fully_connected(size=num_actions, name='layer_fc_out', weights=init,
                            activation_fn=None)

    # Loss-function which must be optimized. This is the mean-squared
    # error between the Q-values that are output by the Neural Network
    # and the target Q-values.
    self.loss = self.q_values.l2_regression(target=self.q_values_new)

    self.optimizer = tf.train.RMSPropOptimizer(learning_rate=self.learning_rate).minimize(self.loss)

    # Used for saving and loading checkpoints.
    self.saver = tf.train.Saver()
Training
- The agent's run() method is used to play the game
agent.run(num_episodes=2)
- run() function
def run(self, num_episodes=None):
"""
Run the game-environment and use the Neural Network to decide
which actions to take in each step through Q-value estimates.
:param num_episodes:
Number of episodes to process in the game-environment.
If None then continue forever. This is useful during training
where you might want to stop the training using Ctrl-C instead.
"""
...
# Counter for the number of states we have processed.
# This is stored in the TensorFlow graph so it can be
# saved and reloaded along with the checkpoint.
count_states = self.model.get_count_states()
- The state counter is stored as a TensorFlow variable so it is saved and reloaded with the checkpoint:
class NeuralNetwork:
    self.count_states = tf.Variable(initial_value=0, trainable=False,
                                    dtype=tf.int64, name='count_states')
...
while count_episodes <= num_episodes:
if end_episode:
# Reset the game-environment and get the first image-frame.
img = self.env.reset()
# Create a new motion-tracer for processing images from the
# game-environment. Initialize with the first image-frame.
# This resets the motion-tracer so the trace starts again.
# This could also be done if end_life==True.
motion_tracer = MotionTracer(img)
...
- The code above:
- end_episode is True in the first iteration, which causes a reset
- gets the first image from the gym environment and creates a MotionTracer object
- this resets the game-environment and the Motion Tracer
def _pre_process_image(image):
    """Pre-process a raw image from the game-environment."""

    # Convert image to gray-scale.
    img = _rgb_to_grayscale(image)

    # Resize to the desired size using SciPy for convenience.
    img = scipy.misc.imresize(img, size=state_img_size, interp='bicubic')

    return img


class MotionTracer:
    def __init__(self, image, decay=0.75):
        """
        :param image:
            First image from the game-environment,
            used for resetting the motion detector.

        :param decay:
            Parameter for how long the tail should be on the motion-trace.
            This is a float between 0.0 and 1.0 where higher values means
            the trace / tail is longer.
        """

        # Pre-process the image and save it for later use.
        # The input image may be 8-bit integers but internally
        # we need to use floating-point to avoid image-noise
        # caused by recurrent rounding-errors.
        img = _pre_process_image(image=image)
        self.last_input = img.astype(np.float)
# Get the state of the game-environment from the motion-tracer.
# The state has two images: (1) The last image-frame from the game
# and (2) a motion-trace that shows movement trajectories.
state = motion_tracer.get_state()
# Use the Neural Network to estimate the Q-values for the state.
# Note that the function assumes an array of states and returns
# a 2-dim array of Q-values, but we just have a single state here.
q_values = self.model.get_q_values(states=[state])[0]
- The state is built by stacking the last input image and the motion-trace with np.dstack:
class MotionTracer:
    def get_state(self):
        # Stack the last input and output images.
        state = np.dstack([self.last_input, self.last_output])
        state = state.astype(np.uint8)

        return state
# Determine the action that the agent must take in the game-environment.
# The epsilon is just used for printing further below.
action, epsilon = self.epsilon_greedy.get_action(q_values=q_values,
iteration=count_states,
training=self.training)
- The EpsilonGreedy class's get_action() method:
class EpsilonGreedy:
    def get_action(self, q_values, iteration, training):
        epsilon = self.get_epsilon(iteration=iteration, training=training)

        # With probability epsilon.
        if np.random.random() < epsilon:
            # Select a random action.
            action = np.random.randint(low=0, high=self.num_actions)
        else:
            # Otherwise select the action that has the highest Q-value.
            action = np.argmax(q_values)
# Take a step in the game-environment using the given action.
# Note that in OpenAI Gym, the step-function actually repeats the
# action between 2 and 4 time-steps for Atari games, with the number
# chosen at random.
img, reward, end_episode, info = self.env.step(action=action)
# Process the image from the game-environment in the motion-tracer.
# This will first be used in the next iteration of the loop.
motion_tracer.process(image=img)
- Motion detection uses np.where to threshold the pixel difference:
class MotionTracer:
    def process(self, image):
        ...
        img_dif = img - self.last_input
        img_motion = np.where(np.abs(img_dif) > 20, 255.0, 0.0)
        ...
# Add the reward for the step to the reward for the entire episode.
reward_episode += reward
# Determine if a life was lost in this step.
num_lives_new = self.get_lives()
end_life = (num_lives_new < num_lives)
num_lives = num_lives_new
# Increase the counter for the number of states that have been processed.
count_states = self.model.increase_count_states()
...
# If we want to train the Neural Network to better estimate Q-values.
if self.training:
# Add the state of the game-environment to the replay-memory.
self.replay_memory.add(state=state,
q_values=q_values,
action=action,
reward=reward,
end_life=end_life,
end_episode=end_episode)
- The ReplayMemory class pre-allocates arrays for the states, Q-values, actions and rewards:
class ReplayMemory:
    # self.num_actions = self.env.action_space.n
    ...
    self.states = np.zeros(shape=[size] + state_shape, dtype=np.uint8)
    self.q_values = np.zeros(shape=[size, num_actions], dtype=np.float)
    self.actions = np.zeros(shape=size, dtype=np.int)
    self.rewards = np.zeros(shape=size, dtype=np.float)
    ...
# How much of the replay-memory should be used.
use_fraction = self.replay_fraction.get_value(iteration=count_states)
# When the replay-memory is sufficiently full.
if self.replay_memory.is_full() \
or self.replay_memory.used_fraction() > use_fraction:
# Update all Q-values in the replay-memory through a backwards-sweep.
self.replay_memory.update_all_q_values()
- The backwards sweep that updates all Q-values in the Replay Memory:
class ReplayMemory:
    ...
    def update_all_q_values(self):
        # Copy old Q-values so we can print their statistics later.
        # Note that the contents of the arrays are copied.
        self.q_values_old[:] = self.q_values[:]

        # num_used is the total number of stored states.
        for k in reversed(range(self.num_used - 1)):
            # Get the data for the k'th state in the replay-memory.
            action = self.actions[k]
            reward = self.rewards[k]
            end_life = self.end_life[k]
            end_episode = self.end_episode[k]

            # Calculate the Q-value for the action that was taken in this state.
            if end_life or end_episode:
                action_value = reward
            else:
                action_value = reward + self.discount_factor * np.max(self.q_values[k + 1])

            # Error of the Q-value that was estimated using the Neural Network.
            self.estimation_errors[k] = abs(action_value - self.q_values[k, action])

            # Update the Q-value with the better estimate.
            self.q_values[k, action] = action_value
    ...
...
# Get the control parameters for optimization of the Neural Network.
# These are changed linearly depending on the state-counter.
learning_rate = self.learning_rate_control.get_value(iteration=count_states)
loss_limit = self.loss_limit_control.get_value(iteration=count_states)
max_epochs = self.max_epochs_control.get_value(iteration=count_states)
# Perform an optimization run on the Neural Network so as to
# improve the estimates for the Q-values.
# This will sample random batches from the replay-memory.
self.model.optimize(learning_rate=learning_rate,
loss_limit=loss_limit,
max_epochs=max_epochs)
- The NeuralNetwork's optimize() method samples random batches from the Replay Memory:
class NeuralNetwork:
    def optimize(self, min_epochs=1.0, max_epochs=10,
                 batch_size=128, loss_limit=0.015,
                 learning_rate=1e-3):
        ...
        # Prepare the probability distribution for sampling the replay-memory.
        self.replay_memory.prepare_sampling_prob(batch_size=batch_size)

        # Number of optimization iterations corresponding to one epoch.
        iterations_per_epoch = self.replay_memory.num_used / batch_size

        # Minimum number of iterations to perform.
        min_iterations = int(iterations_per_epoch * min_epochs)

        # Maximum number of iterations to perform.
        max_iterations = int(iterations_per_epoch * max_epochs)

        # Buffer for storing the loss-values of the most recent batches.
        loss_history = np.zeros(100, dtype=float)

        for i in range(max_iterations):
            # Randomly sample a batch of states and target Q-values
            # from the replay-memory. These are the Q-values that we
            # want the Neural Network to be able to estimate.
            state_batch, q_values_batch = self.replay_memory.random_batch()
- random_batch() samples indices with both low and high estimation-errors; optimize() then continues with these batches:
class ReplayMemory:
    def random_batch(self):
        ...
        idx_lo = np.random.choice(self.idx_err_lo,
                                  size=self.num_samples_err_lo,
                                  replace=False)
        idx_hi = np.random.choice(self.idx_err_hi,
                                  size=self.num_samples_err_hi,
                                  replace=False)

        idx = np.concatenate((idx_lo, idx_hi))

        states_batch = self.states[idx]
        q_values_batch = self.q_values[idx]
# Create a feed-dict for inputting the data to the TensorFlow graph.
# Note that the learning-rate is also in this feed-dict.
feed_dict = {self.x: state_batch,
             self.q_values_new: q_values_batch,
             self.learning_rate: learning_rate}

# Perform one optimization step and get the loss-value.
loss_val, _ = self.session.run([self.loss, self.optimizer],
                               feed_dict=feed_dict)

# Shift the loss-history and assign the new value.
# This causes the loss-history to only hold the most recent values.
loss_history = np.roll(loss_history, 1)
loss_history[0] = loss_val

# Calculate the average loss for the previous batches.
loss_mean = np.mean(loss_history)

# Print status.
pct_epoch = i / iterations_per_epoch
msg = "\tIteration: {0} ({1:.2f} epoch), Batch loss: {2:.4f}, Mean loss: {3:.4f}"
msg = msg.format(i, pct_epoch, loss_val, loss_mean)
print_progress(msg)

# Stop the optimization if we have performed the required number
# of iterations and the loss-value is sufficiently low.
if i > min_iterations and loss_mean < loss_limit:
    break
- Back in the agent's training loop, a checkpoint is saved and the Replay Memory is reset:
# Save a checkpoint of the Neural Network so we can reload it.
self.model.save_checkpoint(count_states)
# Reset the replay-memory. This throws away all the data we have
# just gathered, so we will have to fill the replay-memory again.
self.replay_memory.reset()
if end_episode:
# Add the episode's reward to a list for calculating statistics.
self.episode_rewards.append(reward_episode)
# Mean reward of the last 30 episodes.
if len(self.episode_rewards) == 0:
# The list of rewards is empty.
reward_mean = 0.0
else:
reward_mean = np.mean(self.episode_rewards[-30:])
if self.training and end_episode:
# Log reward to file.
if self.use_logging:
self.log_reward.write(count_episodes=count_episodes,
count_states=count_states,
reward_episode=reward_episode,
reward_mean=reward_mean)
# Print reward to screen.
msg = "{0:4}:{1}\t Epsilon: {2:4.2f}\t Reward: {3:.1f}\t Episode Mean: {4:.1f}"
print(msg.format(count_episodes, count_states, epsilon,
reward_episode, reward_mean))
elif not self.training and (reward != 0.0 or end_life or end_episode):
# Print Q-values and reward to screen.
msg = "{0:4}:{1}\tQ-min: {2:5.3f}\tQ-max: {3:5.3f}\tLives: {4}\tReward: {5:.1f}\tEpisode Mean: {6:.1f}"
print(msg.format(count_episodes, count_states, np.min(q_values),
np.max(q_values), num_lives, reward_episode, reward_mean))
Reference sites
Tutorials
- 3-B. Layers API (Notebook)
- 3-C. Keras API (Notebook)
- Save & Restore (Notebook)
- Ensemble Learning (Notebook)
- CIFAR-10 (Notebook)
- Inception Model (Notebook)
- Transfer Learning (Notebook)
- Video Data (Notebook)
- Fine-Tuning (Notebook)
- Adversarial Examples (Notebook)
- Adversarial Noise for MNIST (Notebook)
- Visual Analysis (Notebook)
- 13-B. Visual Analysis for MNIST (Notebook)
- DeepDream (Notebook)
- Style Transfer (Notebook)
- Reinforcement Learning (Notebook)
- Estimator API (Notebook)
- TFRecords & Dataset API (Notebook)
- Hyper-Parameter Optimization (Notebook)
- Natural Language Processing (Notebook)
- Machine Translation (Notebook)
- Image Captioning (Notebook)
- Time-Series Prediction (Notebook)