Reinforcement Learning Hvass Tutorial Q-Learning
Reinforcement Learning: Q-Learning (Off-policy Function Approximation)
Introduction
- Atari games
- The agent learns how to play the game purely from trial and error
- Its input is the screen output of the game and whether the previous action resulted in a reward or penalty
- Based on DeepMind's papers "Playing Atari with Deep Reinforcement Learning" and "Human-level control through deep reinforcement learning"
- The basic idea is to have the agent estimate so-called Q-values from the game image
- The Q-values tell the agent which action is most likely to lead to the highest cumulative reward in the future
- The Q-values are found and stored for later retrieval using a function approximator
The Problem
- You are controlling the paddle at the bottom of the screen
- The goal is to maximize the score by smashing the bricks in the wall
- You must avoid losing a life by letting the ball pass the paddle
- The process below shows how the game-states and Q-values are used during training
 
Q-Learning
- Q-Learning (Off-policy TD Control), as described in Sutton's book
- The Q-values indicate which action is expected to result in the highest future reward
- We have to estimate the Q-values because they are not known in advance
- The Q-values are initialized to zero and updated repeatedly as new information is collected from the agent
- Q-value for state and action = reward + discount * max Q-value for next state
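- This update rule can be written as a tiny Python helper (an illustrative sketch, not the tutorial's API):

    # Sketch of the Q-learning update: the target value for the action taken.
    # q_values_next holds the estimated Q-values for all actions in the next state.
    def updated_q_value(reward, discount, q_values_next):
        return reward + discount * max(q_values_next)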
  
Simple Example
- The images below demonstrate how Q-values are updated backwards through the game-states
- The agent gets a reward of +1 in the right-most image
- This reward is then propagated backwards to the previous game-states
- The discounting makes the propagated value decrease exponentially the further it is from the reward (see the sketch just below)
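- A tiny sketch of this backwards propagation (illustrative code, not the tutorial's implementation), using the discount factor 0.97 that also appears in the detailed example below:

    import numpy as np

    # Reward of +1 only in the final (right-most) state; everything else is 0.
    discount = 0.97
    rewards = np.array([0.0, 0.0, 0.0, 0.0, 1.0])

    q = np.zeros_like(rewards)
    q[-1] = rewards[-1]
    for k in reversed(range(len(rewards) - 1)):
        q[k] = rewards[k] + discount * q[k + 1]

    print(q)  # [0.8853 0.9127 0.9409 0.97 1.0] -- the value decays exponentially backwards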
  
Detailed Example
 
- The Q-value for the action NOOP in state t is estimated to be 2.9, which is the highest Q-value for that state
- So the agent takes no action (NOOP) between state t and t+1
- In state t+1 the agent scores 4 points, but this is limited to 1 point in this implementation so as to stabilize the training
- The maximum Q-value for state t+1 is 1.83
- We update the Q-value for state t to incorporate this new information
  
- The new Q-value is 2.775, which is slightly lower than the previous estimate of 2.9 (the arithmetic is checked below)
- The idea is to have the agent play many, many games and keep updating the Q-value estimates as rewards and penalties are observed
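- Plugging the numbers from this example into the update rule (clipped reward 1.0, discount factor 0.97, max Q-value 1.83 for state t+1) reproduces the new estimate:

    reward = 1.0        # clipped reward received in state t+1
    discount = 0.97     # discount factor used in this tutorial
    max_q_next = 1.83   # highest Q-value estimated for state t+1

    new_q = reward + discount * max_q_next
    print(new_q)        # 2.7751, which matches the 2.775 shown above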
Motion Trace
- We can't tell which direction the ball is moving from a single image
- The typical solution is to use multiple consecutive images to represent the state of the game-environment
- We use another approach here
- The left image is from the game-environment and the right image is the processed image
- The right image shows traces of recent movements
- We can see that the ball is going downwards and has bounced off the right wall, and that the paddle has moved from the left to the right (a minimal sketch of how such a trace can be computed follows below)
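- A minimal sketch of such a motion trace (illustrative only; the tutorial's actual MotionTracer class is shown in the code analysis below):

    import numpy as np

    def update_motion_trace(trace, last_frame, new_frame, decay=0.75, threshold=20):
        """Sketch of a motion trace: recent movement plus a decayed tail of older movement.

        All arguments are 2-d gray-scale image arrays of floats.
        """
        # Pixels that changed noticeably between the two frames count as motion.
        motion = np.where(np.abs(new_frame - last_frame) > threshold, 255.0, 0.0)

        # Add a faded copy of the previous trace so older movement leaves a tail.
        trace = motion + decay * trace

        return np.clip(trace, 0.0, 255.0)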
  
Training Stability
- Consider the three images below, which show the game-environment in three consecutive states
- In state t+1 the agent scores +1, so the Q-value for state t should be about 0.97 (the discounted reward)
- For state t+2, the Neural Network will also estimate a Q-value near 1.0
- This is because the images are so similar
- But this is clearly wrong, because the Q-value for state t+2 should be zero: we don't know anything about future rewards at this point
- For this reason, we use a so-called Replay Memory so we can gather a large number of game-states and shuffle them while training the Neural Network (a minimal sketch of such sampling follows below)
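- A minimal sketch of the idea (not the tutorial's ReplayMemory class): draw random indices from the stored memory instead of training on consecutive, highly correlated states:

    import numpy as np

    def random_batch(states, q_values, batch_size=128):
        # Sample a shuffled batch so consecutive (correlated) states are broken up.
        idx = np.random.choice(len(states), size=batch_size, replace=False)
        return states[idx], q_values[idx]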
Flowchart
 
- This flowchart has two main loops
- The first loop is for playing the game and recording data
    - NN estimates the Q-values and stores the game-state in the Replay Memory
 
- The second loop is activated when the Replay Memory is sufficiently full
    - First it performs a full backwards sweep through the Replay Memory to update the Q-values
    - Then it optimizes the Neural Network (a rough sketch of the combined flow follows below)
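- Roughly, the two loops fit together as in this sketch (assumed interfaces, not the tutorial's exact code):

    def play_and_train(env, model, replay_memory, epsilon_greedy, num_steps):
        """A rough sketch of the flowchart's two loops (interfaces are assumed)."""
        state = env.reset()
        for _ in range(num_steps):
            # Loop 1: play the game and record data.
            q_values = model.get_q_values(state)
            action = epsilon_greedy(q_values)
            state, reward, end_episode = env.step(action)
            replay_memory.add(state, q_values, action, reward, end_episode)

            # Loop 2: only runs once the Replay Memory is sufficiently full.
            if replay_memory.is_full():
                replay_memory.update_all_q_values()  # full backwards sweep
                model.optimize(replay_memory)        # train the Neural Network
                replay_memory.reset()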
 
Neural Network Architecture
- The Neural Network has 3 convolutional layers, all of which have filter-size 3x3
- The layers have 16, 32, and 64 output channels
- The stride is 2 in the first two convolutional layers and 1 in the last one
- Following the 3 convolutional layers there are 4 fully-connected layers, each with 1024 units and ReLU activation, and a final linear output layer with one unit per action
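- For reference, roughly the same architecture expressed with tf.keras (a sketch only; the tutorial itself builds the network with PrettyTensor, as shown in the code analysis below):

    import tensorflow as tf

    def build_q_network(num_actions, state_shape=(105, 80, 2)):
        # Three 3x3 conv layers (strides 2, 2, 1), four dense layers of 1024 units,
        # and a linear output layer with one Q-value per action.
        return tf.keras.Sequential([
            tf.keras.layers.InputLayer(input_shape=state_shape),
            tf.keras.layers.Conv2D(16, kernel_size=3, strides=2, activation='relu'),
            tf.keras.layers.Conv2D(32, kernel_size=3, strides=2, activation='relu'),
            tf.keras.layers.Conv2D(64, kernel_size=3, strides=1, activation='relu'),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(1024, activation='relu'),
            tf.keras.layers.Dense(1024, activation='relu'),
            tf.keras.layers.Dense(1024, activation='relu'),
            tf.keras.layers.Dense(1024, activation='relu'),
            tf.keras.layers.Dense(num_actions),  # linear activation for Q-values
        ])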
Code Analysis
Game Environment
env_name = 'Breakout-v0'
Download Pre-Trained Model
- You can download a TensorFlow checkpoint which holds all the pre-trained variables for the Neural Network
- Training took about 150 hours on a 2.6 GHz CPU and a GTX 1070 GPU
- The TensorFlow checkpoint can't be used with newer versions of gym and atari-py
Hyperparameters
# Description of this program.
desc = "Reinformenct Learning (Q-learning) for Atari Games using TensorFlow."
# Create the argument parser.
parser = argparse.ArgumentParser(description=desc)
# Add arguments to the parser.
parser.add_argument("--env", required=False, default='Breakout-v0',
                    help="name of the game-environment in OpenAI Gym")
parser.add_argument("--training", required=False,
                    dest='training', action='store_true',
                    help="train the agent (otherwise test the agent)")
parser.add_argument("--render", required=False,
                    dest='render', action='store_true',
                    help="render game-output to screen")
parser.add_argument("--episodes", required=False, type=int, default=None,
                    help="number of episodes to run")
parser.add_argument("--dir", required=False, default=checkpoint_base_dir,
                    help="directory for the checkpoint and log-files")
# Parse the command-line arguments.
args = parser.parse_args()
# Get the arguments.
env_name = args.env
training = args.training
render = args.render
num_episodes = args.episodes
checkpoint_base_dir = args.dir
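- For example, parsing the arguments for a training run with rendering (an illustrative argument list; normally these come from the command line):

    # Illustrative only: pass an explicit argument list instead of reading sys.argv.
    args = parser.parse_args(['--env', 'Breakout-v0', '--training', '--render'])
    print(args.env, args.training, args.render)  # Breakout-v0 True True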
Create Agent
- The Agent class implements playing the game, recording data, and optimizing the Neural Network
- training=True means a Replay Memory is created so states and Q-values can be recorded
    agent = rl.Agent(env_name=env_name, training=True, render=True, use_logging=False)
model = agent.model
replay_memory = agent.replay_memory
class Agent:
    """
    This implements the function for running the game-environment with
    an agent that uses Reinforcement Learning. This class also creates
    instances of the Replay Memory and Neural Network.
    """
    def __init__(self, env_name, training, render=False, use_logging=True):
        """
        Create an object-instance. This also creates a new object for the
        Replay Memory and the Neural Network.
        
        Replay Memory will only be allocated if training==True.
        :param env_name:
            Name of the game-environment in OpenAI Gym.
            Examples: 'Breakout-v0' and 'SpaceInvaders-v0'
        :param training:
            Boolean whether to train the agent and Neural Network (True),
            or test the agent by playing a number of episodes of the game (False).
        
        :param render:
            Boolean whether to render the game-images to screen during testing.
        :param use_logging:
            Boolean whether to use logging to text-files during training.
        """
        # Create the game-environment using OpenAI Gym.
        self.env = gym.make(env_name)
        # The number of possible actions that the agent may take in every step.
        self.num_actions = self.env.action_space.n
- The code above:
    - creates the game-environment with gym.make(env_name)
    - gets the number of possible actions from the environment's action space
 
        
        # List of string-names for the actions in the game-environment.
        self.action_names = self.env.unwrapped.get_action_meanings()
        # Epsilon-greedy policy for selecting an action from the Q-values.
        # During training the epsilon is decreased linearly over the given
        # number of iterations. During testing the fixed epsilon is used.
        self.epsilon_greedy = EpsilonGreedy(start_value=1.0,
                                            end_value=0.1,
                                            num_iterations=1e6,
                                            num_actions=self.num_actions,
                                            epsilon_testing=0.01)
- The code above:
    - creates the epsilon-greedy policy for selecting actions
    - with probability epsilon, a random action is chosen
    - otherwise the action with the highest Q-value (argmax) is chosen
        # With probability epsilon.
        if np.random.random() < epsilon:
            # Select a random action.
            action = np.random.randint(low=0, high=self.num_actions)
        else:
            # Otherwise select the action that has the highest Q-value.
            action = np.argmax(q_values)
 
        if self.training:
            # The following control-signals are only used during training.
            # The learning-rate for the optimizer decreases linearly.
            self.learning_rate_control = LinearControlSignal(start_value=1e-3,
                                                             end_value=1e-5,
                                                             num_iterations=5e6)
            # The loss-limit is used to abort the optimization whenever the
            # mean batch-loss falls below this limit.
            self.loss_limit_control = LinearControlSignal(start_value=0.1,
                                                          end_value=0.015,
                                                          num_iterations=5e6)
            # The maximum number of epochs to perform during optimization.
            # This is increased from 5 to 10 epochs, because it was found for
            # the Breakout-game that too many epochs could be harmful early
            # in the training, as it might cause over-fitting.
            # Later in the training we would occasionally get rare events
            # and would therefore have to optimize for more iterations
            # because the learning-rate had been decreased.
            self.max_epochs_control = LinearControlSignal(start_value=5.0,
                                                          end_value=10.0,
                                                          num_iterations=5e6)
            # The fraction of the replay-memory to be used.
            # Early in the training, we want to optimize more frequently
            # so the Neural Network is trained faster and the Q-values
            # are learned and updated more often. Later in the training,
            # we need more samples in the replay-memory to have sufficient
            # diversity, otherwise the Neural Network will over-fit.
            self.replay_fraction = LinearControlSignal(start_value=0.1,
                                                       end_value=1.0,
                                                       num_iterations=5e6)
       
        
            # We only create the replay-memory when we are training the agent,
            # because it requires a lot of RAM. The image-frames from the
            # game-environment are resized to 105 x 80 pixels gray-scale,
            # and each state has 2 channels (one for the recent image-frame
            # of the game-environment, and one for the motion-trace).
            # Each pixel is 1 byte, so this replay-memory needs more than
            # 3 GB RAM (105 x 80 x 2 x 200000 bytes).
            # self.replay_memory = ReplayMemory(size=200000,
            self.replay_memory = ReplayMemory(size=50000,
                                              num_actions=self.num_actions)
- The code above sets the training control signals:
    - self.learning_rate_control : decreases linearly from 1e-3 to 1e-5
    - self.loss_limit_control : decreases linearly from 0.1 to 0.015
    - self.max_epochs_control : increases linearly from 5.0 to 10.0
    - self.replay_fraction : increases linearly from 0.1 to 1.0
    - self.replay_memory : holds 50,000 states here (the original size of 200,000 states requires more than 3 GB of RAM)
 
 
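- LinearControlSignal itself is not shown in this note; a minimal sketch consistent with how it is used above (linear change from start to end over num_iterations, then constant) could look like this:

    import numpy as np

    class LinearControlSignal:
        """Minimal sketch of a linearly changing control signal
        (not necessarily the tutorial's exact implementation)."""

        def __init__(self, start_value, end_value, num_iterations):
            self.start_value = start_value
            self.end_value = end_value
            self.num_iterations = num_iterations

        def get_value(self, iteration):
            # Interpolate linearly from start_value to end_value, then stay constant.
            fraction = np.clip(iteration / self.num_iterations, 0.0, 1.0)
            return self.start_value + fraction * (self.end_value - self.start_value)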
        
        # Create the Neural Network used for estimating Q-values.
        self.model = NeuralNetwork(num_actions=self.num_actions,
                                   replay_memory=self.replay_memory)
        # Log of the rewards obtained in each episode during calls to run()
        self.episode_rewards = []
- The code above:
    - creates the Neural Network used for estimating Q-values
- The relevant parts of the NeuralNetwork class:
    class NeuralNetwork:
        # Placeholder variable for inputting states into the Neural Network.
        # A state is a multi-dimensional array holding image-frames from
        # the game-environment.
        self.x = tf.placeholder(dtype=tf.float32, shape=[None] + state_shape, name='x')

        # initial weights
        init = tf.truncated_normal_initializer(mean=0.0, stddev=2e-2)

        import prettytensor as pt

        # Wrap the input to the Neural Network in a PrettyTensor object.
        x_pretty = pt.wrap(self.x)

        # Create the convolutional Neural Network using Pretty Tensor.
        with pt.defaults_scope(activation_fn=tf.nn.relu):
            self.q_values = x_pretty. \
                conv2d(kernel=3, depth=16, stride=2, name='layer_conv1', weights=init). \
                conv2d(kernel=3, depth=32, stride=2, name='layer_conv2', weights=init). \
                conv2d(kernel=3, depth=64, stride=1, name='layer_conv3', weights=init). \
                flatten(). \
                fully_connected(size=1024, name='layer_fc1', weights=init). \
                fully_connected(size=1024, name='layer_fc2', weights=init). \
                fully_connected(size=1024, name='layer_fc3', weights=init). \
                fully_connected(size=1024, name='layer_fc4', weights=init). \
                fully_connected(size=num_actions, name='layer_fc_out', weights=init,
                                activation_fn=None)

        # Loss-function which must be optimized. This is the mean-squared
        # error between the Q-values that are output by the Neural Network
        # and the target Q-values.
        self.loss = self.q_values.l2_regression(target=self.q_values_new)

        self.optimizer = tf.train.RMSPropOptimizer(learning_rate=self.learning_rate).minimize(self.loss)

        # Used for saving and loading checkpoints.
        self.saver = tf.train.Saver()
 
Training
- The agent’s run() is used to play the game
    agent.run(num_episodes=2)
- run() function
    def run(self, num_episodes=None):
        """
        Run the game-environment and use the Neural Network to decide
        which actions to take in each step through Q-value estimates.
        
        :param num_episodes: 
            Number of episodes to process in the game-environment.
            If None then continue forever. This is useful during training
            where you might want to stop the training using Ctrl-C instead.
        """
        ... 
        # Counter for the number of states we have processed.
        # This is stored in the TensorFlow graph so it can be
        # saved and reloaded along with the checkpoint.
        count_states = self.model.get_count_states()
- The state-counter is stored as a TensorFlow variable in the NeuralNetwork class:

    class NeuralNetwork:
        self.count_states = tf.Variable(initial_value=0, trainable=False,
                                        dtype=tf.int64, name='count_states')
        ...
        while count_episodes <= num_episodes:
            if end_episode:
                # Reset the game-environment and get the first image-frame.
                img = self.env.reset()
                # Create a new motion-tracer for processing images from the
                # game-environment. Initialize with the first image-frame.
                # This resets the motion-tracer so the trace starts again.
                # This could also be done if end_life==True.
                motion_tracer = MotionTracer(img)
        ...         
- The code above:
    - end_episode starts as True, which causes a reset in the first iteration
    - gets the first image from the Gym environment and creates a MotionTracer instance
    - this resets both the game-environment and the motion-trace
    def _pre_process_image(image):
        """Pre-process a raw image from the game-environment."""

        # Convert image to gray-scale.
        img = _rgb_to_grayscale(image)

        # Resize to the desired size using SciPy for convenience.
        img = scipy.misc.imresize(img, size=state_img_size, interp='bicubic')

        return img


    class MotionTracer:
        def __init__(self, image, decay=0.75):
            """
            :param image:
                First image from the game-environment,
                used for resetting the motion detector.

            :param decay:
                Parameter for how long the tail should be on the motion-trace.
                This is a float between 0.0 and 1.0 where higher values means
                the trace / tail is longer.
            """
            # Pre-process the image and save it for later use.
            # The input image may be 8-bit integers but internally
            # we need to use floating-point to avoid image-noise
            # caused by recurrent rounding-errors.
            img = _pre_process_image(image=image)
            self.last_input = img.astype(np.float)
 
            # Get the state of the game-environment from the motion-tracer.
            # The state has two images: (1) The last image-frame from the game
            # and (2) a motion-trace that shows movement trajectories.
            state = motion_tracer.get_state()
            # Use the Neural Network to estimate the Q-values for the state.
            # Note that the function assumes an array of states and returns
            # a 2-dim array of Q-values, but we just have a single state here.
            q_values = self.model.get_q_values(states=[state])[0]
- MotionTracer.get_state() stacks the two images with np.dstack:

    class MotionTracer:
        def get_state(self):
            # Stack the last input and output images.
            state = np.dstack([self.last_input, self.last_output])

            state = state.astype(np.uint8)

            return state
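- A quick check of the resulting state layout (the frames are 105 x 80 gray-scale images, so the stacked state has shape (105, 80, 2)):

    import numpy as np

    last_input = np.zeros((105, 80))   # most recent pre-processed frame
    last_output = np.zeros((105, 80))  # motion-trace image

    state = np.dstack([last_input, last_output]).astype(np.uint8)
    print(state.shape)  # (105, 80, 2)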
            # Determine the action that the agent must take in the game-environment.
            # The epsilon is just used for printing further below.
            action, epsilon = self.epsilon_greedy.get_action(q_values=q_values,
                                                             iteration=count_states,
                                                             training=self.training)
- EpsilonGreedy.get_action(...):

    class EpsilonGreedy:
        def get_action(self, q_values, iteration, training):
            epsilon = self.get_epsilon(iteration=iteration, training=training)

            # With probability epsilon.
            if np.random.random() < epsilon:
                # Select a random action.
                action = np.random.randint(low=0, high=self.num_actions)
            else:
                # Otherwise select the action that has the highest Q-value.
                action = np.argmax(q_values)
            # Take a step in the game-environment using the given action.
            # Note that in OpenAI Gym, the step-function actually repeats the
            # action between 2 and 4 time-steps for Atari games, with the number
            # chosen at random.
            img, reward, end_episode, info = self.env.step(action=action)
            # Process the image from the game-environment in the motion-tracer.
            # This will first be used in the next iteration of the loop.
            motion_tracer.process(image=img)
- MotionTracer.process() uses np.where to threshold the frame difference:

    class MotionTracer:
        def process(self, image):
            ...
            img_dif = img - self.last_input
            img_motion = np.where(np.abs(img_dif) > 20, 255.0, 0.0)
            ...
            # Add the reward for the step to the reward for the entire episode.
            reward_episode += reward
            # Determine if a life was lost in this step.
            num_lives_new = self.get_lives()
            end_life = (num_lives_new < num_lives)
            num_lives = num_lives_new
            # Increase the counter for the number of states that have been processed.
            count_states = self.model.increase_count_states()
            ...
            # If we want to train the Neural Network to better estimate Q-values.
            if self.training:
                # Add the state of the game-environment to the replay-memory.
                self.replay_memory.add(state=state,
                                       q_values=q_values,
                                       action=action,
                                       reward=reward,
                                       end_life=end_life,
                                       end_episode=end_episode)
- The ReplayMemory class stores the states, Q-values, actions and rewards in pre-allocated arrays:

    class ReplayMemory:
        # self.num_actions = self.env.action_space.n
        ...
        self.states = np.zeros(shape=[size] + state_shape, dtype=np.uint8)
        self.q_values = np.zeros(shape=[size, num_actions], dtype=np.float)
        self.actions = np.zeros(shape=size, dtype=np.int)
        self.rewards = np.zeros(shape=size, dtype=np.float)
        ...
                # How much of the replay-memory should be used.
                use_fraction = self.replay_fraction.get_value(iteration=count_states)
                # When the replay-memory is sufficiently full.
                if self.replay_memory.is_full() \
                    or self.replay_memory.used_fraction() > use_fraction:
                    # Update all Q-values in the replay-memory through a backwards-sweep.
                    self.replay_memory.update_all_q_values()
- ReplayMemory.update_all_q_values() performs the backwards sweep:

    class ReplayMemory:
        ...
        def update_all_q_values(self):
            # Copy old Q-values so we can print their statistics later.
            # Note that the contents of the arrays are copied.
            self.q_values_old[:] = self.q_values[:]

            # num_used is the total number of stored states.
            for k in reversed(range(self.num_used - 1)):
                # Get the data for the k'th state in the replay-memory.
                action = self.actions[k]
                reward = self.rewards[k]
                end_life = self.end_life[k]
                end_episode = self.end_episode[k]

                # Calculate the Q-value for the action that was taken in this state.
                if end_life or end_episode:
                    action_value = reward
                else:
                    action_value = reward + self.discount_factor * np.max(self.q_values[k + 1])

                # Error of the Q-value that was estimated using the Neural Network.
                self.estimation_errors[k] = abs(action_value - self.q_values[k, action])

                # Update the Q-value with the better estimate.
                self.q_values[k, action] = action_value
            ...
                    ...
                    # Get the control parameters for optimization of the Neural Network.
                    # These are changed linearly depending on the state-counter.
                    learning_rate = self.learning_rate_control.get_value(iteration=count_states)
                    loss_limit = self.loss_limit_control.get_value(iteration=count_states)
                    max_epochs = self.max_epochs_control.get_value(iteration=count_states)
                    # Perform an optimization run on the Neural Network so as to
                    # improve the estimates for the Q-values.
                    # This will sample random batches from the replay-memory.
                    self.model.optimize(learning_rate=learning_rate,
                                        loss_limit=loss_limit,
                                        max_epochs=max_epochs)
- NeuralNetwork.optimize() samples random batches from the Replay Memory and runs the optimizer:

    class NeuralNetwork:
        def optimize(self, min_epochs=1.0, max_epochs=10,
                     batch_size=128, loss_limit=0.015,
                     learning_rate=1e-3):
            ...
            # Prepare the probability distribution for sampling the replay-memory.
            self.replay_memory.prepare_sampling_prob(batch_size=batch_size)

            # Number of optimization iterations corresponding to one epoch.
            iterations_per_epoch = self.replay_memory.num_used / batch_size

            # Minimum number of iterations to perform.
            min_iterations = int(iterations_per_epoch * min_epochs)

            # Maximum number of iterations to perform.
            max_iterations = int(iterations_per_epoch * max_epochs)

            # Buffer for storing the loss-values of the most recent batches.
            loss_history = np.zeros(100, dtype=float)

            for i in range(max_iterations):
                # Randomly sample a batch of states and target Q-values
                # from the replay-memory. These are the Q-values that we
                # want the Neural Network to be able to estimate.
                state_batch, q_values_batch = self.replay_memory.random_batch()

- ReplayMemory.random_batch() draws a mix of states with low and high estimation errors:

    class ReplayMemory:
        def random_batch(self):
            ...
            idx_lo = np.random.choice(self.idx_err_lo,
                                      size=self.num_samples_err_lo,
                                      replace=False)
            idx_hi = np.random.choice(self.idx_err_hi,
                                      size=self.num_samples_err_hi,
                                      replace=False)

            idx = np.concatenate((idx_lo, idx_hi))

            states_batch = self.states[idx]
            q_values_batch = self.q_values[idx]
- Back inside the loop in NeuralNetwork.optimize():

        # Create a feed-dict for inputting the data to the TensorFlow graph.
        # Note that the learning-rate is also in this feed-dict.
        feed_dict = {self.x: state_batch,
                     self.q_values_new: q_values_batch,
                     self.learning_rate: learning_rate}

        # Perform one optimization step and get the loss-value.
        loss_val, _ = self.session.run([self.loss, self.optimizer],
                                       feed_dict=feed_dict)

        # Shift the loss-history and assign the new value.
        # This causes the loss-history to only hold the most recent values.
        loss_history = np.roll(loss_history, 1)
        loss_history[0] = loss_val

        # Calculate the average loss for the previous batches.
        loss_mean = np.mean(loss_history)

        # Print status.
        pct_epoch = i / iterations_per_epoch
        msg = "\tIteration: {0} ({1:.2f} epoch), Batch loss: {2:.4f}, Mean loss: {3:.4f}"
        msg = msg.format(i, pct_epoch, loss_val, loss_mean)
        print_progress(msg)

        # Stop the optimization if we have performed the required number
        # of iterations and the loss-value is sufficiently low.
        if i > min_iterations and loss_mean < loss_limit:
            break
        
        
                    # Save a checkpoint of the Neural Network so we can reload it.
                    self.model.save_checkpoint(count_states)
                    # Reset the replay-memory. This throws away all the data we have
                    # just gathered, so we will have to fill the replay-memory again.
                    self.replay_memory.reset()
            if end_episode:
                # Add the episode's reward to a list for calculating statistics.
                self.episode_rewards.append(reward_episode)
            # Mean reward of the last 30 episodes.
            if len(self.episode_rewards) == 0:
                # The list of rewards is empty.
                reward_mean = 0.0
            else:
                reward_mean = np.mean(self.episode_rewards[-30:])
            if self.training and end_episode:
                # Log reward to file.
                if self.use_logging:
                    self.log_reward.write(count_episodes=count_episodes,
                                          count_states=count_states,
                                          reward_episode=reward_episode,
                                          reward_mean=reward_mean)
                # Print reward to screen.
                msg = "{0:4}:{1}\t Epsilon: {2:4.2f}\t Reward: {3:.1f}\t Episode Mean: {4:.1f}"
                print(msg.format(count_episodes, count_states, epsilon,
                                 reward_episode, reward_mean))
            elif not self.training and (reward != 0.0 or end_life or end_episode):
                # Print Q-values and reward to screen.
                msg = "{0:4}:{1}\tQ-min: {2:5.3f}\tQ-max: {3:5.3f}\tLives: {4}\tReward: {5:.1f}\tEpisode Mean: {6:.1f}"
                print(msg.format(count_episodes, count_states, np.min(q_values),
                                 np.max(q_values), num_lives, reward_episode, reward_mean))