Conditional Imitation Learning CARLA

6 minute read

Conditional Imitation Learning at CARLA

End-to-end Driving via Conditional Imitation Learning (Paper)

Abstract

Deep networks trained on demonstrations of human driving have learned to follow roads and avoid obstacles
However, driving policies trained via imitation learning cannot be controlled at test time
We propose to condition imitation learning on high-level command input

Introduction

Imitation learning can be used to train a model that maps perceptual inputs to control commands
for example, mapping camera images to steering and acceleration
lane following ,off-road obstacle avoidance
However, these systems have characteristic limitations
NVIDIA Implementation
- For example, the network trained by Bojarski et al. [4] was given control over
  lane and road following only. When a lane change or a turn from one road to another were required, the human driver had to take control [4]
Why has imitation learning not scaled up to fully autonomous urban driving?
- One limitation is in the assumption that the optimal action can be inferred from the perceptual input alone
- when a car approaches an intersection, the camera input is not sufficient to predict whether the car should turn left, right, or go straight
In this paper, we address this challenge with conditional imitation learning
- **At training time, the model is given not only the perceptual input and the
  control signal, but also a representation of the expert’s intention. **
- At test time,the network can be given corresponding commands, which resolve the ambiguity in the perceptuomotor mapping
- The trained network is thus freed from the task of planning and can devote its representational capacity to driving
We evaluate the presented approach in realistic simulations of urban driving and
on a 1/5 scale robotic truck

we develop a command-conditional formulation that enables the application of
imitation learning to more complex urban driving.
Another difference is that we learn to control not only steering but also acceleration and braking, enabling the model to assume full control of the car
In reinforcement learning, hierarchical approaches aim to construct multiple
levels of temporally extended sub-policies
Hierarchical approaches have also been combined with deep learning and applied to raw sensory input
In these works, the main aim is to learn purely from experience and discover hierarchical structure automatically
In contrast, we focus on imitation learning, and we provide additional information on the expert’s intentions during demonstration
This formulation makes the learning problem more tractable and yields a human-controllable policy
Parameterized goals have been used to train motion controllers in robotics
Our work shares the idea of training a conditional controller, but differs in the
model architecture,the application domain (vision-based autonomous driving), and the learning method (conditional imitation learning)
On the opposite side are end-to-end approaches that train function approximators
to map sensory input to control commands
we limit commands to a predefined vocabulary such as “turn right at the next
intersection”, “turn left at the next intersection”, and “keep straight”.

Conditional Imitation Learning

However, in more complex scenarios the assumption that the mapping of observations to actions is a function breaks down
these turn signals can be used as commands in our formulation
These test-time commands can come from a human user or a planning module
In urban driving, a typical command would be “turn right at the next
intersection”,which can be provided by a navigation system or a passenger

Methodology

observation : Image measurement : acceleration command : human interface

1) Network Architecture

In our driving experiments, the action space is continuous and
two-dimensional: steering angle and acceleration. The acceleration can be negative, which corresponds to braking or driving backwards. The command c is a categorical variable represented by a one-hot vector
We study two approaches to incorporating the command c into the network.
The first architecture is illustrated in Figure 3(a). The network takes the command as an input, alongside the image and the measurements.
which can lead to suboptimal performance in practice
We therefore designed an alternative architecture, shown in Figure 3(b)
The command c acts as a switch that selects which branch is used at any
given time. The output of the network is thus

2) Network Details

The image module consists of 8 convolutional and 2 fully connected layers. The convolution kernel size is 5 in the first layer and 3 in the following layers. The first, third,and fifth convolutional layers have a stride of 2. The number of channels increases from 32 in the first convolutional layer to 256 in the
last. Fully-connected layers contain 512 units each.
We used ReLU nonlinearities after all hidden layers, performed batch normalization after convolutional layers, applied 50% dropout after fully-connected hidden
layers, and used 20% dropout after convolutional layers
All models were trained using the Adam solver [16] with minibatches of 120 samples and an initial learning rate of 0.0002

3) Training Data Distribution

When performing imitation learning, a key decision is how to collect the training data
The simplest solution is to collect trajectories from natural demonstrations of an
expert performing the task. This typically leads to unstable policies, since a model that is only trained on expert trajectories may not learn to recover from disturbance or drift
In DAgger [29], the expert remains in the loop during the training of the controller: the controller is iteratively tested and samples from the obtained trajectories are re-labeled by the expert
In this paper we adopt a three-camera setup inspired by Bojarski

4) Data Augmentation

We found data augmentation to be crucial for good gen-eralization
Transformations include change in contrast, brightness, and tone, as well as addition of Gaussian blur, Gaussian noise, salt-and-pepper noise, and region dropout

System Setup

We evaluated the presented approach in a simulated urban environment and on a
physical system – a 1/5 scale truck. In both cases, the observations
(images) are recorded by one central camera and two lateral cameras
rotated by 30 degrees with respect to the center.
The recorded control signal is two-dimensional: steering angle and
acceleration
During training data collection, when approaching an intersection the
driver uses buttons on a physical steering wheel (when driving in simulation) or on the remote control (when operating the physical truck) to indicate the command corresponding to the intended course of action. The driver indicates the
command when the intended action becomes clear, akin to turn indicators in cars or navigation instructions provided by mapping applications. This way we collect realistic data that reflects how a higher level planner or a human could direct the system

1) Simulated Environment

We use CARLA [10],. CARLA is an open-source simulator implemented using Unreal Engine 4.
In order to collect training data, a human driver is presented with a
first-person view of the environment (center camera) at a resolution of 800×600 pixels. The driver controls the simulated vehicle using a physical steering wheel and pedals, and provides command input using buttons on the steering wheel. The driver keeps the car at a speed below 60 km/h and strives to avoid collisions with cars and pedestrians, but ignores traffic lights and stop signs. We record images from the three simulated cameras, along with other measurements such as speed and the position of the car. The images are cropped to remove part of the sky. CARLA also provides extra
information such as distance travelled, collisions, and the occurrence of
infractions such as drift onto the opposite lane or the sidewalk. This
information is used in evaluating different controller

2) Physical System

The setup of the physical system is shown in Figure 6. We equipped an
off-the-shelf 1/5 scale truck (Traxxas Maxx) with an embedded computer (Nvidia TX2), three low-cost webcams, a flight controller (Holybro Pixhawk) running the APMRover firmware, and supporting electronics.

Experiments

1) Simulated Environment

The use of the CARLA simulator enables running the evaluation in an episodic
setup

Share on

Twitter Facebook Google+ LinkedIn

Mr Ko

Conditional Imitation Learning CARLA

Conditional Imitation Learning at CARLA

End-to-end Driving via Conditional Imitation Learning (Paper)

Abstract

Introduction

Conditional Imitation Learning

Methodology

System Setup

Experiments

Share on

You May Also Enjoy

Reinforcement Learning for Market

Elf three rings

Keras Image Generator

Nvidia Self Driving Car Model

Mr Ko

Conditional Imitation Learning at CARLA

End-to-end Driving via Conditional Imitation Learning (Paper)

Abstract

Introduction

Related work

Conditional Imitation Learning

Methodology

System Setup

Experiments

Share on

You May Also Enjoy

Reinforcement Learning for Market

Elf three rings

Keras Image Generator

Nvidia Self Driving Car Model