Policy Gradient Basics
TensorFlow gradient explanation
TF uses automatic differentiation, and more specifically reverse-mode automatic differentiation.
There are three popular methods to calculate derivatives:
- Numerical differentiation
Numerical differentiation relies on the definition of the derivative: f'(x) ≈ (f(x + h) - f(x)) / h, where you take a very small h and evaluate the function at two points. This is the most basic formula; in practice people use other formulas that give a smaller estimation error. This way of calculating a derivative is suitable mostly when you do not know your function and can only sample it. It also requires a lot of computation for a high-dimensional function.
- Symbolic differentiation
Symbolic differentiation manipulates mathematical expressions directly. If you have ever used MATLAB or Mathematica, you have seen this kind of symbolic output.
Here, the derivative of every elementary expression is known, and various rules (product rule, chain rule) are used to calculate the derivative of the composite expression. The end expression is then simplified to obtain the result.
- Automatic differentiation (this is what TensorFlow's gradient machinery and optimizers use)
Automatic differentiation manipulates blocks of computer programs. A differentiator has rules for taking the derivative of each element of a program (when you define any op in core TF, you have to register a gradient for that op), and it uses the chain rule to break complex expressions into simpler ones. The references below give a good explanation of how this works in real TF programs; a small sketch contrasting numerical and automatic differentiation follows, and after that a simple policy-gradient implementation in TensorFlow.
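A minimal sketch of the contrast, assuming the TF 1.x graph-mode API and a toy function f(x) = x^2 + 3x chosen only for illustration (its analytic derivative is 2x + 3):

import tensorflow as tf

# toy function f(x) = x^2 + 3x
x = tf.placeholder(tf.float32, shape=())
f = tf.square(x) + 3.0 * x

# automatic differentiation: TF composes the registered gradients of square, mul and add
df_dx = tf.gradients(f, x)[0]

with tf.Session() as sess:
    auto_grad = sess.run(df_dx, feed_dict={x: 2.0})    # -> 7.0

# numerical differentiation of the same function with a small step h
h = 1e-5
num_grad = (((2.0 + h) ** 2 + 3.0 * (2.0 + h)) - ((2.0 - h) ** 2 + 3.0 * (2.0 - h))) / (2.0 * h)    # ~ 7.0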
TensorFlow only needs the objective J to be defined as part of the graph; the optimizer then computes the gradient and applies it, as in the simple policy-gradient example below.
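The original snippet is not reproduced here, so what follows is a minimal REINFORCE-style sketch under the same TF 1.x assumptions; NUM_STATE and NUM_ACTIONS are the state and action sizes used later in this section, and the placeholder names are illustrative:

import tensorflow as tf

# a batch of states, the actions that were taken in them, and the observed returns
states  = tf.placeholder(tf.float32, [None, NUM_STATE])
actions = tf.placeholder(tf.int32,   [None])
returns = tf.placeholder(tf.float32, [None])

# a small policy network producing log-probabilities over actions
hidden    = tf.layers.dense(states, 16, activation=tf.nn.relu)
logits    = tf.layers.dense(hidden, NUM_ACTIONS)
log_probs = tf.nn.log_softmax(logits)

# log pi(a|s) for the actions that were actually taken
idx             = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
taken_log_probs = tf.gather_nd(log_probs, idx)

# J = E[ log pi(a|s) * R ]; minimizing -J is gradient ascent on J,
# and the optimizer derives every gradient automatically from the graph
J        = tf.reduce_mean(taken_log_probs * returns)
train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(-J)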
Neural Network
The core of our new agent is a neural network that decides what to do in a given situation. There are two sets of outputs – the policy itself and the value function.
Neural network architecture
It is defined very easily with Keras:
from keras.models import Model
from keras.layers import Input, Dense

# NUM_STATE and NUM_ACTIONS are the sizes of the state and action spaces
l_input = Input(batch_shape=(None, NUM_STATE))
l_dense = Dense(16, activation='relu')(l_input)
out_actions = Dense(NUM_ACTIONS, activation='softmax')(l_dense)   # policy head
out_value = Dense(1, activation='linear')(l_dense)                # value head
model = Model(inputs=[l_input], outputs=[out_actions, out_value])
The policy output goes through a softmax activation to make it a valid probability distribution. The output for the value function is linear, because the value estimate must be able to take any real value.
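To show how these two heads plug into the policy-gradient update, here is a minimal sketch, assuming the TF 1.x placeholder API and reusing the model defined above; the names s_t, a_t (one-hot actions) and r_t (discounted return) are illustrative, not taken from the original code:

import tensorflow as tf

s_t = tf.placeholder(tf.float32, shape=(None, NUM_STATE))
a_t = tf.placeholder(tf.float32, shape=(None, NUM_ACTIONS))   # actions, one-hot encoded
r_t = tf.placeholder(tf.float32, shape=(None, 1))             # discounted return

p, v = model(s_t)   # policy probabilities and state value from the network above

log_prob  = tf.log(tf.reduce_sum(p * a_t, axis=1, keepdims=True) + 1e-10)
advantage = r_t - v

loss_policy = -log_prob * tf.stop_gradient(advantage)   # policy-gradient term
loss_value  = 0.5 * tf.square(advantage)                # value-function regression
loss_total  = tf.reduce_mean(loss_policy + loss_value)

minimize = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss_total)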
Reference links
tensorflow differentiation explain
stanford policy gradient tutorial
policy gradient tensorflow impl
Implementing Deep Reinforcement Learning Models with Tensorflow + OpenAI Gym