Deep Deterministic Policy Gradient
- After Deep Q-Network (DQN) became a hit, people realized that deep learning could be used to solve high-dimensional problems.
\begin{equation} y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'}) \end{equation}
\begin{equation} L = \frac{1}{N} \sum_{i} \big( y_i - Q(s_i, a_i | \theta^{Q}) \big)^2 \end{equation}
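To make the target and the critic loss above concrete, here is a minimal sketch in plain Python. The toy functions `target_actor`, `target_critic`, and `critic` are hypothetical stand-ins for the networks $\mu'$, $Q'$, and $Q$, and the four-transition-free batch of two samples is made up for illustration. Like the equations, the sketch does not treat terminal states specially.

```python
def td_targets(rewards, next_states, gamma, target_actor, target_critic):
    """y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))."""
    return [
        r + gamma * target_critic(s_next, target_actor(s_next))
        for r, s_next in zip(rewards, next_states)
    ]


def critic_loss(targets, states, actions, critic):
    """L = (1/N) * sum_i (y_i - Q(s_i, a_i))^2."""
    squared_errors = [
        (y - critic(s, a)) ** 2
        for y, s, a in zip(targets, states, actions)
    ]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical toy "networks" with scalar states and actions.
def target_actor(s):          # stands in for mu'(s | theta^{mu'})
    return 0.5 * s


def target_critic(s, a):      # stands in for Q'(s, a | theta^{Q'})
    return s + 2.0 * a


def critic(s, a):             # stands in for Q(s, a | theta^Q)
    return s + a


# A made-up minibatch of N = 2 transitions (s_i, a_i, r_i, s_{i+1}).
states = [1.0, 2.0]
actions = [0.0, 1.0]
rewards = [1.0, 0.0]
next_states = [2.0, 4.0]
gamma = 0.5

targets = td_targets(rewards, next_states, gamma, target_actor, target_critic)
loss = critic_loss(targets, states, actions, critic)
print(targets, loss)
```

Note that the square applies to the whole difference $y_i - Q(s_i, a_i)$, and that the targets come only from the primed (target) functions, so they stay fixed while the critic is being fitted.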
\begin{equation} = \mathbb{E}_{\mu'} \big[ \nabla_{a} Q(s, a|\theta^{Q})|_{s=s_t, a=\mu(s_t)} \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})|_{s=s_t} \big] \end{equation}
\begin{equation} \theta^{\mu}_{k + 1} = \theta^{\mu}_k + \alpha \mathbb{E}_{\mu'^{k}} \big[ \nabla_{\theta} Q(s, \mu(s|\theta^{\mu}_k)|\theta^{Q}_k) \big] \end{equation}
\begin{equation} \theta^{\mu}_{k + 1} = \theta^{\mu}_k + \alpha \mathbb{E}_{\mu'^{k}} \big[ \nabla_{a} Q(s, a|\theta^{Q}_k)|_{a=\mu(s|\theta^{\mu}_k)} \nabla_{\theta} \mu(s|\theta^{\mu}_k) \big] \end{equation}
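The two update rules above are the same step written two ways: the second applies the chain rule to the first. The sketch below checks this numerically and then runs the update. It assumes a hypothetical scalar critic $Q(s, a) = -(a - 2s)^2$ and a linear actor $\mu(s|\theta) = \theta s$, both chosen only so the gradients can be written by hand. The expectation is replaced by an average over a small made-up batch of states.

```python
def q_value(s, a):
    """Hypothetical critic: largest when a == 2 * s."""
    return -(a - 2.0 * s) ** 2


def grad_a_q(s, a):
    """Gradient of Q with respect to the action a."""
    return -2.0 * (a - 2.0 * s)


def mu(s, theta):
    """Hypothetical deterministic actor with one parameter theta."""
    return theta * s


def grad_theta_mu(s, theta):
    """Gradient of mu with respect to theta."""
    return s


def chain_rule_grad(s, theta):
    """grad_a Q(s, a) at a = mu(s), times grad_theta mu(s)."""
    return grad_a_q(s, mu(s, theta)) * grad_theta_mu(s, theta)


def direct_grad(s, theta, h=1e-5):
    """Central finite difference of theta -> Q(s, mu(s | theta))."""
    upper = q_value(s, mu(s, theta + h))
    lower = q_value(s, mu(s, theta - h))
    return (upper - lower) / (2.0 * h)


def actor_step(theta, states, alpha):
    """theta_{k+1} = theta_k + alpha * mean over states of the chain-rule gradient."""
    grads = [chain_rule_grad(s, theta) for s in states]
    return theta + alpha * sum(grads) / len(grads)


# Both forms of the gradient agree at an arbitrary point.
gap = abs(chain_rule_grad(1.0, 0.5) - direct_grad(1.0, 0.5))

# Gradient ascent on the actor parameter over a made-up batch of states.
states = [0.5, 1.0, 1.5]
theta = 0.0
for _ in range(200):
    theta = actor_step(theta, states, alpha=0.1)
print(gap, theta)
```

With this toy critic the best action is $a = 2s$, so the actor parameter should approach $\theta = 2$. In a real implementation an automatic differentiation library would compute both gradient factors, and the critic would be learned rather than fixed.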