16 Mar 2017

# Model-free control algorithms for deep reinforcement learning: Similarities and differences (WIP)

Conventions:

- Sets are represented by $\mathcal{A, S}$,… (calligraphic font)
- Vectors are represented by $\mathbf{w}, \boldsymbol{\theta}$,… (bold font)
- Functions are represented by $Q,\hat Q, V, \hat V,$… (capital letters)
- Random variables are represented by $s,a,r$… (lower case letters)

**Note**: This post compares various model-free control algorithms in (deep) reinforcement learning, especially those using function approximation, to highlight their similarities and differences. It is not intended to be a primer or a comprehensive refresher. Please refer to Sutton & Barto, 2017 for a complete treatment.

equivalently,

\[q_{t,\mathbf w_t}^{(n)}= \sum_{i=1}^n \gamma^{i-1} * r_{t+i} + \gamma^{n} * \hat Q(s_{t+n},a_{t+n}, \mathbf w_t)\]

At time step $t+1$, the error at time step $t$ is used to adjust the estimate $\hat Q(s_t,a_t,\mathbf{w_t})$ through the following update of the weights of the Q-value function approximator:

\[\Delta \mathbf{w_{t+1}}= \alpha * [ r_{t+1} + \gamma * \hat Q(s_{t+1}, a_{t+1} , \mathbf{w_t}) - \hat Q(s_t,a_t,\mathbf{w_t})] \nabla_{\mathbf w} \hat Q(s_t,a_t,\mathbf{w_t})\]

This is similar to the $TD(0)$ update step for the state-value function, except that the bootstrap term uses $a_{t+1}$, the action actually selected in state $s_{t+1}$.
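As a rough illustration of the update above, here is a minimal Python sketch. It assumes a linear approximator $\hat Q(s,a,\mathbf w)=\mathbf w^\top \phi(s,a)$, so the gradient with respect to $\mathbf w$ is just the feature vector $\phi(s_t,a_t)$. The feature vectors and the helper names (`q_hat`, `n_step_target`, `sarsa_update`) are hypothetical and chosen only for this example. A deep network would replace the hand-written gradient with automatic differentiation.

```python
def q_hat(features, w):
    """Linear approximator (an assumption): Q_hat(s, a, w) = w . phi(s, a)."""
    return sum(wi * fi for wi, fi in zip(w, features))


def n_step_target(rewards, gamma, bootstrap_q):
    """n-step return: sum_{i=1..n} gamma^(i-1) * r_{t+i} + gamma^n * Q_hat(s_{t+n}, a_{t+n}, w_t).

    `rewards` holds r_{t+1}, ..., r_{t+n}; `bootstrap_q` is Q_hat at (s_{t+n}, a_{t+n}).
    """
    n = len(rewards)
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards))
    return discounted + gamma ** n * bootstrap_q


def sarsa_update(w, phi_t, r_next, phi_next, alpha, gamma):
    """One semi-gradient update using the one-step (n = 1) target.

    phi_t    : features of (s_t, a_t)
    phi_next : features of (s_{t+1}, a_{t+1}), the action actually taken next
    """
    target = n_step_target([r_next], gamma, q_hat(phi_next, w))
    td_error = target - q_hat(phi_t, w)
    # For a linear approximator, grad_w Q_hat(s_t, a_t, w) = phi_t.
    return [wi + alpha * td_error * fi for wi, fi in zip(w, phi_t)]


# Toy usage with made-up numbers.
w_new = sarsa_update([0.0, 0.0], phi_t=[1.0, 0.0], r_next=1.0,
                     phi_next=[0.0, 1.0], alpha=0.5, gamma=0.9)
print(w_new)  # [0.5, 0.0]
```

Only the weights attached to the active features of $(s_t,a_t)$ move, and the target is treated as a constant when taking the gradient, which is why this is usually called a semi-gradient update.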

At time step $t+1$, the error at time step $t$ is used to adjust the estimate $\hat Q(s_t,a_t,\mathbf{w_t})$ through the following update of the weights of the Q-value function approximator:

\[\Delta \mathbf{w_{t+1}}= \alpha * [ r_{t+1} + \gamma * \max\limits_{a\in \mathcal A} \hat Q(s_{t+1}, a , \mathbf{w_t}) - \hat Q(s_t,a_t,\mathbf{w_t})] \nabla_{\mathbf w} \hat Q(s_t,a_t,\mathbf{w_t})\]

This is similar to the $TD(0)$/SARSA update step except for the $\max_{a\in\mathcal A}$ over the $\hat Q$ function: the target uses the $\hat Q$ value of the action that maximizes $\hat Q$ in state $s_{t+1}$, or equivalently, $\hat Q(s_{t+1},\arg\max_{a'}\hat Q(s_{t+1},a',\mathbf{w_t}),\mathbf{w_t})$.
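The max-based update can be sketched the same way. This is a minimal Python example, again assuming a linear approximator $\hat Q(s,a,\mathbf w)=\mathbf w^\top \phi(s,a)$ and a small discrete action set. The names `q_hat` and `max_target_update`, the action labels, and the feature vectors are all hypothetical.

```python
def q_hat(features, w):
    """Linear approximator (an assumption): Q_hat(s, a, w) = w . phi(s, a)."""
    return sum(wi * fi for wi, fi in zip(w, features))


def max_target_update(w, phi_t, r_next, phi_next_by_action, alpha, gamma):
    """One semi-gradient update whose target takes the max over next actions.

    phi_next_by_action maps each action a to the features of (s_{t+1}, a).
    Returns the new weights and the greedy action used in the target.
    """
    # argmax_a Q_hat(s_{t+1}, a, w_t); ties go to the first action encountered.
    greedy_action = max(phi_next_by_action,
                        key=lambda a: q_hat(phi_next_by_action[a], w))
    max_q = q_hat(phi_next_by_action[greedy_action], w)
    td_error = r_next + gamma * max_q - q_hat(phi_t, w)
    # For a linear approximator, grad_w Q_hat(s_t, a_t, w) = phi_t.
    new_w = [wi + alpha * td_error * fi for wi, fi in zip(w, phi_t)]
    return new_w, greedy_action


# Toy usage with made-up numbers.
next_features = {"left": [1.0, 0.0], "right": [0.0, 1.0]}
w_new, greedy = max_target_update([1.0, 2.0], phi_t=[1.0, 0.0], r_next=1.0,
                                  phi_next_by_action=next_features,
                                  alpha=0.5, gamma=0.5)
print(w_new, greedy)  # [1.5, 2.0] right
```

The only difference from the SARSA-style sketch is the target: it does not depend on which action the agent goes on to take in $s_{t+1}$, only on the current greedy estimate.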