Conventions:
Note: This post compares the differences and similarities of various model-free control algorithms in (deep) reinforcement learning, especially with function approximation. It is not intended to be a primer or a comprehensive refresher; please refer to Sutton & Barto, 2017 for a complete treatment.
equivalently,
\[q_{t,\mathbf{w_t}}^{(n)}= \sum_{i=1}^n \gamma^{i-1} r_{t+i} + \gamma^{n} \hat Q(s_{t+n},a_{t+n}, \mathbf{w_t})\]
At time step $t+1$, the error at time step $t$ is used to adjust the estimate $\hat Q(s_t,a_t,\mathbf{w_t})$ through the following update of the weights of the Q-value function approximator:
\[\Delta \mathbf{w_{t+1}}= \alpha [ r_{t+1} + \gamma \hat Q(s_{t+1}, a_{t+1} , \mathbf{w_t}) - \hat Q(s_t,a_t,\mathbf{w_t})] \nabla_{\mathbf w} \hat Q(s_t,a_t,\mathbf{w_t})\]
This is the SARSA update; it is similar to the $TD(0)$ update step for the state-value function, except that the bootstrap target uses the action-value estimate of the next state-action pair, $\hat Q(s_{t+1},a_{t+1},\mathbf{w_t})$.
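To make the update concrete, here is a minimal sketch in Python, assuming a linear approximator $\hat Q(s,a,\mathbf w)=\mathbf w^\top \phi(s,a)$ with one-hot features over a small discrete state/action space; the sizes `N_STATES`/`N_ACTIONS` and the feature map `phi` are illustrative assumptions, not part of the post:

```python
import numpy as np

# Illustrative problem sizes (assumptions for the sketch).
N_STATES, N_ACTIONS = 10, 4

def phi(s, a):
    """One-hot feature vector for the (state, action) pair."""
    x = np.zeros(N_STATES * N_ACTIONS)
    x[s * N_ACTIONS + a] = 1.0
    return x

def q_hat(s, a, w):
    """Linear action-value estimate: Q_hat(s, a, w) = w . phi(s, a)."""
    return w @ phi(s, a)

def sarsa_update(w, s_t, a_t, r_tp1, s_tp1, a_tp1, alpha=0.1, gamma=0.99):
    """One SARSA step: w <- w + alpha * delta * grad_w Q_hat(s_t, a_t, w).

    The TD error delta bootstraps on the action a_{t+1} actually taken in
    s_{t+1}; for a linear approximator, grad_w Q_hat(s_t, a_t, w) = phi(s_t, a_t).
    """
    delta = r_tp1 + gamma * q_hat(s_tp1, a_tp1, w) - q_hat(s_t, a_t, w)
    return w + alpha * delta * phi(s_t, a_t)

# Usage: start from zero weights and apply the update to one observed
# transition (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}).
w = np.zeros(N_STATES * N_ACTIONS)
w = sarsa_update(w, s_t=0, a_t=1, r_tp1=1.0, s_tp1=2, a_tp1=3)
```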
At time step $t+1$, the error at time step $t$ is used to adjust the estimate $\hat Q(s_t,a_t,\mathbf{w_t})$ through the following update of the weights of the Q-value function approximator:
\[\Delta \mathbf{w_{t+1}}= \alpha [ r_{t+1} + \gamma \max\limits_{a\in \mathcal A} \hat Q(s_{t+1}, a , \mathbf{w_t}) - \hat Q(s_t,a_t,\mathbf{w_t})] \nabla_{\mathbf w} \hat Q(s_t,a_t,\mathbf{w_t})\]
This is similar to the $TD(0)$/SARSA update step except for the $\max_{a\in\mathcal A}$ over the $\hat Q$ function: the bootstrap target uses the $\hat Q$ value of the action that yields the maximum estimated value in state $s_{t+1}$, or equivalently, $\hat Q(s_{t+1}, \operatorname{argmax}_{a'} \hat Q(s_{t+1},a',\mathbf{w_t}), \mathbf{w_t})$.
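Under the same illustrative linear setup (reusing the `phi` and `q_hat` helpers from the SARSA sketch above), the Q-learning update differs only in the bootstrap target, which takes a max over actions rather than using the action actually taken next; this is what makes Q-learning off-policy:

```python
def greedy_value(s, w):
    """max_a Q_hat(s, a, w): value of the greedy action in state s."""
    return max(q_hat(s, a, w) for a in range(N_ACTIONS))

def q_learning_update(w, s_t, a_t, r_tp1, s_tp1, alpha=0.1, gamma=0.99):
    """One Q-learning step: same form as SARSA, but the target bootstraps on
    max_a Q_hat(s_{t+1}, a, w_t) rather than on Q_hat(s_{t+1}, a_{t+1}, w_t)."""
    delta = r_tp1 + gamma * greedy_value(s_tp1, w) - q_hat(s_t, a_t, w)
    return w + alpha * delta * phi(s_t, a_t)
```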