November 13, 2025
Misraj AI
AI
After learning all those algorithms during this course, in this article I am not going to introduce a new algorithm; instead, I am going to combine several of the previous algorithms to improve accuracy, just as we did when we combined the Prioritized Experience Replay buffer with Double DQN.
This article provides an implementation and explanation of “[1] Matteo Hessel et al., 2017. Rainbow: Combining Improvements in Deep Reinforcement Learning.”
I am not going to explain the algorithms used in that paper in much detail, because I have written a separate article about each one.
Double Q-learning
Conventional Q-learning is affected by an overestimation bias, due to the maximization step in the Q-learning target, and this can harm learning. Double Q-learning (van Hasselt 2010) addresses this overestimation by decoupling, in the maximization performed for the bootstrap target, the selection of the action from its evaluation. It can be effectively combined with DQN (van Hasselt, Guez, and Silver 2016), using the target
Y_t^{\text{Double}} = R_{t+1} + \gamma \, Q\big(S_{t+1}, \operatorname{argmax}_a Q(S_{t+1}, a; \theta^{+}); \theta^{-}\big)
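To make the decoupling concrete, here is a minimal PyTorch sketch of this target. The names (online_net for θ⁺, target_net for θ⁻, dones as a 0/1 terminal mask) are my own illustrative assumptions, not code taken from the paper or the repository.

```python
import torch

def double_dqn_target(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Compute the Double DQN target for a batch of transitions."""
    with torch.no_grad():
        # Select the next action with the online network (theta+)...
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # ...but evaluate that action with the target network (theta-).
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q
```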
Prioritized replay
DQN samples uniformly from the replay buffer. Ideally, we want to sample more frequently those transitions from which there is much to learn. As a proxy for learning potential, prioritized experience replay (Schaul et al. 2015) samples transitions with probability proportional to the last encountered absolute TD error.
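Below is a rough sketch of a proportional prioritized buffer, just to show the sampling rule and the importance-sampling correction. A real implementation uses a sum-tree for efficiency, and all names and default values here (alpha, beta, capacity) are my own assumptions.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Minimal proportional prioritized replay (no sum-tree), for illustration only."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha          # how strongly priorities skew sampling (0 = uniform)
        self.buffer = []
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.pos = 0

    def add(self, transition):
        # New transitions get the current maximum priority so they are seen at least once.
        max_prio = self.priorities.max() if self.buffer else 1.0
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.pos] = transition
        self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        prios = self.priorities[:len(self.buffer)] ** self.alpha
        probs = prios / prios.sum()
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        # Importance-sampling weights correct the bias from non-uniform sampling.
        weights = (len(self.buffer) * probs[indices]) ** (-beta)
        weights /= weights.max()
        return [self.buffer[i] for i in indices], indices, weights

    def update_priorities(self, indices, td_errors, eps=1e-6):
        # Priority is the absolute TD error (plus a small constant so it never hits zero).
        self.priorities[indices] = np.abs(td_errors) + eps
```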
Dueling networks
The dueling network is a neural network architecture designed for value-based RL. It features two streams of computation, the value and advantage streams, sharing a convolutional encoder and merged by a special aggregator (Wang et al. 2016).
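As a quick illustration of the aggregator, here is a small PyTorch dueling head. The layer sizes are my own assumptions; the mean-advantage subtraction is the identifiability trick used by Wang et al. (2016).

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling aggregator: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, feature_dim, num_actions, hidden=128):
        super().__init__()
        # Both streams consume the same encoder features (e.g. a shared conv trunk).
        self.value = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_actions))

    def forward(self, features):
        v = self.value(features)        # shape (batch, 1)
        a = self.advantage(features)    # shape (batch, num_actions)
        # Subtracting the mean advantage makes the V/A decomposition identifiable.
        return v + a - a.mean(dim=1, keepdim=True)
```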
Multi-step learning
Q-learning accumulates a single reward and then uses the greedy action at the next step to bootstrap. Alternatively, forward-view multi-step targets can be used (Sutton 1988). We define the truncated n-step return from a given state S_t as:
R_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} R_{t+k+1}
A multi-step variant of DQN is then defined by minimizing the alternative loss,
\big( R_t^{(n)} + \gamma^{n} \max_{a'} Q(S_{t+n}, a'; \theta^{-}) - Q(S_t, A_t; \theta^{+}) \big)^2
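Here is a small sketch of how the n-step return can be accumulated from a sliding window of transitions. The (state, action, reward, next_state, done) tuple layout and the function name are my own assumptions.

```python
from collections import deque

def n_step_transition(window: deque, gamma: float):
    """Collapse n consecutive (s, a, r, s_next, done) steps into one n-step
    transition (S_t, A_t, R_t^(n), S_{t+n}, done), stopping early if the episode ends."""
    state, action = window[0][0], window[0][1]
    n_step_return, discount = 0.0, 1.0
    for (_, _, reward, next_state, done) in window:
        n_step_return += discount * reward
        discount *= gamma
        if done:
            break
    return state, action, n_step_return, next_state, done
```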
Distributional RL
We can learn to approximate the distribution of returns instead of the expected return. Bellemare, Dabney, and Munos (2017) proposed to model such distributions with probability masses placed on a discrete support.
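Here is a minimal sketch of a categorical (C51-style) output head that places probability mass on a fixed support. The number of atoms and the support range [v_min, v_max] below are illustrative defaults of mine, not values prescribed by this article.

```python
import torch
import torch.nn as nn

class CategoricalQHead(nn.Module):
    """Predicts a categorical distribution over returns on a fixed discrete support,
    in the style of Bellemare, Dabney, and Munos (2017)."""

    def __init__(self, feature_dim, num_actions, num_atoms=51, v_min=-10.0, v_max=10.0):
        super().__init__()
        self.num_actions, self.num_atoms = num_actions, num_atoms
        # Fixed support z_1..z_N spanning [v_min, v_max].
        self.register_buffer("support", torch.linspace(v_min, v_max, num_atoms))
        self.logits = nn.Linear(feature_dim, num_actions * num_atoms)

    def forward(self, features):
        logits = self.logits(features).view(-1, self.num_actions, self.num_atoms)
        probs = torch.softmax(logits, dim=-1)          # one distribution per action
        q_values = (probs * self.support).sum(dim=-1)  # expected return per action
        return probs, q_values
```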
Noisy Nets
The limitations of exploring with ε-greedy policies are clear in games such as Montezuma’s Revenge, where many actions must be executed before collecting the first reward. Noisy Nets (Fortunato et al. 2017) propose a noisy linear layer that combines a deterministic and a noisy stream.
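To show what such a layer looks like, here is a sketch of a noisy linear layer with factorised Gaussian noise. The initialisation constants follow common practice and are my own assumption rather than something specified in this article.

```python
import math
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Noisy linear layer with factorised Gaussian noise (in the style of Fortunato et al. 2017):
    y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b)."""

    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features))
        self.sigma_w = nn.Parameter(torch.empty(out_features, in_features))
        self.mu_b = nn.Parameter(torch.empty(out_features))
        self.sigma_b = nn.Parameter(torch.empty(out_features))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.mu_w, -bound, bound)
        nn.init.uniform_(self.mu_b, -bound, bound)
        nn.init.constant_(self.sigma_w, sigma0 * bound)
        nn.init.constant_(self.sigma_b, sigma0 * bound)

    @staticmethod
    def _f(x):
        # Noise transform f(x) = sign(x) * sqrt(|x|) used for factorised noise.
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        # Factorised noise: one noise vector per input unit and one per output unit.
        eps_in = self._f(torch.randn(self.in_features, device=x.device))
        eps_out = self._f(torch.randn(self.out_features, device=x.device))
        weight = self.mu_w + self.sigma_w * torch.outer(eps_out, eps_in)
        bias = self.mu_b + self.sigma_b * eps_out
        return nn.functional.linear(x, weight, bias)
```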
I think this would be enough; for more information, please refer to the previous articles.
You can get the full code on GitHub.
I hope you find this article useful. I am very sorry if there are any errors or misspellings in the article; please do not hesitate to contact me or drop a comment to correct things.
Khalil Hennara
AI Engineer at MISRAJ