November 12, 2025
Misraj AI
AI
In this article, we will implement Double DQN with proportional prioritization from Schaul et al. (2016), as it outperforms DDQN with uniform sampling of transitions from the replay buffer.
As we showed in the previous lecture, experience replay uses a replay buffer that breaks temporal correlations by mixing more and less recent experience into the updates, so rare experience is used for more than just a single update. That was the approach of the Deep Q-Network (DQN) algorithm (Mnih et al., 2013; 2015), which stabilized the training of a value function, represented by a deep neural network, by using experience replay.
More specifically, experiences are considered important if they are likely to lead to fast learning progress. While this measure is not directly accessible, a reasonable proxy is the magnitude of a transition's TD error δ, which indicates how surprising or unexpected the transition is: specifically, how far the value is from its next-step bootstrap estimate.
This idea is almost the same as boosting, where we build a new classifier for the examples we got wrong and train on those mispredicted examples; but this may lead to over-fitting, because we train on some examples more than others.
As with boosting, this happens with greedy TD-error prioritization. First, to avoid expensive sweeps over the entire replay memory, the TD errors are only updated for the transitions that are replayed; one disadvantage is that transitions with a low TD error on their first visit may not be replayed for a long time. Furthermore, greedy prioritization focuses on a small subset of the experience: errors shrink slowly, especially when using function approximation, so the initially high-error transitions get replayed frequently. This lack of diversity makes the system prone to over-fitting.
To overcome these issues, we introduce a stochastic sampling method that interpolates between pure greedy prioritization and uniform random sampling. We ensure that the probability of being sampled is monotonic in a transition’s priority while guaranteeing a non-zero probability even for the lowest-priority transition. Concretely, we define the probability of sampling transition i as
P(i) = pᵢ^α / ∑ₖ pₖ^α
where pᵢ > 0 is the priority of transition i. The exponent α determines how much prioritization is used, with α = 0 corresponding to the uniform case. So α becomes another hyperparameter we need to tune; for the Atari games the authors of the paper found α = 0.6 to work best, but for other problems we should tune its value.
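As a quick illustration (not from the original code; the priority values below are made up), the sampling probability can be computed from the stored priorities like this:

import numpy as np

# Hypothetical priorities of five stored transitions (illustrative values only).
priorities = np.array([0.1, 2.0, 0.5, 3.0, 0.01])
alpha = 0.6  # the value the paper reports for Atari

scaled = priorities ** alpha
probs = scaled / scaled.sum()  # P(i) = p_i^alpha / sum_k p_k^alpha

# Sample a mini-batch of 3 indices according to these probabilities.
batch_indices = np.random.choice(len(priorities), size=3, p=probs)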
Important Note:
pᵢ > 0 is based on the TD error, which by definition is δⱼ = Rⱼ + γⱼ Q(sⱼ₊₁, argmaxₐ Q(sⱼ₊₁, a, θ⁺), θ⁻) − Q(sⱼ, aⱼ, θ⁺). In practice we will use the error reported by the neural network, which is the squared TD error, because the target of our model is Rⱼ + γⱼ Q(sⱼ₊₁, argmaxₐ Q(sⱼ₊₁, a, θ⁺), θ⁻), i.e. the same expression without the last term, which is the output of the model. We also add a small positive constant ε that prevents the edge case of transitions never being revisited once their error is zero.
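A tiny sketch of that note in code (the constant value and the helper name compute_priorities are assumptions for illustration):

import numpy as np

EPSILON = 1e-6  # small constant so a zero-error transition keeps a non-zero priority

def compute_priorities(losses):
    # `losses` are the per-sample network losses, i.e. the squared TD errors,
    # which we use directly as priorities after adding EPSILON.
    return np.asarray(losses) + EPSILON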
Now that we have prioritized the transitions, a transition with high priority will be sampled more often until its error goes down, and that may lead to over-fitting. We can correct this bias by using importance-sampling (IS) weights:
wᵢ = (N · P(i))^(−β)
where N is the number of samples in the memory and P(i) is the sampling probability of transition i.
β is a hyperparameter that controls how much we want to compensate for the importance-sampling bias (0 means not at all, while 1 means fully). In the paper, the authors used β = 0.4 at the beginning of training and linearly increased it to β = 1 by the end of training; they also normalize the weights by 1/maxᵢ wᵢ so that they only scale the update downwards. Again, the optimal value will depend on the task. Now that we have highlighted the essential points, let's start the implementation.
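A minimal sketch of these weights and the β schedule (the helper names importance_weights and anneal_beta are assumptions, not from the original code):

import numpy as np

def importance_weights(probs, buffer_size, beta):
    # w_i = (N * P(i))^(-beta); normalizing by the maximum weight, as in the paper,
    # keeps the weights in (0, 1] so they only scale updates downwards.
    weights = (buffer_size * probs) ** (-beta)
    return weights / weights.max()

def anneal_beta(step, total_steps, beta_start=0.4, beta_end=1.0):
    # Linearly increase beta from 0.4 to 1.0 over the course of training.
    fraction = min(step / total_steps, 1.0)
    return beta_start + fraction * (beta_end - beta_start)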
The sum-tree data structure used here is very similar in spirit to the array representation of a binary heap. However, instead of the usual heap property, the value of a parent node is the sum of its children. Leaf nodes store the transition priorities and the internal nodes are intermediate sums, with the root node containing the sum over all priorities, p_total.
This provides an efficient way of calculating the cumulative sum of priorities, allowing O(log N) updates and sampling. To sample a mini-batch of size k, the range [0, p_total] is divided equally into k ranges; next, a value is uniformly sampled from each range; finally, the transitions that correspond to each of these sampled values are retrieved from the tree. Besides this structure, we add a rolling behaviour: when the buffer is full, writing wraps around to the first element and overwrites it. A good article explaining this structure is Introduction to Sum Tree.
This class implements the Sum Tree data structure. It is a complete binary tree, similar to a priority queue, where the root node contains the sum of all leaf nodes in the tree. Using this data structure reduces the update time to O(log n) and gives the total sum in O(1). The root node has index = 1 and index 0 is left unused, so that the left child of a node is simply 2 × parent_index.
We implement this data structure for use in proportional prioritization in Prioritized Experience Replay from DeepMind.
Attributes:
method:
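Since the full listing is not reproduced here, the following is a minimal sketch of such a sum tree under the conventions above (index 0 unused, root at index 1, leaves holding the priorities, rolling overwrite when the buffer is full); the method names add, update, get and total are assumptions for illustration:

import numpy as np

class SumTree:
    # Minimal sum-tree sketch: leaves hold the priorities, internal nodes hold partial sums.

    def __init__(self, capacity):
        self.capacity = capacity
        # Index 0 is unused so that node i has children at 2*i and 2*i + 1.
        self.tree = np.zeros(2 * capacity)
        self.data = np.empty(capacity, dtype=object)
        self.write = 0      # next leaf to (over)write; rolls back to 0 when full
        self.size = 0

    def total(self):
        # The root holds the sum of all priorities: O(1).
        return self.tree[1]

    def add(self, priority, transition):
        leaf = self.write + self.capacity
        self.data[self.write] = transition
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity   # rolling overwrite
        self.size = min(self.size + 1, self.capacity)

    def update(self, leaf, priority):
        # Set the leaf and propagate the change up to the root: O(log n).
        change = priority - self.tree[leaf]
        idx = leaf
        while idx >= 1:
            self.tree[idx] += change
            idx //= 2

    def get(self, value):
        # Walk down from the root to the leaf whose cumulative range contains `value`.
        idx = 1
        while idx < self.capacity:
            left = 2 * idx
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity]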
Now we will implement the Memory class, which is the replay buffer built on the Sum-Tree data structure. The replay buffer is now a sum tree plus some extra functions, the most important of which is the sample-experiences function. In this function we sample the data from the memory as explained above and also compute the sample weights as in the equation wᵢ = (N · P(i))^(−β).
This class is used to create our replay buffer for use in Prioritized Experience Replay.
Attributes:
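Again only as a sketch, built on the SumTree sketch above and with the α, β and ε defaults taken from the values discussed earlier (the append method name is also an assumption), the Memory class could look roughly like this:

import numpy as np

class Memory:
    # Sketch of a prioritized replay buffer built on top of the SumTree sketch above.

    def __init__(self, capacity, alpha=0.6, beta=0.4, epsilon=1e-6):
        self.tree = SumTree(capacity)
        self.alpha = alpha
        self.beta = beta            # can be annealed towards 1.0 during training
        self.epsilon = epsilon
        self.max_priority = 1.0     # new transitions get the current maximum priority

    def append(self, transition):
        # Store new transitions with maximal priority so each is replayed at least once.
        self.tree.add(self.max_priority ** self.alpha, transition)

    def sample_experiences(self, batch_size):
        batch, indices, priorities = [], [], []
        segment = self.tree.total() / batch_size
        # Stratified sampling: one uniform draw from each of the k equal segments.
        for k in range(batch_size):
            value = np.random.uniform(segment * k, segment * (k + 1))
            leaf, priority, transition = self.tree.get(value)
            batch.append(transition)
            indices.append(leaf)
            priorities.append(priority)
        probs = np.array(priorities) / self.tree.total()
        # w_i = (N * P(i))^(-beta), normalized by the maximum weight.
        weights = (self.tree.size * probs) ** (-self.beta)
        weights /= weights.max()
        return batch, weights.astype(np.float32), indices

    def update(self, indices, losses):
        # The per-sample losses (squared TD errors) plus epsilon become the new priorities.
        for leaf, priority in zip(indices, np.asarray(losses) + self.epsilon):
            self.tree.update(leaf, priority ** self.alpha)
            self.max_priority = max(self.max_priority, priority)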
We can train any kind of agent we created before, such as DQN, DDQN, Dueling-DQN, SARSA, etc. As the authors of the paper report, using the prioritized replay buffer outperforms the same algorithms with uniform sampling. We will change the Double_DQN class that we created in the previous tutorial.
Algorithm:

#----> the memory where we save the observations for training; we now use our Memory object.
self.replay_buffer = Memory(replay_size)
Here we replace the deque with a Memory object.
def _sample_experiences(self):
    #----> instead of using random choice to select transitions, we use the prioritized memory object to sample data.
    batch, weights, indices = self.replay_buffer.sample_experiences(self.batch_size)
    # we group the experiences together, where the buffer holds tuples of the form (state, action, reward, done, next_state)
    states, actions, rewards, dones, next_states = [np.array([experience[field_index] for experience in batch]) for field_index in range(5)]
    #----> we return the weights and indices because we need them for the training step.
    return states, actions, rewards, dones, next_states, weights, indices
The sample-experiences function is only slightly different from the previous one. We use the Memory object's sample-experiences function, which returns not just the transitions but also the weights and the indices of each training example used in the batch. The indices are required to update the priorities after one training iteration.
method:
def _training_step(self):
    #----> get the batch from the replay_buffer (our memory) together with the weights and indices.
    states, actions, rewards, dones, next_states, weights, indices = self._sample_experiences()
    # |
    # | the same as before
    # |
    # start the gradient recording to compute the gradient for our model.
    with tf.GradientTape() as tape:
        # |
        # | the same as before.
        # |
        #----> compute the per-sample losses with the target we computed earlier; these losses will be the new priorities.
        losses = self.loss_function(target_Q_values, Q_values)
        #----> compute the weighted losses, then reduce to the mean so we can compute the gradient.
        loss = tf.reduce_mean(losses * weights)
    # |
    # | the same as before
    #----> another change: we update the priorities of the transitions used in this training step.
    self.replay_buffer.update(indices, losses)

There is an extension to this algorithm suggested by the authors of the paper: instead of pᵢ = lossᵢ, the new priority can be the loss multiplied by the IS weight, so pᵢ = lossᵢ ∗ weightᵢ,
which is then fed into the stochastic sampling probability as before. It is simple to modify the code for this extension, but I will keep the implementation as in the pseudocode of the paper.
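If you do want to try that variant with the sketches above, the only change would be the priority update at the end of the training step, for example:

#----> extension (sketch): weight the new priorities by the IS weights before storing them.
self.replay_buffer.update(indices, losses * weights)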
You can find the full code on GitHub.
I hope you find this article useful. I apologize for any errors or misspellings in the article; please do not hesitate to contact me or drop a comment to correct things.
Khalil Hennara
AI Engineer at MISRAJ