November 12, 2025
Misraj AI
AI
In this article, we will implement Double DQN with proportional prioritization from Schaul et al. (2016), as it outperforms DDQN with uniform sampling of transitions from the replay buffer.
As we showed in the previous lecture, experience replay uses a replay buffer that breaks temporal correlations by mixing more and less recent experience into the updates, so rare experience is used for more than just a single update. That was the approach of the Deep Q-Network (DQN) algorithm (Mnih et al., 2013; 2015), which stabilized the training of a value function, represented by a deep neural network, by using experience replay.
More specifically, experiences are considered important if they are likely to lead to fast learning progress. While this measure is not directly accessible, a reasonable proxy is the magnitude of a transition's TD error δ, which indicates how surprising or unexpected the transition is: specifically, how far the value is from its next-step bootstrap estimate.
This idea is almost the same as boosting, where we build a new classifier for the examples we got wrong and train on those mispredicted examples; but this may lead to over-fitting, because we train on some examples more than others.
As with boosting, this happens with greedy TD-error prioritization. First, to avoid expensive sweeps over the entire replay memory, the TD errors are only updated for the transitions that are replayed; one disadvantage is that transitions with a low TD error on their first visit may not be replayed for a long time. Furthermore, greedy prioritization focuses on a small subset of the experience: errors shrink slowly, especially when using function approximation, so the initially high-error transitions get replayed frequently. This lack of diversity makes the system prone to over-fitting.
To overcome these issues, we introduce a stochastic sampling method that interpolates between pure greedy prioritization and uniform random sampling. We ensure that the probability of being sampled is monotonic in a transition’s priority while guaranteeing a non-zero probability even for the lowest-priority transition. Concretely, we define the probability of sampling transition i as
P(i) = pᵢ^α / ∑ₖ pₖ^α
where pᵢ > 0 is the priority of transition i. The exponent α determines how much prioritization is used, with α = 0 corresponding to the uniform case. So α becomes another hyperparameter we need to tune; for the Atari games the authors of the paper found α = 0.6 to work best, but for other problems we should tune its value.
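As a quick illustration (not from the original code; the priority values below are made up), the sampling probability can be computed from the stored priorities like this:

import numpy as np

# Hypothetical priorities of five stored transitions (illustrative values only).
priorities = np.array([0.1, 2.0, 0.5, 3.0, 0.01])
alpha = 0.6  # the value the paper reports for Atari

scaled = priorities ** alpha
probs = scaled / scaled.sum()  # P(i) = p_i^alpha / sum_k p_k^alpha

# Sample a mini-batch of 3 indices according to these probabilities.
batch_indices = np.random.choice(len(priorities), size=3, p=probs)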
Important Note:
pᵢ > 0 is based on the TD error, which by definition is δⱼ = Rⱼ + γⱼ Q(sⱼ₊₁, argmaxₐ Q(sⱼ₊₁, a, θ⁺), θ⁻) − Q(sⱼ, aⱼ, θ⁺). In practice we will use the error reported by the neural network, which is the squared TD error, because the target of our model is Rⱼ + γⱼ Q(sⱼ₊₁, argmaxₐ Q(sⱼ₊₁, a, θ⁺), θ⁻), i.e. the same expression without the last term, which is the output of the model. We also add a small positive constant ε that prevents the edge case of transitions never being revisited once their error is zero.
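A tiny sketch of that note in code (the constant value and the helper name compute_priorities are assumptions for illustration):

import numpy as np

EPSILON = 1e-6  # small constant so a zero-error transition keeps a non-zero priority

def compute_priorities(losses):
    # `losses` are the per-sample network losses, i.e. the squared TD errors,
    # which we use directly as priorities after adding EPSILON.
    return np.asarray(losses) + EPSILON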
Now that we have prioritized the transitions, a transition with high priority will be sampled more often until its error goes down, and that may lead to over-fitting. We can correct this bias by using importance-sampling (IS) weights:
wᵢ = (N · P(i))^(−β)
where N is the number of samples in the memory and P(i) is the sampling probability of transition i.
β is a hyperparameter that controls how much we want to compensate for the importance-sampling bias (0 means not at all, while 1 means fully). In the paper, the authors used β = 0.4 at the beginning of training and linearly increased it to β = 1 by the end of training; they also normalize the weights by 1/maxᵢ wᵢ so that they only scale the update downwards. Again, the optimal value will depend on the task. Now that we have highlighted the essential points, let's start the implementation.
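A minimal sketch of these weights and the β schedule (the helper names importance_weights and anneal_beta are assumptions, not from the original code):

import numpy as np

def importance_weights(probs, buffer_size, beta):
    # w_i = (N * P(i))^(-beta); normalizing by the maximum weight, as in the paper,
    # keeps the weights in (0, 1] so they only scale updates downwards.
    weights = (buffer_size * probs) ** (-beta)
    return weights / weights.max()

def anneal_beta(step, total_steps, beta_start=0.4, beta_end=1.0):
    # Linearly increase beta from 0.4 to 1.0 over the course of training.
    fraction = min(step / total_steps, 1.0)
    return beta_start + fraction * (beta_end - beta_start)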
The sum-tree data structure used here is very similar in spirit to the array representation of a binary heap. However, instead of the usual heap property, the value of a parent node is the sum of its children. Leaf nodes store the transition priorities and the internal nodes are intermediate sums, with the root node containing the sum over all priorities, p_total.
This provides an efficient way of calculating the cumulative sum of priorities, allowing O(log N) updates and sampling. To sample a mini-batch of size k, the range [0, p_total] is divided equally into k ranges; next, a value is uniformly sampled from each range; finally, the transitions that correspond to each of these sampled values are retrieved from the tree. Besides this structure, we add a rolling behaviour: when the buffer is full, writing wraps around to the first element and overwrites it. A good article explaining this structure is Introduction to Sum Tree.
This class implements the Sum Tree data structure. It is a complete binary tree, similar to a priority queue, where the root node contains the sum of all leaf nodes in the tree. Using this data structure reduces the update time to O(log n) and gives the total sum in O(1). The root node has index = 1 and index 0 is left unused, so that the left child of a node is simply 2 × parent_index.
We implement this data structure for use in proportional prioritization in Prioritized Experience Replay from DeepMind.
Attributes:
method:
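Since the full listing is not reproduced here, the following is a minimal sketch of such a sum tree under the conventions above (index 0 unused, root at index 1, leaves holding the priorities, rolling overwrite when the buffer is full); the method names add, update, get and total are assumptions for illustration:

import numpy as np

class SumTree:
    # Minimal sum-tree sketch: leaves hold the priorities, internal nodes hold partial sums.

    def __init__(self, capacity):
        self.capacity = capacity
        # Index 0 is unused so that node i has children at 2*i and 2*i + 1.
        self.tree = np.zeros(2 * capacity)
        self.data = np.empty(capacity, dtype=object)
        self.write = 0      # next leaf to (over)write; rolls back to 0 when full
        self.size = 0

    def total(self):
        # The root holds the sum of all priorities: O(1).
        return self.tree[1]

    def add(self, priority, transition):
        leaf = self.write + self.capacity
        self.data[self.write] = transition
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity   # rolling overwrite
        self.size = min(self.size + 1, self.capacity)

    def update(self, leaf, priority):
        # Set the leaf and propagate the change up to the root: O(log n).
        change = priority - self.tree[leaf]
        idx = leaf
        while idx >= 1:
            self.tree[idx] += change
            idx //= 2

    def get(self, value):
        # Walk down from the root to the leaf whose cumulative range contains `value`.
        idx = 1
        while idx < self.capacity:
            left = 2 * idx
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity]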
Now we will implement the Memory class, which is the replay buffer built on the Sum-Tree data structure. The replay buffer is now a sum tree plus some extra functions, the most important of which is the sample-experiences function. In this function we sample the data from the memory as explained above and also compute the sample weights as in the equation wᵢ = (N · P(i))^(−β).
This class is used to create our replay buffer for use in Prioritized Experience Replay.
Attributes:
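Again only as a sketch, built on the SumTree sketch above and with the α, β and ε defaults taken from the values discussed earlier (the append method name is also an assumption), the Memory class could look roughly like this:

import numpy as np

class Memory:
    # Sketch of a prioritized replay buffer built on top of the SumTree sketch above.

    def __init__(self, capacity, alpha=0.6, beta=0.4, epsilon=1e-6):
        self.tree = SumTree(capacity)
        self.alpha = alpha
        self.beta = beta            # can be annealed towards 1.0 during training
        self.epsilon = epsilon
        self.max_priority = 1.0     # new transitions get the current maximum priority

    def append(self, transition):
        # Store new transitions with maximal priority so each is replayed at least once.
        self.tree.add(self.max_priority ** self.alpha, transition)

    def sample_experiences(self, batch_size):
        batch, indices, priorities = [], [], []
        segment = self.tree.total() / batch_size
        # Stratified sampling: one uniform draw from each of the k equal segments.
        for k in range(batch_size):
            value = np.random.uniform(segment * k, segment * (k + 1))
            leaf, priority, transition = self.tree.get(value)
            batch.append(transition)
            indices.append(leaf)
            priorities.append(priority)
        probs = np.array(priorities) / self.tree.total()
        # w_i = (N * P(i))^(-beta), normalized by the maximum weight.
        weights = (self.tree.size * probs) ** (-self.beta)
        weights /= weights.max()
        return batch, weights.astype(np.float32), indices

    def update(self, indices, losses):
        # The per-sample losses (squared TD errors) plus epsilon become the new priorities.
        for leaf, priority in zip(indices, np.asarray(losses) + self.epsilon):
            self.tree.update(leaf, priority ** self.alpha)
            self.max_priority = max(self.max_priority, priority)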
We can train any kind of agent we created before, such as DQN, DDQN, Dueling-DQN, SARSA, etc. As the authors of the paper report, using the prioritized replay buffer outperforms the same algorithms with uniform sampling. We will change the Double_DQN class that we created in the previous tutorial.
Algorithm:

#----> the memory where we save the observations for training; we now use our Memory object.
self.replay_buffer = Memory(replay_size)
Here we replace the deque with a Memory object.
def _sample_experiences(self):
    #----> instead of using random choice to select transitions, we use the prioritized memory object to sample data.
    batch, weights, indices = self.replay_buffer.sample_experiences(self.batch_size)
    # we group the experiences together, where the buffer holds tuples of the form (state, action, reward, done, next_state)
    states, actions, rewards, dones, next_states = [np.array([experience[field_index] for experience in batch]) for field_index in range(5)]
    #----> we return the weights and indices because we need them for the training step.
    return states, actions, rewards, dones, next_states, weights, indices
The sample-experiences function is only slightly different from the previous one. We use the Memory object's sample-experiences function, which returns not just the transitions but also the weights and the indices of each training example used in the batch. The indices are required to update the priorities after one training iteration.
method:
def _training_step(self):
    #----> get the batch from the replay_buffer (our memory) together with the weights and indices.
    states, actions, rewards, dones, next_states, weights, indices = self._sample_experiences()
    # |
    # | the same as before
    # |
    # start the gradient recording to compute the gradient for our model.
    with tf.GradientTape() as tape:
        # |
        # | the same as before.
        # |
        #----> compute the per-sample losses with the target we computed earlier; these losses will be the new priorities.
        losses = self.loss_function(target_Q_values, Q_values)
        #----> compute the weighted losses, then reduce to the mean so we can compute the gradient.
        loss = tf.reduce_mean(losses * weights)
    # |
    # | the same as before
    #----> another change: we update the priorities of the transitions used in this training step.
    self.replay_buffer.update(indices, losses)

There is an extension to this algorithm suggested by the authors of the paper: instead of pᵢ = lossᵢ, the new priority can be the loss multiplied by the IS weight, so pᵢ = lossᵢ ∗ weightᵢ,
which is then fed into the stochastic sampling probability as before. It is simple to modify the code for this extension, but I will keep the implementation as in the pseudocode of the paper.
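If you do want to try that variant with the sketches above, the only change would be the priority update at the end of the training step, for example:

#----> extension (sketch): weight the new priorities by the IS weights before storing them.
self.replay_buffer.update(indices, losses * weights)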
You can find the full code on GitHub.
I hope you find this article useful. I apologize for any errors or misspellings in the article; please do not hesitate to contact me or drop a comment to correct things.
Khalil Hennara
AI Engineer at MISRAJ