December 8, 2025
Misraj AI
AI
The core idea of the distributional Bellman approach is to ask the following question: if we can model the distribution of the total future rewards, why restrict ourselves to its expected value (i.e. 𝑄(𝑠,𝑎))? There are several benefits to learning an approximate distribution rather than its approximate expectation.
Consider a commuter who drives to work every morning. We want to model the total commute time of the trip. Under normal circumstances, the traffic is clear and the trip takes around 30 minutes. However, traffic accidents do occur once in a while (e.g. car crashes, breakdowns in the middle of the highway, etc.), and when that happens, traffic usually comes to a standstill and an hour is added to the trip (i.e. 90 minutes). Let's say an accident like that happens once every 5 days on average. If we use the expected value to model the commute time of the trip, the expected commute time will be 30 + 60 / 5 = 42 minutes. However, such an expected figure is not very meaningful, since it vastly overestimates the commute time most of the time, and vastly underestimates it when traffic accidents do occur. If instead we treat the total commute time as a random variable and model its distribution, it should look like this:
Notice that the distribution of commute time is bimodal. Most of the time the trip takes around 30 minutes; however, if a traffic accident occurs, the commute takes around 90 minutes. Given the full picture of the distribution, next time we head out to work we'd better look up the traffic situation on the highway. If a traffic accident is reported, we can choose to bike to work instead, which would take around 60 minutes and save us 30 minutes!
In reinforcement learning, we use the Bellman equation to approximate the expected value of future rewards. As illustrated in the commute time example above, if the environment is stochastic in nature (occurrence of traffic accidents) and the future rewards follow a multimodal distribution (the bimodally distributed commute time), choosing actions based on expected value may lead to a suboptimal outcome. In the example above, if we realize there is a traffic accident and driving will likely take 90 minutes, the optimal action would be to bike, even though the expected commute time of biking (60 minutes) is larger than the expected commute time of driving (42 minutes).

Another obvious benefit of modeling the distribution instead of the expected value is that sometimes, even though the expected future rewards of two actions are identical, their variances might be very different. If we are risk averse, it would be preferable to choose the action with the smaller variance. Using the commute time example again, suppose we have two actions to choose from, driving or taking the train, and both have the same expected commute time (i.e. 42 minutes), but taking the train has a smaller variance since it is not affected by unexpected traffic conditions. Most people would prefer taking the train over driving.
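To make this concrete, here is a small sketch that computes the mean and variance of the two commute-time distributions (the train-time distribution used here is a made-up assumption for illustration):

import numpy as np

# driving: 30 minutes with probability 0.8, 90 minutes with probability 0.2 (an accident every 5 days)
drive_times = np.array([30.0, 90.0])
drive_probs = np.array([0.8, 0.2])
# train: assumed to take 40 or 44 minutes with equal probability (hypothetical numbers)
train_times = np.array([40.0, 44.0])
train_probs = np.array([0.5, 0.5])

def mean_and_var(values, probs):
    mean = np.sum(probs * values)
    var = np.sum(probs * (values - mean) ** 2)
    return mean, var

print(mean_and_var(drive_times, drive_probs))  # mean 42.0, variance 576.0
print(mean_and_var(train_times, train_probs))  # mean 42.0, variance 4.0

Both actions have the same expected commute time, but the variance of driving is orders of magnitude larger, which is exactly the information the expectation alone hides.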
Bellemare et al. (2017) proposed to learn the value distribution (the probability distribution of the returns) through a modification of the Bellman equation. They show that learning the complete distribution of rewards instead of their mean leads to performance improvements on Atari games over modern variants of DQN.
Their proposed categorical DQN (also called C51) has an architecture based on DQN, but the output layer predicts the distribution of the returns for each action 𝑎 in state 𝑠 instead of its mean 𝑄𝜋(𝑠,𝑎). In practice, each action 𝑎 is represented by 𝑁 output neurons, which encode the support of the distribution of returns. If the returns take values between 𝑉𝑚𝑖𝑛 and 𝑉𝑚𝑎𝑥, one can represent their distribution 𝑍 by taking 𝑁 discrete “bins” (called atoms in the paper) in that range. Fig. 2 shows how the distribution of returns between -10 and 10 can be represented using 21 atoms.

Of course, the main problem is to know in advance the range of returns [𝑉𝑚𝑖𝑛, 𝑉𝑚𝑎𝑥] (it depends largely on the choice of the discount rate 𝛾), but you can infer it from training another algorithm such as DQN beforehand. Dabney et al. (2017) got rid of this problem with quantile regression. In the C51 paper, the authors found experimentally that 51 is the most efficient number of atoms (hence the name C51).
Let's denote these atoms by 𝑧ᵢ with 1 ≤ 𝑖 ≤ 𝑁. The probability that the return associated with a state-action pair (𝑠,𝑎) lies within the bin associated with the atom 𝑧ᵢ is noted 𝑝ᵢ(𝑠,𝑎). These probabilities can be predicted by a neural network, typically by using a softmax function over outputs 𝜃ᵢ(𝑠,𝑎); we will come back to the exact formulas after setting up the implementation.
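Given 𝑉𝑚𝑖𝑛, 𝑉𝑚𝑎𝑥 and the number of atoms 𝑁, the atoms are simply evenly spaced points over the return range, which is what the DELTA_Z and Z variables compute in the code below:

z_i = V_{min} + (i - 1)\,\Delta z, \qquad \Delta z = \frac{V_{max} - V_{min}}{N - 1}, \qquad i = 1, \dots, N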
import numpy as np

# the size of a training batch
BATCH_SIZE = 32
# the size of the input frame tensor to the model
WIDTH = 84
HEIGHT = 84
CHANNELS = 4
# the number of supports (atoms) in the discrete distribution
ATOMS = 51
# the range that the support values are derived from
VMIN = -10
VMAX = 10
# the number of actions
ACTIONS = 4  # env.action_space.n
# the step size between supports
DELTA_Z = (VMAX - VMIN) / (ATOMS - 1)
# the distribution support (the atoms z_i)
Z = np.linspace(VMIN, VMAX, ATOMS, dtype=np.float32)
The model has the same structure as Mnih et al. (2015), but we apply the softmax function on the output layer to compute the probability mass function of the distribution.
The output has shape (BATCH_SIZE, ACTIONS * ATOMS), since we are trying to find a distribution over the returns for each action.
We reshape the output to (BATCH_SIZE, ACTIONS, ATOMS) so we can apply the softmax over the last dimension. We create a custom model in TensorFlow; the code is quite simple.
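The author's original model code does not appear in this text, so here is a minimal sketch of what such a network could look like, assuming the Keras API and the constants defined above (the convolutional trunk follows Mnih et al. 2015, and the class name C51Model is just a placeholder):

import tensorflow as tf

class C51Model(tf.keras.Model):
    def __init__(self, actions=ACTIONS, atoms=ATOMS):
        super().__init__()
        self.actions = actions
        self.atoms = atoms
        # convolutional trunk as in Mnih et al. 2015
        self.conv1 = tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu")
        self.conv2 = tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu")
        self.conv3 = tf.keras.layers.Conv2D(64, 3, strides=1, activation="relu")
        self.flatten = tf.keras.layers.Flatten()
        self.dense = tf.keras.layers.Dense(512, activation="relu")
        # one logit per (action, atom) pair
        self.logits = tf.keras.layers.Dense(actions * atoms)

    def call(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.flatten(x)
        x = self.dense(x)
        x = self.logits(x)
        # reshape to (batch, actions, atoms) and normalize each action's distribution
        x = tf.reshape(x, (-1, self.actions, self.atoms))
        return tf.nn.softmax(x, axis=-1)

model = C51Model()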
We will create a toy batch for testing; it is good practice to do this while you develop a complicated model.
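The original snippet is not shown here either; a toy batch of random transitions could be built along these lines (the variable names states, actions, rewards, dones and next_states are assumptions, reused in the sketches below):

# random states and next states with the input shape expected by the model
states = np.random.rand(BATCH_SIZE, WIDTH, HEIGHT, CHANNELS).astype(np.float32)
next_states = np.random.rand(BATCH_SIZE, WIDTH, HEIGHT, CHANNELS).astype(np.float32)
# random actions, rewards and terminal flags
actions = np.random.randint(0, ACTIONS, size=BATCH_SIZE)
rewards = np.random.randn(BATCH_SIZE).astype(np.float32)
dones = np.random.randint(0, 2, size=BATCH_SIZE).astype(np.float32)

# sanity check: the model outputs one probability distribution per action
probs = model(states)
print(probs.shape)  # (32, 4, 51)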
Before writing the training step, let's go back to the math behind C51. The probability of each atom is obtained from the network outputs 𝜃ᵢ(𝑠,𝑎) through a softmax:

p_i(s, a) = \frac{\exp \theta_i(s, a)}{\sum_{j=1}^{N} \exp \theta_j(s, a)}
The distribution of the returns 𝑍 is simply a sum over the atoms, weighted by these probabilities (each atom being represented by a Dirac distribution 𝛿𝑧ᵢ):

Z_\theta(s, a) = \sum_{i=1}^{N} p_i(s, a) \, \delta_{z_i}
If these probabilities are correctly estimated, the Q-value is easy to compute as the mean of the distribution:

Q_\theta(s, a) = \mathbb{E}[Z_\theta(s, a)] = \sum_{i=1}^{N} p_i(s, a) \, z_i
These Q-values can then be used for action selection as in regular DQN. The problem now is to learn the value distribution 𝑍θ, i.e. to find a learning rule / loss function for the 𝑝ᵢ(𝑠,𝑎;θ). Let's consider a single transition (s, a, r, s′) and select the greedy action a′ in s′ using the current policy 𝜋θ. The value distribution 𝑍θ can be evaluated by applying the Bellman operator 𝑇 recursively:

T Z_\theta(s, a) = R(s, a) + \gamma \, Z_\theta(s', a')
where 𝑅(𝑠,𝑎) is the distribution of immediate rewards obtained after taking action a in state s.
This use of the Bellman operator is the same as in 𝑄-learning:
T Q_\theta(s, a) = \mathbb{E}[r(s, a)] + \gamma \, Q_\theta(s', a')
In Q-learning, one minimizes the difference (the mean squared error) between 𝑇𝑄θ(𝑠,𝑎) and 𝑄θ(𝑠,𝑎), which are expectations (so we only manipulate scalars). Here, we will minimize a statistical distance between the distributions 𝑇𝑍θ(𝑠,𝑎) and 𝑍θ(𝑠,𝑎) themselves, using for example the KL divergence, the Wasserstein metric, or the total variation distance.

The problem is mostly that the distributions 𝑇𝑍θ(𝑠,𝑎) and 𝑍θ(𝑠,𝑎) do not have the same support: the atoms of 𝑇𝑍θ(𝑠,𝑎) are shifted and scaled to 𝑟+𝛾𝑧ᵢ, which generally fall between the fixed atoms 𝑧ᵢ of 𝑍θ(𝑠,𝑎). Besides, the probabilities must sum to 1, so one cannot update the atom probabilities independently from one another. The solution used in C51 is to project 𝑇𝑍θ(𝑠,𝑎) back onto the support of 𝑍θ(𝑠,𝑎) and then minimize the cross-entropy (equivalently, the KL divergence) between the projected target and the predicted distribution.
There is a little change in the epsilon-greedy function, because the model no longer outputs the action-value function directly; the output is now a distribution over returns for each action. To select the greedy action, we compute the mean of each action's distribution and then choose the best action according to that mean.
The output for one example has shape (ACTIONS, ATOMS), so we multiply the probabilities by the support Z, take the sum with tf.reduce_sum over the last axis to obtain the Q-values, and then take the argmax over the result.
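The helper itself is not shown in the article; a rough sketch under the setup above (the name epsilon_greedy is an assumption) could look like this:

def epsilon_greedy(state, epsilon=0.05):
    # with probability epsilon, pick a random action (exploration)
    if np.random.rand() < epsilon:
        return np.random.randint(ACTIONS)
    # probs has shape (1, ACTIONS, ATOMS)
    probs = model(state[None, ...])
    # Q-values are the means of the return distributions, shape (1, ACTIONS)
    q_values = tf.reduce_sum(probs * Z, axis=-1)
    return int(tf.argmax(q_values, axis=-1)[0])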
The training step follows the Categorical algorithm from Bellemare et al. (2017): it builds the target action-value distribution described above by projecting it onto the fixed support, then updates the predicted distribution towards it. Please read the comments for a better understanding.
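The original function is not included in this text; the following is a simplified sketch of such a training step under the toy setup above (the signature training_step(gamma) is inferred from the call below), implementing the categorical projection from the paper:

optimizer = tf.keras.optimizers.Adam(1e-4)

def training_step(gamma):
    # greedy next action from the mean of the predicted next-state distributions
    next_probs = model(next_states)                                  # (batch, ACTIONS, ATOMS)
    next_q = tf.reduce_sum(next_probs * Z, axis=-1)                  # (batch, ACTIONS)
    next_actions = tf.argmax(next_q, axis=-1)
    next_dist = tf.gather(next_probs, next_actions, batch_dims=1).numpy()  # (batch, ATOMS)

    # apply the Bellman update to every atom and clip to [VMIN, VMAX]
    tz = rewards[:, None] + gamma * (1.0 - dones[:, None]) * Z[None, :]
    tz = np.clip(tz, VMIN, VMAX)

    # project each target atom onto its two neighbouring atoms of the fixed support
    b = (tz - VMIN) / DELTA_Z
    lower = np.floor(b).astype(np.int64)
    upper = np.ceil(b).astype(np.int64)
    target = np.zeros((BATCH_SIZE, ATOMS), dtype=np.float32)
    for i in range(BATCH_SIZE):
        for j in range(ATOMS):
            l, u = lower[i, j], upper[i, j]
            if l == u:
                target[i, l] += next_dist[i, j]
            else:
                target[i, l] += next_dist[i, j] * (u - b[i, j])
                target[i, u] += next_dist[i, j] * (b[i, j] - l)

    # cross-entropy between the projected target and the predicted distribution
    with tf.GradientTape() as tape:
        probs = model(states)                                         # (batch, ACTIONS, ATOMS)
        pred = tf.gather(probs, actions, batch_dims=1)                # (batch, ATOMS)
        loss = -tf.reduce_mean(tf.reduce_sum(tf.constant(target) * tf.math.log(pred + 1e-8), axis=-1))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return target, pred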
Now we will use our training step function on the toy dataset to make sure there are no bugs in our implementation.
# run one training step on the toy batch
target, pred = None, None
target, pred = training_step(0.9)
The figure above shows the target and predicted distributions of the 'move left' action for a single example. If we train the model on the same batch for long enough, the two distributions should become identical if everything goes right; let's train for 200 epochs.
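A minimal way to run that experiment with the sketches above:

# overfit on the same toy batch for 200 iterations
for epoch in range(200):
    target, pred = training_step(0.9)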

After 200 epochs you can't distinguish between the two distributions, and that looks good, doesn't it?

I hope you find this article useful, and I am very sorry if there are any errors or misspellings in it. Please do not hesitate to contact me, or drop a comment to correct things.

Khalil Hennara
AI Engineer at MISRAJ