December 8, 2025
Misraj AI
AI
The core idea of the distributional Bellman approach is to ask the following question: if we can model the distribution of the total future rewards, why restrict ourselves to its expected value (i.e. 𝑄(𝑠,𝑎))? There are several benefits to learning an approximate distribution rather than its approximate expectation.
Consider a commuter who drives to work every morning. We want to model the total commute time of the trip. Under normal circumstances, the traffic is clear and the trip takes around 30 minutes. However, traffic accidents do occur once in a while (e.g. car crashes, breakdowns in the middle of the highway, etc.), and when that happens, traffic usually comes to a standstill and an hour is added to the trip (i.e. 90 minutes). Let's say an accident like that happens once every 5 days on average. If we use the expected value to model the commute time of the trip, the expected commute time will be 30 + 60 / 5 = 42 minutes. However, such an expected figure is not very meaningful, since it vastly overestimates the commute time most of the time, and vastly underestimates it when traffic accidents do occur. If instead we treat the total commute time as a random variable and model its distribution, it should look like this:
Notice that the distribution of commute time is bimodal. Most of the time the trip takes around 30 minutes; however, if a traffic accident occurs, the commute takes around 90 minutes. Given the full picture of the distribution, next time we head out to work we'd better look up the traffic situation on the highway. If a traffic accident is reported, we can choose to bike to work instead, which would take around 60 minutes and save us 30 minutes!
In reinforcement learning, we use the Bellman equation to approximate the expected value of future rewards. As illustrated in the commute time example above, if the environment is stochastic in nature (occurrence of traffic accidents) and the future rewards follow a multimodal distribution (the bimodally distributed commute time), choosing actions based on expected value may lead to a suboptimal outcome. In the example above, if we realize there is a traffic accident and driving will likely take 90 minutes, the optimal action would be to bike, even though the expected commute time of biking (60 minutes) is larger than the expected commute time of driving (42 minutes).

Another obvious benefit of modeling the distribution instead of the expected value is that sometimes, even though the expected future rewards of two actions are identical, their variances might be very different. If we are risk averse, it would be preferable to choose the action with the smaller variance. Using the commute time example again, suppose we have two actions to choose from, driving or taking the train, and both have the same expected commute time (i.e. 42 minutes), but taking the train has a smaller variance since it is not affected by unexpected traffic conditions. Most people would prefer taking the train over driving.
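To make this concrete, here is a small sketch that computes the mean and variance of the two commute-time distributions (the train-time distribution used here is a made-up assumption for illustration):

import numpy as np

# driving: 30 minutes with probability 0.8, 90 minutes with probability 0.2 (an accident every 5 days)
drive_times = np.array([30.0, 90.0])
drive_probs = np.array([0.8, 0.2])
# train: assumed to take 40 or 44 minutes with equal probability (hypothetical numbers)
train_times = np.array([40.0, 44.0])
train_probs = np.array([0.5, 0.5])

def mean_and_var(values, probs):
    mean = np.sum(probs * values)
    var = np.sum(probs * (values - mean) ** 2)
    return mean, var

print(mean_and_var(drive_times, drive_probs))  # mean 42.0, variance 576.0
print(mean_and_var(train_times, train_probs))  # mean 42.0, variance 4.0

Both actions have the same expected commute time, but the variance of driving is orders of magnitude larger, which is exactly the information the expectation alone hides.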
Bellemare et al. (2017) proposed to learn the value distribution (the probability distribution of the returns) through a modification of the Bellman equation. They show that learning the complete distribution of rewards instead of their mean leads to performance improvements on Atari games over modern variants of DQN.
Their proposed categorical DQN (also called C51) has an architecture based on DQN, but the output layer predicts the distribution of the returns for each action 𝑎 in state 𝑠 instead of its mean 𝑄𝜋(𝑠,𝑎). In practice, each action 𝑎 is represented by 𝑁 output neurons, which encode the support of the distribution of returns. If the returns take values between 𝑉𝑚𝑖𝑛 and 𝑉𝑚𝑎𝑥, one can represent their distribution 𝑍 by taking 𝑁 discrete “bins” (called atoms in the paper) in that range. Fig. 2 shows how the distribution of returns between -10 and 10 can be represented using 21 atoms.

Of course, the main problem is to know in advance the range of returns [𝑉𝑚𝑖𝑛, 𝑉𝑚𝑎𝑥] (it depends largely on the choice of the discount rate 𝛾), but you can infer it from training another algorithm such as DQN beforehand. Dabney et al. (2017) got rid of this problem with quantile regression. In the C51 paper, the authors found experimentally that 51 is the most efficient number of atoms (hence the name C51).
Let's denote these atoms by 𝑧ᵢ with 1 ≤ 𝑖 ≤ 𝑁. The probability that the return associated with a state-action pair (𝑠,𝑎) lies within the bin associated with the atom 𝑧ᵢ is noted 𝑝ᵢ(𝑠,𝑎). These probabilities can be predicted by a neural network, typically by using a softmax function over outputs 𝜃ᵢ(𝑠,𝑎); we will come back to the exact formulas after setting up the implementation.
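Given 𝑉𝑚𝑖𝑛, 𝑉𝑚𝑎𝑥 and the number of atoms 𝑁, the atoms are simply evenly spaced points over the return range, which is what the DELTA_Z and Z variables compute in the code below:

z_i = V_{min} + (i - 1)\,\Delta z, \qquad \Delta z = \frac{V_{max} - V_{min}}{N - 1}, \qquad i = 1, \dots, N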
import numpy as np

# the size of a training batch
BATCH_SIZE = 32
# the size of the input frame tensor to the model
WIDTH = 84
HEIGHT = 84
CHANNELS = 4
# the number of supports (atoms) in the discrete distribution
ATOMS = 51
# the range that the support values are derived from
VMIN = -10
VMAX = 10
# the number of actions
ACTIONS = 4  # env.action_space.n
# the step size between supports
DELTA_Z = (VMAX - VMIN) / (ATOMS - 1)
# the distribution support (the atoms z_i)
Z = np.linspace(VMIN, VMAX, ATOMS, dtype=np.float32)
The model has the same structure as Mnih et al. (2015), but we apply the softmax function on the output layer to compute the probability mass function of the distribution.
The output has shape (BATCH_SIZE, ACTIONS * ATOMS), since we are trying to find a distribution over the returns for each action.
We reshape the output to (BATCH_SIZE, ACTIONS, ATOMS) so we can apply the softmax over the last dimension. We create a custom model in TensorFlow; the code is quite simple.
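The author's original model code does not appear in this text, so here is a minimal sketch of what such a network could look like, assuming the Keras API and the constants defined above (the convolutional trunk follows Mnih et al. 2015, and the class name C51Model is just a placeholder):

import tensorflow as tf

class C51Model(tf.keras.Model):
    def __init__(self, actions=ACTIONS, atoms=ATOMS):
        super().__init__()
        self.actions = actions
        self.atoms = atoms
        # convolutional trunk as in Mnih et al. 2015
        self.conv1 = tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu")
        self.conv2 = tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu")
        self.conv3 = tf.keras.layers.Conv2D(64, 3, strides=1, activation="relu")
        self.flatten = tf.keras.layers.Flatten()
        self.dense = tf.keras.layers.Dense(512, activation="relu")
        # one logit per (action, atom) pair
        self.logits = tf.keras.layers.Dense(actions * atoms)

    def call(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.flatten(x)
        x = self.dense(x)
        x = self.logits(x)
        # reshape to (batch, actions, atoms) and normalize each action's distribution
        x = tf.reshape(x, (-1, self.actions, self.atoms))
        return tf.nn.softmax(x, axis=-1)

model = C51Model()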
We will create a toy batch for testing; it is good practice to do this while you develop a complicated model.
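The original snippet is not shown here either; a toy batch of random transitions could be built along these lines (the variable names states, actions, rewards, dones and next_states are assumptions, reused in the sketches below):

# random states and next states with the input shape expected by the model
states = np.random.rand(BATCH_SIZE, WIDTH, HEIGHT, CHANNELS).astype(np.float32)
next_states = np.random.rand(BATCH_SIZE, WIDTH, HEIGHT, CHANNELS).astype(np.float32)
# random actions, rewards and terminal flags
actions = np.random.randint(0, ACTIONS, size=BATCH_SIZE)
rewards = np.random.randn(BATCH_SIZE).astype(np.float32)
dones = np.random.randint(0, 2, size=BATCH_SIZE).astype(np.float32)

# sanity check: the model outputs one probability distribution per action
probs = model(states)
print(probs.shape)  # (32, 4, 51)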
Before writing the training step, let's go back to the math behind C51. The probability of each atom is obtained from the network outputs 𝜃ᵢ(𝑠,𝑎) through a softmax:

p_i(s, a) = \frac{\exp \theta_i(s, a)}{\sum_{j=1}^{N} \exp \theta_j(s, a)}
The distribution of the returns 𝑍 is simply a sum over the atoms, weighted by these probabilities (each atom being represented by a Dirac distribution 𝛿𝑧ᵢ):

Z_\theta(s, a) = \sum_{i=1}^{N} p_i(s, a) \, \delta_{z_i}
If these probabilities are correctly estimated, the Q-value is easy to compute as the mean of the distribution:

Q_\theta(s, a) = \mathbb{E}[Z_\theta(s, a)] = \sum_{i=1}^{N} p_i(s, a) \, z_i
These Q-values can then be used for action selection as in regular DQN. The problem now is to learn the value distribution 𝑍θ, i.e. to find a learning rule / loss function for the 𝑝ᵢ(𝑠,𝑎;θ). Let's consider a single transition (s, a, r, s′) and select the greedy action a′ in s′ using the current policy 𝜋θ. The value distribution 𝑍θ can be evaluated by applying the Bellman operator 𝑇 recursively:

T Z_\theta(s, a) = R(s, a) + \gamma \, Z_\theta(s', a')
where 𝑅(𝑠,𝑎) is the distribution of immediate rewards obtained after taking action a in state s.
This use of the Bellman operator is the same as in 𝑄-learning:
T Q_\theta(s, a) = \mathbb{E}[r(s, a)] + \gamma \, Q_\theta(s', a')
In Q-learning, one minimizes the difference (the mean squared error) between 𝑇𝑄θ(𝑠,𝑎) and 𝑄θ(𝑠,𝑎), which are expectations (so we only manipulate scalars). Here, we will minimize a statistical distance between the distributions 𝑇𝑍θ(𝑠,𝑎) and 𝑍θ(𝑠,𝑎) themselves, using for example the KL divergence, the Wasserstein metric, or the total variation distance.

The problem is mostly that the distributions 𝑇𝑍θ(𝑠,𝑎) and 𝑍θ(𝑠,𝑎) do not have the same support: the atoms of 𝑇𝑍θ(𝑠,𝑎) are shifted and scaled to 𝑟+𝛾𝑧ᵢ, which generally fall between the fixed atoms 𝑧ᵢ of 𝑍θ(𝑠,𝑎). Besides, the probabilities must sum to 1, so one cannot update the atom probabilities independently from one another. The solution used in C51 is to project 𝑇𝑍θ(𝑠,𝑎) back onto the support of 𝑍θ(𝑠,𝑎) and then minimize the cross-entropy (equivalently, the KL divergence) between the projected target and the predicted distribution.
There is a little change in the epsilon-greedy function, because the model no longer outputs the action-value function directly; the output is now a distribution over returns for each action. To select the greedy action, we compute the mean of each action's distribution and then choose the best action according to that mean.
The output for one example has shape (ACTIONS, ATOMS), so we multiply the probabilities by the support Z, take the sum with tf.reduce_sum over the last axis to obtain the Q-values, and then take the argmax over the result.
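The helper itself is not shown in the article; a rough sketch under the setup above (the name epsilon_greedy is an assumption) could look like this:

def epsilon_greedy(state, epsilon=0.05):
    # with probability epsilon, pick a random action (exploration)
    if np.random.rand() < epsilon:
        return np.random.randint(ACTIONS)
    # probs has shape (1, ACTIONS, ATOMS)
    probs = model(state[None, ...])
    # Q-values are the means of the return distributions, shape (1, ACTIONS)
    q_values = tf.reduce_sum(probs * Z, axis=-1)
    return int(tf.argmax(q_values, axis=-1)[0])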
The training step follows the Categorical algorithm from Bellemare et al. (2017): it builds the target action-value distribution described above by projecting it onto the fixed support, then updates the predicted distribution towards it. Please read the comments for a better understanding.
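The original function is not included in this text; the following is a simplified sketch of such a training step under the toy setup above (the signature training_step(gamma) is inferred from the call below), implementing the categorical projection from the paper:

optimizer = tf.keras.optimizers.Adam(1e-4)

def training_step(gamma):
    # greedy next action from the mean of the predicted next-state distributions
    next_probs = model(next_states)                                  # (batch, ACTIONS, ATOMS)
    next_q = tf.reduce_sum(next_probs * Z, axis=-1)                  # (batch, ACTIONS)
    next_actions = tf.argmax(next_q, axis=-1)
    next_dist = tf.gather(next_probs, next_actions, batch_dims=1).numpy()  # (batch, ATOMS)

    # apply the Bellman update to every atom and clip to [VMIN, VMAX]
    tz = rewards[:, None] + gamma * (1.0 - dones[:, None]) * Z[None, :]
    tz = np.clip(tz, VMIN, VMAX)

    # project each target atom onto its two neighbouring atoms of the fixed support
    b = (tz - VMIN) / DELTA_Z
    lower = np.floor(b).astype(np.int64)
    upper = np.ceil(b).astype(np.int64)
    target = np.zeros((BATCH_SIZE, ATOMS), dtype=np.float32)
    for i in range(BATCH_SIZE):
        for j in range(ATOMS):
            l, u = lower[i, j], upper[i, j]
            if l == u:
                target[i, l] += next_dist[i, j]
            else:
                target[i, l] += next_dist[i, j] * (u - b[i, j])
                target[i, u] += next_dist[i, j] * (b[i, j] - l)

    # cross-entropy between the projected target and the predicted distribution
    with tf.GradientTape() as tape:
        probs = model(states)                                         # (batch, ACTIONS, ATOMS)
        pred = tf.gather(probs, actions, batch_dims=1)                # (batch, ATOMS)
        loss = -tf.reduce_mean(tf.reduce_sum(tf.constant(target) * tf.math.log(pred + 1e-8), axis=-1))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return target, pred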
Now we will use our training step function on the toy dataset to make sure there are no bugs in our implementation.
# run one training step on the toy batch
target, pred = None, None
target, pred = training_step(0.9)
The figure above shows the target and predicted distributions of the 'move left' action for a single example. If we train the model on the same batch for long enough, the two distributions should become identical if everything goes right; let's train for 200 epochs.
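A minimal way to run that experiment with the sketches above:

# overfit on the same toy batch for 200 iterations
for epoch in range(200):
    target, pred = training_step(0.9)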

After 200 epochs you can't distinguish between the two distributions, and that looks good, doesn't it?

I hope you find this article useful, and I am very sorry if there are any errors or misspellings in it. Please do not hesitate to contact me, or drop a comment to correct things.

Khalil Hennara
AI Engineer at MISRAJ