November 12, 2025
Misraj AI
In previous articles, we learned a lot about value-function-based methods. We defined the state-value function as the expected value of the return 𝐺ₜ, where the return is the discounted sum of rewards from time step 𝑡 until the end of the episode, if we are dealing with an episodic task.
Gₜ= Rₜ+ γRₜ₊₁ + γ²Rₜ₊₂ + · · · (1)
V(s) = E[Gₜ | sₜ = s] = E[Rₜ + γRₜ₊₁ + γ²Rₜ₊₂ + · · · | sₜ = s] … (2)
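To make equation (1) concrete, here is a minimal sketch in plain Python (the reward list is made up for the example) that computes the discounted return of one finished episode:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute the return G_0 of equation (1) for a finished episode.

    rewards : list [R_0, R_1, ..., R_T] collected during one episode.
    gamma   : discount factor in [0, 1].
    """
    g = 0.0
    # Walk backwards so each step needs one multiplication: G_t = R_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g


print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```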
By definition, 𝑉𝜋(𝑠) measures how good state 𝑠 is when following policy 𝜋. But let's go back to the first article, where we said that our purpose in RL is to teach the agent the best action in a specific state 𝑠, that is, to learn the policy 𝜋. In the previous lecture we computed the 𝑄(𝑠,𝑎) function, which gives us a notion of how good each action is in the current state 𝑠, and we pick the action according to the 𝜖-greedy policy, as sketched below.
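As a quick reminder of how 𝜖-greedy action selection works on top of Q(s,a), here is a minimal sketch (numpy, with a dummy Q-table; the names and sizes are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one.

    Q : array of shape (n_states, n_actions), a toy Q-table.
    s : index of the current state.
    """
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore
    return int(np.argmax(Q[s]))                # exploit

Q = rng.normal(size=(5, 3))                    # dummy values for the demo
print(epsilon_greedy(Q, s=2))
```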
The most important question here is: instead of learning the state-value function (the same argument holds for the action-value function), can we compute the policy directly, which is by definition a function mapping from a state 𝑠 to some action 𝑎?
π(a|s) = P (aₜ= a|sₜ= s) …(3)
In equation (3), we define a stochastic policy that gives us a probability distribution over the action set 𝐴 for the current state 𝑠. We explained the differences between stochastic and deterministic policies before; you can read more about them in this article.
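As a tiny illustration of equation (3), here is a sketch of sampling an action from a stochastic policy over a small discrete action set (the probabilities are hand-written, not produced by a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy stochastic policy for one state s: pi(a|s) over 3 discrete actions.
pi_given_s = np.array([0.7, 0.2, 0.1])

# Sample an action according to pi(a|s); repeated calls follow this distribution.
action = rng.choice(len(pi_given_s), p=pi_given_s)
print(action)
```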
In the following article, I explain what VFA is, and we learn about many algorithms that use VFA, such as Q-learning, C51, Rainbow, etc.
In this article, we will learn about the policy gradient method and its advantages and disadvantages, so let’s get started.
Let's start with a motivating example. Imagine that we want to build an agent that can drive a car (a self-driving car). There are infinitely many actions for controlling the steering wheel (infinitely many angles to choose from), so how can we use the Q-value function in a continuous action space? It is much better to produce the action directly from the observed state, but this leads to another problem with deterministic policies, which we will talk about later. For the curious mind, here is a hint about the solution: instead of learning a single action for the current state, we compute the mean and the standard deviation of an action distribution, and then we can sample actions from that distribution, as in the sketch below.
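To make that hint concrete, here is a minimal sketch (assuming PyTorch and a hypothetical 1-dimensional steering action; the sizes and names are illustrative) of a Gaussian policy that outputs a mean and a standard deviation and samples an action from the resulting distribution:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Maps a state to a Normal(mean, std) distribution over a continuous action."""
    def __init__(self, state_dim=4, action_dim=1):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh())
        self.mean_head = nn.Linear(64, action_dim)
        # Log-std as a learnable parameter keeps the std positive after exp().
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        h = self.body(state)
        dist = torch.distributions.Normal(self.mean_head(h), self.log_std.exp())
        action = dist.sample()                  # e.g. a steering angle
        return action, dist.log_prob(action).sum(-1)

policy = GaussianPolicy()
a, logp = policy(torch.randn(1, 4))             # random state, just for illustration
print(a, logp)
```

The log-std is kept as a free parameter here; predicting it from the state is another common design choice.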
Coming back to the topic: we simply replace the Q-value function with the policy 𝜋. So if we use a neural network, we just replace the last layer with a new one that gives us the probability of taking each action instead of the value of taking that action. It's easy, right? NOOO, it's NOT.
When we use the Q-value function, we rely on the Bellman equation, which expresses the value of the current step via the values of the next step. In other words, the value function tells us how beneficial a state of the environment is, so if we know the value of each state, we can choose the action easily. When we compute the state-value function, we use the Bellman equation to update the parameters.
V(sₜ) ← V(sₜ) + α[Rₜ+γV(sₜ₊₁)− V(sₜ)]… (4)
Target = Rₜ+ γV(sₜ₊₁)
δₜ= Rₜ+ γV (sₜ₊₁) − V (sₜ)
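For reference, here is a minimal sketch of the tabular update in equation (4) (assuming a small discrete state space indexed by integers; all names are illustrative):

```python
import numpy as np

n_states = 10
V = np.zeros(n_states)        # tabular state-value estimates
alpha, gamma = 0.1, 0.99

def td0_update(s, r, s_next, done):
    """One application of equation (4): V(s) <- V(s) + alpha * delta_t."""
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]     # the TD error delta_t
    V[s] += alpha * delta
    return delta
```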
In contrast, when we use a policy directly, we don't know anything about how good the action we take is, so we can't update the parameters of our function to make a good decision in a specific state. So we will define a performance function that gives us a notion of how good the current policy is.
J(θ)=v(θ) … (5)
where v(θ) is the true value function for πθ, the policy determined by θ. Now let's take the derivative of this performance function.
∇θJ(θ)=∇θv(θ) …. (6)
Now we will do a fair amount of math to find the gradient of the policy π. We will consider the episodic setting, but the same reasoning holds for the continuing one. Let's define a trajectory τ as follows:
τ=(s₀,a₀,r₀,s₁,a₁,r₁,…,sₜ,aₜ,rₜ)
R(τ)=∑ₜ R(sₜ,aₜ)
V(θ) = E[∑ₜ₌₀ᵀ R(sₜ,aₜ) | πθ]
=∑τ P(τ,θ)R(τ)
where P(τ,θ) represents the probability distribution over full trajectories when executing policy πθ, so now our goal is:
argmax_θ V = argmax_θ ∑τ P(τ,θ)R(τ) ….(7)
Notice that the parameter θ only appears in the distribution of the trajectories we may encounter under policy πθ. Let's take the derivative w.r.t. θ:
∇θ V(θ) = ∇θ ∑τ P(τ,θ) R(τ)
= ∑τ R(τ) ∇θ P(τ,θ)
= ∑τ R(τ) P(τ,θ) (∇θ P(τ,θ) / P(τ,θ))
= ∑τ P(τ,θ) R(τ) ∇θ log P(τ,θ)
(the last step is the likelihood-ratio, or log-derivative, trick)
We don't know P(τ,θ) for all possible trajectories. To get around that, we run the policy m times and average the outputs; with this empirical estimate we get
∇θ V(θ)=(1/m) ∑ᵢ R(τⁱ) ∇θ logP(τⁱ,θ)….(8)
The above equation shows that if we move in the direction of ∇θV(θ), we push up the log-probability of each sampled trajectory in proportion to how good it is, because we multiply by R(τ), the sum of rewards along trajectory τ. Let's try to decompose the last equation; we can write the probability of a trajectory τ as follows.
∇θlogP(τⁱ,θ)= ∇θ log(μ(s₀) Π P(sⁱⱼ₊₁|sⱼⁱ,aⱼⁱ)πθ(aⱼⁱ|sⱼⁱ)) ….(9)
πθ(aⱼⁱ|sⱼⁱ) is the probability of taking action aⱼ given state sⱼ in trajectory i.
P(sⁱⱼ₊₁|sⱼⁱ,aⱼⁱ) is the probability of observing state sⱼ₊₁ given action aⱼ and state sⱼ in trajectory i.
∇θ log P(τⁱ,θ) = ∇θ[ log μ(s₀) + ∑ⱼ log P(sⁱⱼ₊₁|sⱼⁱ,aⱼⁱ) + ∑ⱼ log πθ(aⱼⁱ|sⱼⁱ) ] = ∑ⱼ ∇θ log πθ(aⱼⁱ|sⱼⁱ)
As we can see, the first two terms are constant w.r.t. θ, so their derivative is 0, and that leaves us with only one term: the derivative of our policy w.r.t. θ. Now we can see why the transformation to log makes sense. Quite nice, isn't it?
∇θ V(θ) = (1/m) ∑ᵢ₌₁ᵐ R(τⁱ) ∑ₜ₌₀ᵀ⁻¹ ∇θ log πθ(aₜⁱ|sₜⁱ) ….(10)
This estimator is unbiased for the policy gradient, since it is an empirical average over m sampled trajectories, but it is very noisy (high variance), because R(τ) may differ a lot from one trajectory to another.
We can write equation (10) as follows.
∇θ V(θ)=E τ∼πθ [R(τ)∑ₜᵀ⁻¹ ∇θlogπθ(aₜ|sₜ)] ….(11)
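With an automatic-differentiation library, we usually implement equations (10)/(11) by building a surrogate loss whose gradient is the (negative) policy gradient. A minimal sketch, assuming PyTorch, where logps[i] is assumed to hold the stacked log πθ(aₜ|sₜ) of trajectory i (with the computation graph kept) and R_total[i] its total reward:

```python
import torch

def policy_gradient_loss(logps, R_total):
    """Surrogate loss: minimizing it performs gradient ascent on equation (11).

    logps   : list of m tensors, each of shape (T_i,), holding log pi_theta(a_t|s_t)
              for one sampled trajectory.
    R_total : list of m floats, the total reward R(tau^i) of each trajectory.
    """
    losses = []
    for logp_traj, R in zip(logps, R_total):
        # Weight the whole trajectory's summed log-probability by its total return.
        losses.append(-R * logp_traj.sum())
    return torch.stack(losses).mean()          # average over the m trajectories
```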
Notice that the reward R(τⁱ) is treated as a single number, a function of the entire trajectory τⁱ. We can break it down into the sum of all the rewards encountered in the trajectory. Using this knowledge, we can derive the gradient estimate for a single reward term rₜ′.
∇θ E[rₜ′] = Eπθ[ rₜ′ ∑ₜ₌₀ᵗ′ ∇θ log πθ(aₜ|sₜ) ] ….(12)
In equation (12), I take the sum only up to t′, because the reward rₜ′ cannot depend on the actions taken after time t′.
Summing equation (12) over all reward terms and swapping the order of the sums gives the reward-to-go form:
∇θ V(θ) = Eπθ[ ∑ₜ₌₀ᵀ⁻¹ ∇θ log πθ(aₜ|sₜ) ∑ₜ′₌ₜᵀ⁻¹ rₜ′ ] ….(13)
This has slightly lower variance than before, and it is what we are going to use to update the parameters in the REINFORCE algorithm.
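Computing the inner sum (the reward-to-go from step t onwards) is straightforward; here is a small sketch in plain Python/numpy, assuming a list of per-step rewards:

```python
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """Entry t is the sum over t' >= t of gamma^(t'-t) * r_{t'}."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


print(rewards_to_go([1.0, 0.0, 2.0]))   # [3. 2. 2.]
```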
The resulting per-step REINFORCE update of the parameters is:
θ ← θ + α γᵗ Gₜ ∇θ log πθ(aₜ|sₜ)
We use the log-derivative (likelihood-ratio) trick that we explained above, and we use γ; but as we said before, in episodic tasks we can always use γ = 1.
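Putting the pieces together, here is a compact sketch of one REINFORCE update (assuming PyTorch, a small discrete-action policy network, and a hypothetical env object whose reset() returns a state and whose step(action) returns (next_state, reward, done); this is an illustration of the update above, not a tuned implementation):

```python
import torch
import torch.nn as nn

# Toy policy: 4-dimensional state -> logits over 2 discrete actions.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def reinforce_update(env, max_steps=500):
    """Sample one episode, then take one gradient step on the REINFORCE objective."""
    log_probs, rewards = [], []
    state = env.reset()                        # hypothetical Gym-style API
    for _ in range(max_steps):
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(action.item())
        rewards.append(reward)
        if done:
            break

    # Reward-to-go G_t for every step, computed backwards.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    returns = torch.tensor(returns)

    # Surrogate loss: - sum_t G_t * log pi_theta(a_t | s_t).
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)
```

Note that the extra γᵗ factor of the per-step update is dropped here, which matches common practice and the remark above that in episodic tasks we can take γ = 1 at that outer level.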
In this article, I introduced the policy gradient method, with enough math to make the update rule make sense, and I also introduced the REINFORCE algorithm, which will be improved in the next article using the baseline technique.
I hope you find this article useful. I am very sorry if there is any error or misspelling in the article; please do not hesitate to contact me or drop a comment to correct things.
Khalil Hennara
AI Engineer at MISRAJ