November 12, 2025
Misraj AI
AI
In this article, I will explain the memory usage of LLMs during training. Large Language Models require substantial computational resources, especially when it comes to memory.
In the last two years, we have seen huge developments in the field of AI, especially around Large Language Models (LLMs), which are models trained on enormous amounts of data. I will not provide any background on LLMs and Transformers; instead, I will jump directly into diagnosing LLM memory usage.
Not many engineers or researchers in the field have looked closely at how LLMs consume memory, especially during the training phase. When I started working on LLMs, it was frustrating to hit the OOM (out of memory) error whenever I changed some parameter in the training script. I tried to find out why this happens given the Transformer structure, but I couldn't find any useful article on the subject.
We know that most state-of-the-art models in NLP are Transformer decoder-only models, such as GPT, Llama, Mistral, Phi, etc. So I decided to understand the memory usage of these kinds of models through manual computation. I worked out by hand how memory is consumed during the training phase, using the implementation of the Llama-2 model on Hugging Face.
I want to mention that this computation applies to the whole Llama and Mistral series, and is close for other models, since almost all of them share the same structure. We will provide an estimate of the MoE memory requirement, using the implementation of the Mixtral model, in another article.
The total memory requirement for any deep neural network during training is as follows (a short code sketch of this breakdown appears right after the list):
1- memory for the model parameters
2- memory for the gradients 'the same as the parameter memory'
3- memory for the optimizer 'twice the parameter memory in the case of Adam, or the same as the parameters in the case of SGD'
4- memory for the activations 'the largest memory-consuming part'
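To make this concrete, here is a minimal sketch of this breakdown in Python. The function and argument names are my own, and it assumes the AdamW optimizer states are stored in the same precision as the weights, which is the accounting used throughout this article.

```python
def training_memory_bytes(n_params, activation_elements, bytes_per_element=2):
    """Rough training-memory estimate from the four components above.

    n_params            : number of trainable parameters
    activation_elements : number of activation values saved for the backward pass
    bytes_per_element   : 2 for bfloat16/float16, 4 for float32
    """
    weights     = n_params              # 1- model parameters
    gradients   = n_params              # 2- gradients, same size as the parameters
    optimizer   = 2 * n_params          # 3- AdamW keeps two moments per parameter
    activations = activation_elements   # 4- activations saved during the forward pass
    return (weights + gradients + optimizer + activations) * bytes_per_element
```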
To make the computation easier, we will compute these factors for each module, 'MLP and Attention', in the decoder layer.
The Llama modeling file on Hugging Face is modeling_llama. We will start with the Llama computation and cover the MoE computation in a future article.

The previous image shows the decoder-only structure from the GPT paper. As we can see, this is one layer of the model, and GPT has 12 such layers, as the figure indicates.
So we can do the computation for one layer and then multiply the memory requirement for this layer by the number of layers.
We will separate the decoder layer into two main blocks, the Attention layer and the MLP layer. Let's start with the MLP layer.
Before going through the nasty job, let's define the parameters we need for all the computations:
batch_size, seq_length, hidden_size, num_layers, intermediate_size, precision, num_heads.
I hope these parameters are familiar to the reader. Some may know seq_length as the context length; it is the number of tokens in each sequence and, as we know, it is the bottleneck of the attention mechanism. hidden_size is the parameter that controls the size of the residual connection within the decoder layer. precision is the number of bytes required for each parameter 'float32 (4 bytes), bfloat16 & float16 (2 bytes)'. num_heads is the number of attention heads. You can read more about these terms if you don't know exactly what they mean.
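For reference, here are these values for the Llama-2 7B configuration on Hugging Face; seq_length and batch_size are training choices rather than model configuration, so the values below are simply the ones used in the examples later in this article.

```python
# Llama-2 7B configuration values (config.json on Hugging Face),
# plus the training choices used in the examples below.
llama2_7b = {
    "hidden_size": 4096,
    "intermediate_size": 11008,
    "num_layers": 32,        # num_hidden_layers in the config
    "num_heads": 32,         # num_attention_heads in the config
    "precision": 2,          # bytes per value for bfloat16/float16
    "seq_length": 2048,      # training choice, not part of the config
    "batch_size": 8,         # training choice, not part of the config
}
```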
If you open this link, LlamaMLP, you will jump to the implementation of the MLP layer of the Llama model, and you will see that this module has three linear layers as follows:
gate_proj (hidden_size, intermediate_size)
up_proj (hidden_size, intermediate_size)
down_proj (intermediate_size, hidden_size)
memory_model = 3* (hidden_size * intermediate_size) … (1)
We assume the optimizer is AdamW for all computations, so the memory requirement for the optimizer is twice the model requirement, since Adam keeps two moment estimates per parameter. The gradient memory for this layer 'as for any DNN' is the same as memory_model.
gradient = memory_model …(2)
optimizer = 2* memory_model … (3)
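To sanity-check equation (1), you can build the same three projections with plain torch.nn.Linear layers (Llama uses no biases) and count their parameters. This is a standalone sketch rather than the Hugging Face module itself.

```python
import torch.nn as nn

hidden_size, intermediate_size = 4096, 11008  # Llama-2 7B values

# The three linear layers of LlamaMLP; Llama uses bias=False.
gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
up_proj   = nn.Linear(hidden_size, intermediate_size, bias=False)
down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

n_params = sum(p.numel()
               for layer in (gate_proj, up_proj, down_proj)
               for p in layer.parameters())
assert n_params == 3 * hidden_size * intermediate_size  # equation (1)

# Gradients add another n_params, and AdamW adds 2 * n_params,
# matching equations (2) and (3).
```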
The activations are the values we compute during the forward pass in order to compute the gradients in the backward pass; all of these activations are saved at each forward pass, so we need a constant amount of activation memory per forward pass. Let's compute the activation memory for the MLP. If we compute the activations for one token, then within the forward function we have the following:
~ hidden_size + intermediate_size + intermediate_size = hidden_size + 2 * intermediate_size
That is for one token; if we want to do the same for a sequence of tokens, we multiply by the sequence length:
activation ~= seq_length * hidden_size + 2 * seq_length * intermediate_size
One last thing: usually when we train a model we use mini-batches, so we multiply the previous expression by the batch_size.
activation = batch_size * seq_length * (hidden_size + 2 * intermediate_size) … (4)
Now let’s sum up everything,
MLP_memory = ((1) + (2) + (3) + (4)) * precision = (4 * memory_model + activation) * precision
MLP_memory = (12 * (hidden_size * intermediate_size) + batch_size * seq_length * (hidden_size + 2 * intermediate_size)) * precision
e.g., for the Llama-2 7B model,
hidden_size = 4096, intermediate_size = 11008, precision = 2 bytes, seq_length = 2048, batch_size = 8
MLP_memory ≈ 1.8 GB. With this, we have computed the memory requirement for one MLP layer.
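The arithmetic can be reproduced with a few lines of Python; this is just the formula above written out, with my own variable names.

```python
hidden_size, intermediate_size = 4096, 11008
batch_size, seq_length, precision = 8, 2048, 2  # bfloat16 -> 2 bytes

weights_grads_optimizer = 12 * hidden_size * intermediate_size                 # (1) + (2) + (3)
activation = batch_size * seq_length * (hidden_size + 2 * intermediate_size)   # (4)

mlp_memory = (weights_grads_optimizer + activation) * precision
print(f"{mlp_memory / 2**30:.2f} GiB")  # ~1.8 GiB for one MLP layer
```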
Let’s move on to the Attention layer.
As before, you can jump to the Attention layer implementation at this link: LlamaAttention. And as before, we will split the total memory into four sections: model, gradients, optimizer, and activations.
q_proj: (hidden_size, num_heads * head_dim)
k_proj: (hidden_size, num_key_value_heads * head_dim)
v_proj: (hidden_size, num_key_value_heads * head_dim)
o_proj: (hidden_size, hidden_size)
These are the Attention parameters. For simplicity we will not use num_key_value_heads in the computation, but it is a way to reduce computation and memory; for more details you can read about Grouped-Query Attention. As we did before, the model memory is
model_memory = 3 * hidden_size * num_heads * head_dim + hidden_size² … (5)
gradients = model_memory …(6)
optimizer = 2* model_memory …(7)
As we mentioned above, the activations are the biggest memory consumer, especially in the attention layer. If you take a look at the forward pass in the LlamaAttention module, you will see these five matrices:
query_states: (batch_size, num_heads, seq_len, head_dim)
key_states: (batch_size, num_key_value_heads, seq_len, head_dim)
value_states: (batch_size, num_key_value_heads, seq_len, head_dim)
attn_weights: (batch_size, num_heads, seq_len, seq_len)
attn_output: (batch_size, seq_len, hidden_size)
head_dim = hidden_size // num_heads
The total memory requirement for activation is,
activation = batch_size * ( 3 * num_heads * seq_length * head_dim + num_heads * seq_length² + seq_length * hidden_size)
attention_memory = (4 * (3 * hidden_size * num_heads * head_dim + hidden_size²) + batch_size * (3 * num_heads * seq_length * head_dim + num_heads * seq_length² + seq_length * hidden_size)) * precision
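Plugging in the same Llama-2 7B values gives a feel for how much larger the attention block is; the num_heads * seq_length² term in the activations is the dominant part. Again, this is just the formula above written as a short script with my own variable names.

```python
hidden_size, num_heads = 4096, 32
batch_size, seq_length, precision = 8, 2048, 2  # bfloat16 -> 2 bytes
head_dim = hidden_size // num_heads

weights_grads_optimizer = 4 * (3 * hidden_size * num_heads * head_dim + hidden_size**2)  # (5) + (6) + (7)
activation = batch_size * (3 * num_heads * seq_length * head_dim
                           + num_heads * seq_length**2
                           + seq_length * hidden_size)

attention_memory = (weights_grads_optimizer + activation) * precision
print(f"{attention_memory / 2**30:.2f} GiB")  # ~3.0 GiB for one attention layer
```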
All the previous computations are for one layer, so to compute the full memory requirement you need to multiply by the number of layers:
total_model_memory = num_layers * (MLP_memory + attention_memory)
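Putting the two per-layer estimates together for all 32 layers of Llama-2 7B gives a rough total. Like the formulas above, this sketch ignores the embeddings, the LM head, and the layer norms, so treat it as an estimate rather than an exact number.

```python
def total_training_memory_gib(hidden_size, intermediate_size, num_heads, num_layers,
                              batch_size, seq_length, precision=2):
    """Per-layer MLP + attention estimate from this article, summed over all decoder layers."""
    head_dim = hidden_size // num_heads

    mlp = (12 * hidden_size * intermediate_size
           + batch_size * seq_length * (hidden_size + 2 * intermediate_size))

    attention = (4 * (3 * hidden_size * num_heads * head_dim + hidden_size**2)
                 + batch_size * (3 * num_heads * seq_length * head_dim
                                 + num_heads * seq_length**2
                                 + seq_length * hidden_size))

    return num_layers * (mlp + attention) * precision / 2**30

# Llama-2 7B, batch_size 8, seq_length 2048, bfloat16:
print(total_training_memory_gib(4096, 11008, 32, 32, 8, 2048))  # ~154 GiB
```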
This is the memory you need to train a decoder-only model. I have created a space, llm-memory-requiremen, on Hugging Face; you can take a look at it and play around with the parameters to estimate what you need to train a model with different configurations.
In conclusion, this article showed how to estimate the memory requirement of a decoder-only model. We used the Llama implementation to compute it, and we can see the memory overhead from the attention computation, which grows quadratically with seq_length.
Khalil Hennara
AI Engineer at MISRAJ