Transformers

Niket Girdhar / March 18, 2025

Encoder - Decoder Architecture

Transformers are built on the self-attention-based encoder-decoder architecture, which is designed to handle sequence-to-sequence tasks efficiently.

Encoder:

  • The encoder takes the input sequence and converts it into a matrix representation.
  • It processes the entire sequence in parallel rather than one step at a time, using self-attention mechanisms to capture relationships between words regardless of their position in the sequence.
  • The output is a rich, context-aware representation of the input sequence.

Decoder:

  • The decoder processes the matrix representation generated by the encoder and iteratively generates the output sequence.
  • Similar to the encoder, it uses self-attention to model the context of the sequence but also attends to the encoder’s output through cross-attention mechanisms.
  • This generation proceeds step by step, with each prediction conditioned on the previously generated tokens and the encoder's representation.

Note that both the input to the encoder and the output of the decoder are of variable length, making the transformer architecture flexible and adaptable to different kinds of sequence generation tasks.
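To make the data flow concrete, here is a minimal sketch using PyTorch's built-in nn.Transformer. The dimensions, vocabulary size, and random token IDs are illustrative assumptions, not details taken from any specific model.

```python
import torch
import torch.nn as nn

# Minimal encoder-decoder sketch; sizes (d_model=32, 4 heads, vocab of 100) are arbitrary.
d_model, vocab_size = 32, 100
embed = nn.Embedding(vocab_size, d_model)
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)
to_vocab = nn.Linear(d_model, vocab_size)

src = torch.randint(0, vocab_size, (1, 7))  # variable-length input sequence
tgt = torch.randint(0, vocab_size, (1, 5))  # output tokens generated so far

# The encoder turns src into a context-aware matrix; the decoder attends to that
# matrix (cross-attention) and to its own previous tokens (masked self-attention).
tgt_mask = model.generate_square_subsequent_mask(tgt.size(1))
out = model(embed(src), embed(tgt), tgt_mask=tgt_mask)
logits = to_vocab(out)       # scores for the next token at each decoder position
print(logits.shape)          # torch.Size([1, 5, 100])
```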


Language Models

In language modeling tasks, the goal is to train a model to predict the next word or fill in missing words in a sequence, helping it learn the structure of the language. There are two primary types of language models:

1. Auto-regressive:

  • Auto-regressive models generate text by predicting one word at a time, conditioned on the previously generated words, much like autocomplete.
  • Suited to Natural Language Generation (NLG).
  • The model builds the output sequence step by step, using its past predictions as context for future ones.
  • Examples: the GPT family

2. Auto-encoding:

  • Auto-encoding models learn to reconstruct the original input sequence by predicting its missing parts, typically in a masked language modeling fashion.
  • Suited to Natural Language Understanding (NLU).
  • These models see the entire sequence at once and use bidirectional self-attention over the full context to predict the masked tokens.
  • Examples: BERT

Each of these approaches has different strengths and is suited to different kinds of language tasks, such as text generation or text understanding.
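A rough sketch of the contrast using the Hugging Face transformers library; the specific checkpoints (gpt2, bert-base-uncased) and prompts are my own illustrative choices, not part of the discussion above.

```python
from transformers import pipeline

# Auto-regressive (NLG): predict the next tokens left-to-right, like autocomplete.
generator = pipeline("text-generation", model="gpt2")
print(generator("I went to the bank to", max_new_tokens=10)[0]["generated_text"])

# Auto-encoding (NLU): reconstruct a masked token using context from both sides.
filler = pipeline("fill-mask", model="bert-base-uncased")
print(filler("I went to the [MASK] to deposit money.")[0]["token_str"])
```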


Self-Attention vs. Cross-Attention in Transformers

Transformers utilize both self-attention and cross-attention at different stages of processing:

1. Self-Attention

  • Where? Encoder and decoder layers.
  • Function: Each token attends to every other token in the same sequence.
  • Purpose: Captures context and dependencies within a sequence.
  • Example: In the sentence "I went to the bank", self-attention lets the ambiguous word "bank" draw on the surrounding words to resolve its meaning.

2. Cross-Attention

  • Where? Decoder layers.
  • Function: The decoder attends to the encoder’s output representations.
  • Purpose: Aligns the source sequence (encoder output) with the target sequence (decoder generation).
  • Example: In translation, the decoder aligns "gato" with "cat" while generating the target sentence.

Transformers rely on both self-attention (to understand internal sequence relationships) and cross-attention (to relate different sequences).
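The difference is simply where the queries, keys, and values come from. Below is a stripped-down sketch using plain scaled dot-product attention; the learned query/key/value projections of a real transformer layer are omitted, and the tensor sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # Scaled dot-product attention: each query position attends over all keys.
    scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

d = 16
enc_states = torch.randn(7, d)  # encoder output for a 7-token source sentence
dec_states = torch.randn(4, d)  # decoder states for 4 generated target tokens

# Self-attention: queries, keys, and values all come from the same sequence.
self_out = attention(dec_states, dec_states, dec_states)

# Cross-attention: queries come from the decoder, keys and values from the
# encoder, aligning each target position with the relevant source positions.
cross_out = attention(dec_states, enc_states, enc_states)
print(self_out.shape, cross_out.shape)  # torch.Size([4, 16]) torch.Size([4, 16])
```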


Multi-Headed Attention

Multi-Head Attention is a key mechanism used in transformer models, like those behind BERT and GPT, to improve the ability of a model to process information in parallel and capture various relationships within the data.

How it Works:

  • Parallel Attention: Instead of having a single attention mechanism, Multi-Head Attention runs multiple attention mechanisms (heads) in parallel. Each head learns to focus on different parts of the input sequence, capturing different relationships or patterns.

  • Learned Representations: Each attention head works on a different learned linear projection of the input. The results from all heads are concatenated and passed through a final linear transformation to produce the output.

Why it's Useful:

  • Capturing Diverse Features: Each attention head can focus on a different aspect of the input, such as different tokens, positions, or relationships. This helps the model understand complex contexts and nuanced patterns.
  • Resource Efficiency: The model can capture these different relationships simultaneously without needing multiple passes through the data, making it computationally efficient.

So Multi-Head Attention improves the model's ability to handle complex input relationships by capturing various features at once, providing richer and more detailed representations.
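Here is a small sketch with PyTorch's nn.MultiheadAttention showing the head-splitting behaviour; the model width, head count, and sequence length are arbitrary assumptions.

```python
import torch
import torch.nn as nn

d_model, num_heads, seq_len = 32, 4, 6
mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)  # one sequence of 6 token vectors

# Internally each head gets its own learned projection of x (d_model / num_heads
# dimensions per head), attends in parallel, and the per-head results are
# concatenated and passed through a final linear layer.
out, weights = mha(x, x, x, average_attn_weights=False)
print(out.shape)      # torch.Size([1, 6, 32])   combined output
print(weights.shape)  # torch.Size([1, 4, 6, 6]) one attention map per head
```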


Transfer Learning

Transfer learning involves taking a model that has already been trained on a large dataset for a particular task and reusing it for a new but related task. Instead of training a model from scratch, which can be computationally expensive and time-consuming, you can leverage knowledge learned from the first task to boost performance on a second task.

Key Concepts:

  1. Source Task (Pre-trained Model): The model is initially trained on a large dataset for a general task (e.g., image classification on ImageNet, language modeling, etc.). This pre-trained model has already learned useful features or representations from the source task.

  2. Target Task: This is the new task on which you want to use the pre-trained model. The target task often has less data, and the goal is to adapt the model to perform well on this task, leveraging the knowledge gained from the source task.

Steps Involved in Transfer Learning:

  1. Select a Source Model:

    • Choose a pre-trained model that has been trained on a similar or general task.
    • Common source models include ResNet, VGG, BERT, GPT, etc.
    • The source model should have learned useful feature representations that can be helpful for the target task.
  2. Reuse the Pre-trained Model:

    • Feature Extraction: You can freeze most of the layers of the pre-trained model and train only the final layers. This is typically used when the target task is quite similar to the source task.
    • Fine-tuning: You can unfreeze the pre-trained layers and train the entire model on the target task, adjusting the pre-trained weights to better suit the new task (a code sketch of both options follows this list).
  3. Train the Model on the Target Task:

    • Re-training: If necessary, retrain the model on data from the target task. You may train only the later layers or the entire model, depending on how similar the two tasks are.
    • Transfer Knowledge: The pre-trained layers of the model should help improve generalization on the target task, especially when the target task has limited labeled data.
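
As a concrete illustration, here is a minimal sketch with the Hugging Face transformers library; the checkpoint name (bert-base-uncased) and the two-label classification setup are assumptions made for the example, not part of any particular recipe above.

```python
from transformers import AutoModelForSequenceClassification

# Load a pre-trained source model and attach a new, untrained classification head.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Feature extraction: freeze the pre-trained encoder and train only the new head.
for param in model.bert.parameters():
    param.requires_grad = False

# Fine-tuning instead: leave everything trainable (usually with a small learning
# rate) so the pre-trained weights adapt to the target task.
# for param in model.parameters():
#     param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # only the classification head here
```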