T5 – Text-To-Text Transfer Transformer

Niket Girdhar / July 9, 2025

T5 (Text-To-Text Transfer Transformer) is an innovative model by Google that reframes all Natural Language Processing (NLP) tasks into a unified text-to-text format. Whether it’s translation, summarization, or classification, T5 treats both inputs and outputs as plain text, enabling a highly flexible architecture.

Unlike earlier models that use only encoders (like BERT) or only decoders (like GPT), T5 leverages the full Transformer architecture with both an encoder stack and a decoder stack. This makes it capable of handling a wide range of tasks seamlessly.
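
Below is a minimal sketch of the text-to-text interface. It assumes the Hugging Face transformers library and the "t5-small" checkpoint (neither is part of this post); the task is selected purely by a plain-text prefix.

    # Minimal text-to-text sketch (assumes Hugging Face `transformers` and
    # the "t5-small" checkpoint). The task is chosen only by the text prefix.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    text = "translate English to German: The weather is nice today."
    input_ids = tokenizer(text, return_tensors="pt").input_ids

    output_ids = model.generate(input_ids, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    # e.g. "Das Wetter ist heute schön."

Swapping "translate English to German:" for "summarize:" (or any other prefix the model was trained with) changes the task without touching the model or the tokenizer.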


Why T5 is Different

  • Encoder-only Models (e.g., BERT):
    Strong at understanding tasks such as text classification and question answering, but they cannot generate free-form text on their own.

  • Decoder-only Models (e.g., GPT):
    Well suited to text generation and conversational AI, but their left-to-right attention gives them a weaker, one-directional view of the input than a bidirectional encoder.

  • T5’s Encoder-Decoder Architecture:
    Combines the strengths of both approaches in a unified framework: any NLP task can be handled by casting it as a text-to-text problem (the short sketch after this list shows both stacks inside one model).
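
The sketch below (again assuming Hugging Face transformers and the "t5-small" checkpoint) simply confirms that one T5 model carries both a full encoder stack and a full decoder stack.

    # A T5 checkpoint contains both a full encoder and a full decoder stack
    # (Hugging Face `transformers` assumed; "t5-small" is illustrative).
    from transformers import T5ForConditionalGeneration

    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    print(type(model.get_encoder()).__name__)   # T5Stack (bidirectional, BERT-like)
    print(type(model.get_decoder()).__name__)   # T5Stack (causal, GPT-like, with cross-attention)
    print(model.config.num_layers,              # encoder layers (6 for t5-small)
          model.config.num_decoder_layers)      # decoder layers (6 for t5-small)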


Pre-trained Tasks in T5

Alongside its unsupervised objective on the large C4 web corpus, the original T5 checkpoints are pre-trained on a mixture of supervised text-to-text tasks, including:

  • Translation: Convert text between languages (e.g., English to German).
  • Summarization: Condense long articles into concise summaries.
  • STSB (Semantic Textual Similarity Benchmark): Measure how similar two sentences are in meaning.
  • CoLA (Corpus of Linguistic Acceptability): Judge whether a sentence is grammatically acceptable.

By unifying these tasks behind short text prefixes, T5 simplifies transfer learning and fine-tuning; the serialization used for each task is sketched below.
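
The snippet below shows how the four tasks above are turned into plain (input, target) string pairs. The prefix formats follow the original T5 paper; the example sentences and the rounded STS-B score are illustrative.

    # How the four pre-training tasks are serialized as text-to-text pairs.
    # Prefix formats follow the T5 paper; example sentences are illustrative.
    examples = [
        # (model input, target text)
        ("translate English to German: That is good.", "Das ist gut."),
        ("summarize: <long news article>", "<short summary>"),
        ("stsb sentence1: The rhino grazed on the grass. "
         "sentence2: A rhino is grazing in a field.", "3.8"),
        ("cola sentence: The course is jumping well.", "unacceptable"),
    ]

    for source, target in examples:
        print(f"{source!r}  ->  {target!r}")

Even a regression task like STS-B becomes text generation: the similarity score is rounded and emitted as a string, and CoLA's binary label becomes the word "acceptable" or "unacceptable".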


Key Innovations in T5

  • Sentinel Tokens: Unlike BERT, which masks individual tokens, T5 drops whole contiguous spans from the input, replaces each span with a single sentinel token, and predicts all of the dropped spans at once (see the example after this list).
  • Deshuffling Objective: One of the alternative pre-training objectives explored in the T5 study, in which the model learns to restore the original order of a shuffled sequence.
  • Cross-Attention in the Decoder: Lets every decoder layer attend to the encoder outputs, so generation is conditioned on the full input.
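
Here is the span-corruption example from the T5 paper, written out with the sentinel tokens (<extra_id_0>, <extra_id_1>, ...) used by the released T5 vocabularies; the sentence itself is the paper's illustrative example.

    # Span corruption: two spans are dropped from the input and each is
    # replaced by one sentinel token; the target lists the dropped spans,
    # each introduced by its sentinel, with a final sentinel marking the end.
    original        = "Thank you for inviting me to your party last week ."
    corrupted_input = "Thank you <extra_id_0> me to your party <extra_id_1> week ."
    target          = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"

    print(corrupted_input)
    print(target)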

Training Objectives

T5’s authors compared several unsupervised pre-training strategies before settling on one:

  • Masked Span Prediction (span corruption): Predict spans that have been dropped out of the input; this objective performed best in the paper’s comparison and is the one the released models are trained with (see the training sketch after this list).
  • Autoregressive Language Modeling: Generate text one token at a time, like GPT.
  • Deshuffling: Restore the original word order of a randomly shuffled input.
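
A minimal training-step sketch for the span-corruption objective, reusing the pair from the sentinel-token example above (Hugging Face transformers and "t5-small" are assumed; in real pre-training the pairs are generated automatically from raw corpus text):

    # One training step for span corruption. Passing `labels` makes the
    # model shift them internally for teacher forcing and return the
    # cross-entropy loss over the target spans.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")
    model.train()

    input_ids = tokenizer(
        "Thank you <extra_id_0> me to your party <extra_id_1> week .",
        return_tensors="pt",
    ).input_ids
    labels = tokenizer(
        "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>",
        return_tensors="pt",
    ).input_ids

    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()   # an optimizer step would follow in a real training loop
    print(loss.item())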

Use Cases for T5

  • Machine Translation
  • Question Answering
  • Abstractive Summarization
  • Text Simplification
  • Grammar Correction

T5’s encoder-decoder architecture combines the bidirectional understanding of a BERT-style encoder with the generative power of a GPT-style decoder, making it one of the most versatile NLP models to date.