GPT – Generative Pre-trained Transformers

Niket Girdhar / May 2, 2025

GPT

GPT (Generative Pre-trained Transformer) is an autoregressive language model that has become a cornerstone in natural language generation. Developed by OpenAI, GPT is built using only the decoder blocks of the Transformer architecture, enabling it to generate text by predicting one token at a time based on prior context.
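To make the autoregressive idea concrete, here is a minimal sketch of greedy next-token decoding using the public gpt2 checkpoint via the Hugging Face transformers library (an assumed dependency, not part of the original post): the model looks at everything generated so far and predicts one more token, in a loop.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes the Hugging Face `transformers` library and the public gpt2 checkpoint.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The Transformer architecture", return_tensors="pt").input_ids

# Autoregressive generation: repeatedly predict the next token from everything so far.
for _ in range(20):
    with torch.no_grad():
        logits = model(ids).logits        # (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()      # greedy pick of the most likely next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```

In practice you would call `model.generate()` (shown later in this post) rather than write the loop by hand, but the loop makes the strictly left-to-right dependence explicit.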

Here’s what makes GPT powerful:

  • Autoregressive: GPT generates the next token using only the leftward (past) context, ideal for generation tasks like storytelling or conversation.
  • Decoder-only Transformer: Unlike encoder-only models such as BERT, GPT uses masked (causal) self-attention so that a token cannot attend to future tokens during training.
  • Self-Attention: Uses scaled dot-product attention to capture relationships between tokens across long sequences (see the sketch after this list).
  • Byte-level Tokenization: Allows flexibility in handling multiple languages and special characters by encoding text at the byte level.
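Below is a minimal NumPy sketch of the masked scaled dot-product attention described above, for a single head; the matrix sizes, random weights, and inputs are purely illustrative.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product attention with a causal mask (toy, single head)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv              # project input to queries/keys/values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise attention scores
    mask = np.triu(np.ones_like(scores), k=1)     # 1s above the diagonal = future tokens
    scores = np.where(mask == 1, -1e9, scores)    # block attention to future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over past tokens only
    return weights @ V

# Toy example: 5 tokens, model dimension 8
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (5, 8) -- each token attends only to itself and earlier tokens
```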

Key Components:

  • <|endoftext|> Token: Special token used to mark the end of a sequence.
  • Token & Positional Embeddings: The input representation is the sum of a context-free token embedding and a position embedding (a minimal sketch follows this list).
  • No Segment Embeddings: GPT treats the input as a single continuous sequence, unlike BERT, which splits it into segments A and B.
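Here is a minimal sketch of how the input representation is assembled, assuming GPT-2-style sizes (a vocabulary of 50,257 byte-level BPE tokens, 1024 positions, 768-dimensional embeddings); the random matrices stand in for learned parameters and the token ids are made up.

```python
import numpy as np

vocab_size, max_len, d_model = 50257, 1024, 768     # GPT-2-style sizes (illustrative)
rng = np.random.default_rng(0)
token_emb = rng.normal(size=(vocab_size, d_model))  # one vector per token id (learned in practice)
pos_emb = rng.normal(size=(max_len, d_model))       # one vector per position (learned in practice)

token_ids = np.array([464, 2068, 7586, 21831])      # hypothetical token ids
positions = np.arange(len(token_ids))

# Input representation = token embedding + position embedding (no segment embeddings)
x = token_emb[token_ids] + pos_emb[positions]
print(x.shape)  # (4, 768)
```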

GPT Models (Parameter Counts):

  • GPT-1: 117M parameters
  • GPT-2: up to 1.5B parameters, released in four sizes:
    • Small (117M)
    • Medium (345M)
    • Large (762M)
    • XL (1542M)
  • GPT-3: 175B parameters

Pre-Training GPT

GPT is pre-trained on a massive text corpus with an autoregressive objective: predicting the next token in a sequence. Unlike BERT, GPT doesn’t use Masked Language Modeling or Next Sentence Prediction.

Pre-Training Objective:

  • Causal Language Modeling (CLM): Predicts the next token in a sentence while attending only to past tokens, which is what gives GPT its generative ability (a loss sketch follows this list).
  • GPT-2 was trained on WebText, a roughly 40GB corpus of web pages linked from Reddit posts: vast, but sometimes biased, data.
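A minimal PyTorch sketch of the causal LM loss, using toy logits from a hypothetical decoder: each position is trained to predict the token that follows it, so the targets are simply the input ids shifted one step to the left.

```python
import torch
import torch.nn.functional as F

# Toy logits from a hypothetical decoder: (batch=1, seq_len=5, vocab=10)
logits = torch.randn(1, 5, 10)
input_ids = torch.tensor([[2, 7, 1, 4, 9]])

# Causal LM loss: each position predicts the *next* token,
# so drop the last logit and the first target.
shift_logits = logits[:, :-1, :]
shift_labels = input_ids[:, 1:]
loss = F.cross_entropy(
    shift_logits.reshape(-1, shift_logits.size(-1)),
    shift_labels.reshape(-1),
)
print(loss.item())
```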

Few-Shot, One-Shot, and Zero-Shot Learning

GPT models excel in flexible learning paradigms:

  • Zero-shot: Perform a task with no examples
  • One-shot: Perform a task after seeing one example
  • Few-shot: Learn from a handful (2–10) of examples provided directly in the context window, with no gradient updates (see the prompt sketch below)

Context length for GPT-2: 1024 tokens
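Here is a sketch of what a few-shot prompt looks like in practice; the sentiment-classification task, reviews, and labels are made up for illustration, and the whole prompt has to fit inside the model’s context window (1024 tokens for GPT-2).

```python
# A few-shot prompt: the task is demonstrated in-context, no fine-tuning involved.
# The examples and labels here are hypothetical.
examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I wouldn't recommend this restaurant to anyone.", "negative"),
    ("The concert exceeded all my expectations.", "positive"),
]
query = "The service was slow and the food arrived cold."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"  # the model is expected to continue with the label

print(prompt)
```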


Fine-Tuning & Inference Parameters

GPT models can be fine-tuned for tasks like style imitation, code generation, or dialogue. Generation can also be controlled at inference time through decoding parameters (a sketch follows the list below):

  • temperature – controls the randomness/creativity of sampling
  • top_k, top_p – restrict sampling to the k most likely tokens, or to a nucleus of tokens whose cumulative probability is p
  • num_beams, do_sample – beam search (more deterministic) vs. stochastic sampling
  • repetition_penalty, presence_penalty – reduce repetition and encourage novelty
  • stop_sequence – define stopping points for output
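A sketch of how several of these knobs map onto the Hugging Face transformers `generate()` call with the public gpt2 checkpoint (an assumed setup, not from the original post); note that presence_penalty and stop_sequence are parameters of the OpenAI API rather than of `generate()`, which uses eos_token_id or stopping criteria instead.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes the Hugging Face `transformers` library and the public gpt2 checkpoint.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Once upon a time", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,            # stochastic sampling instead of greedy/beam search
    temperature=0.8,           # lower = more deterministic, higher = more creative
    top_k=50,                  # sample only from the 50 most likely tokens
    top_p=0.95,                # nucleus sampling: smallest set covering 95% probability
    repetition_penalty=1.2,    # discourage repeating earlier tokens
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```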

Limitations: Overfitting & Bias

GPT models are highly expressive, which makes them prone to overfitting: they can memorize training text and generalize narrowly. In addition, biases present in the training data can surface during generation. These challenges call for careful model design, dataset curation, and tuning.


Use Cases for GPT:

  • Text generation
  • Code completion
  • Dialogue systems
  • Content summarization
  • Few-shot classification
  • Creative writing & storytelling

With its flexible decoder-based architecture and few-shot learning ability, GPT continues to push the boundaries of what language models can do.