Transformers Explained — Why They Changed Deep Learning Forever
The architecture that made machines better at language, vision, and pretty much everything else.
iamtraction · 6 min read · May 3, 2023

"Attention is all you need." — Vaswani et al., 2017 1

When researchers at Google Brain published that paper title in 2017, it sounded bold. Maybe even smug. But they were right. Transformers didn't just improve how we do NLP; they changed the trajectory of deep learning entirely.

Today, transformers are the backbone of models like GPT and BERT and sit at the core of nearly every cutting-edge AI model in language, vision, and beyond. Let's unpack why this architecture became the de facto foundation for modern machine learning.

What is a Transformer?

At its core, a transformer is a neural network architecture built around a mechanism called self-attention. It's a way for the model to evaluate relationships between all tokens in an input sequence simultaneously, assigning dynamic weights to each one.

No recurrence, no convolutions. Just attention.

While the original transformer architecture is built entirely on attention mechanisms (no recurrence, no convolutions), later variants sometimes reintroduce convolution (e.g., for local structure in images or speech) or recurrence (e.g., in audio models like the Conformer [2]) for domain-specific improvements [3][4].

A Quick Look Back

Before transformers, deep learning relied heavily on recurrent or convolutional architectures:

  • RNNs and LSTMs – Good at handling sequential data but struggled with long-term dependencies, were slow to train, and hard to parallelize.
  • CNNs – Efficient and spatially aware, especially in vision tasks, but poorly suited for modeling sequences or variable-length inputs.
  • Encoder-decoder RNNs – Enabled early breakthroughs in translation and sequence generation but inherited the same limitations from their recurrent structure.

All of these architectures faced challenges with long context, training efficiency, and generalization.

The Breakthrough

The key innovation in transformers is the self-attention mechanism. Instead of processing sequences one step at a time, the model compares every token to every other token in the input, all at once.

Self-attention allows the model to:

  • Compute pairwise interactions between all tokens in a sequence, enabling dynamic context-aware representations.
  • Capture dependencies regardless of position, so tokens far apart in the sequence can directly influence each other.

Let's take the following sentence as an example:

"The animal didn't cross the street because it was too tired."

The word "it" could refer to either "animal" or "street". Transformers can simultaneously weigh both options and use global context to resolve ambiguity.

This mechanism is efficient, scalable, and forms the core of every transformer block.
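
To make this concrete, here's a minimal NumPy sketch of single-head scaled dot-product self-attention. The toy dimensions and weight matrices are illustrative, not taken from any real model:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_head)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise scores, shape (seq_len, seq_len)
    weights = softmax(scores, axis=-1)        # every token attends to every token at once
    return weights @ V                        # context-aware token representations

# toy example: 5 tokens, model dim 8, head dim 4
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = [rng.normal(size=(8, 4)) for _ in range(3)]
print(self_attention(X, Wq, Wk, Wv).shape)    # (5, 4)
```

Note how the `scores` matrix contains one entry per token pair. That is where both the power (global context) and the quadratic cost of attention come from.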

Key Components

Transformers are built from a few foundational components that work together to model complex relationships within sequences.

1. Multi-Head Self-Attention

Rather than attending to input tokens with a single mechanism, transformers use multiple parallel attention heads. Each head learns different types of relationships (e.g., syntactic, semantic) by projecting the inputs into different subspaces and performing scaled dot-product attention.

This allows the model to capture diverse aspects of the sequence in parallel.
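
A hedged sketch of how the head split works in practice, continuing the NumPy style above (shapes and names are illustrative):

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Minimal multi-head self-attention sketch.
    X: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def project(W):  # project, then split the model dimension into independent heads
        return (X @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = project(Wq), project(Wk), project(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)    # softmax per head
    heads = weights @ V                                    # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                                     # mix the heads back together
```

Each head runs the same attention computation in its own lower-dimensional subspace, and the final projection blends what the heads found.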

2. Positional Encodings

Since transformers process input tokens simultaneously (not sequentially), they lack an inherent sense of order. To address this, positional encodings are added to token embeddings. These can be:

  • Fixed (sinusoidal) as in the original paper, or
  • Learnable, which are more common in later variants.

This gives the model information about the position of each token in the sequence.
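
Here's a short NumPy sketch of the fixed sinusoidal scheme from the original paper (the variable names and layout are mine):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal encodings: sine on even dimensions, cosine on odd dimensions."""
    positions = np.arange(seq_len)[:, None]                             # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                                  # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe  # added element-wise to the token embeddings before the first block
```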

3. Feed-Forward Networks (FFNs)

After the attention step, each token representation is independently passed through a position-wise feed-forward network, typically a two-layer MLP with a non-linear activation such as ReLU or GELU.

This adds non-linearity and enables the model to mix and transform the attended representations.
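
A minimal sketch of that position-wise FFN, assuming ReLU and NumPy inputs:

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise FFN: the same two-layer MLP applied to every token independently.
    Typical sizes: d_model -> 4 * d_model -> d_model."""
    hidden = np.maximum(0.0, X @ W1 + b1)  # ReLU non-linearity (GELU is also common)
    return hidden @ W2 + b2
```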

4. Residual Connections and Layer Normalization

Each sub-layer (attention and FFN) is wrapped with:

  • A residual (skip) connection, which helps with gradient flow and combats vanishing gradients in deep networks.
  • A LayerNorm operation, applied either before (Pre-LN, now common for stability) or after (Post-LN, as in the original transformer) the sub-layer, depending on the implementation.

These components stabilize training and accelerate convergence.
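
Putting the pieces together, a Pre-LN block can be sketched like this. It's a simplification: real implementations use learnable scale and bias in LayerNorm, plus dropout, and `X` is assumed to be a NumPy array of shape (seq_len, d_model):

```python
def layer_norm(X, eps=1e-5):
    """Normalize each token's feature vector to zero mean and unit variance."""
    mean = X.mean(axis=-1, keepdims=True)
    std = X.std(axis=-1, keepdims=True)
    return (X - mean) / (std + eps)

def pre_ln_block(X, attention, ffn):
    """One Pre-LN transformer block: normalize, transform, then add the residual.
    `attention` and `ffn` are callables like the sketches above."""
    X = X + attention(layer_norm(X))  # residual connection around self-attention
    X = X + ffn(layer_norm(X))        # residual connection around the feed-forward network
    return X
```

Swapping the order of the norm and the residual add is all that separates Pre-LN from Post-LN.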

Architecture

The original transformer architecture is made up of two main components:

  • Encoder – Processes an input sequence to produce contextual representations.
  • Decoder – Uses those representations to generate an output sequence, one token at a time, with attention over both the encoder output and previously generated tokens.

This encoder-decoder setup is ideal for tasks like machine translation and summarization.

Over time, different configurations of the transformer architecture emerged:

  • Encoder-only models (e.g., BERT) – Focused on understanding and representation. They process input bidirectionally and are used for classification, question answering, and embedding tasks.
  • Decoder-only models (e.g., GPT) – Generate sequences autoregressively using only causal attention (see the mask sketch after this list). They're used for text generation, code completion, and reasoning tasks.
  • Encoder-decoder models (e.g., T5, BART) – Combine both for more complex tasks like summarization, translation, and instruction following.
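
Decoder-only models enforce autoregression with a causal mask on the attention scores. A minimal sketch, illustrative rather than from any particular library:

```python
import numpy as np

def causal_mask(seq_len):
    """Decoder-style mask: position i may only attend to positions <= i."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

# Added to the attention scores before the softmax, which zeroes out attention
# to future tokens: scores = Q @ K.T / sqrt(d_head) + causal_mask(seq_len)
print(causal_mask(4))
```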

A single transformer block contains:

  • Multi-head self-attention
  • Position-wise feed-forward network (FFN)
  • Add & Norm layers (residual connections + layer normalization)
  • Positional encodings to inject order into the sequence

And the full model typically stacks N such blocks in sequence:

[Input] → [Token + Positional Embeddings] → [Attention → FFN → Norm] × N → [Output]

In encoder-decoder models, the decoder additionally attends to encoder outputs:

[Encoder Input] → [Encoder Blocks] → [Context]
                                ↓
[Decoder Input] → [Masked Attention → Encoder Attention → FFN → Norm] × N → [Generated Output]

This flexibility in structure is what makes transformers so widely applicable across modalities and tasks.

Why it Changed the Game

1. Parallelizable Training

Unlike RNNs, transformers process all tokens simultaneously. That means:

  • Faster training (massively faster on GPUs/TPUs)
  • Better scalability across datasets

2. Long-Range Context

They handle long sequences better than previous architectures. This made it possible to understand paragraphs, documents, and even code.

3. Transfer Learning Sweet Spot

Transformers enabled pretraining on massive data, then fine-tuning on smaller, task-specific datasets.

Pretrain on large-scale internet text (e.g., Common Crawl, books, forums), then fine-tune to adapt for downstream tasks like chatbots, summarization, or code generation.

4. Multimodal Compatibility

Transformers work not just on text, but also:

  • Images (Vision Transformers, CLIP)
  • Audio (Whisper, Wav2Vec)
  • Video (TimeSformer, ViViT)
  • Code (CodeBERT)

One architecture. Many modalities.

Real-World Impact

  • OpenAI GPT series → Natural text generation and reasoning
  • Google BERT → Search and context-aware NLP
  • Meta DINOv2 → Self-supervised visual learning
  • DeepMind AlphaCode → Research system that reached roughly median-human performance in competitive programming using transformer-based code generation

It's not just research. It's product-critical.

Are Transformers Perfect?

Nope. Some challenges remain, like:

  • Quadratic time and memory cost with long inputs, because self-attention computes pairwise interactions between all tokens in a sequence, though workarounds like FlashAttention [5], Longformer [6], Reformer [7], and Performer [8] are emerging. A back-of-the-envelope sketch follows this list.
  • Training cost is huge.
  • Tokenization limitations – Subword tokenization mitigates out-of-vocabulary issues but can still introduce inefficiencies, inconsistencies across languages, and challenges with rare or complex morphology. Recent research explores byte-level and tokenizer-free models to improve this.
  • Limited context memory – Transformers have a fixed context window and do not retain information across separate sequences unless augmented with external memory mechanisms such as retrieval-augmented generation (RAG) or vector databases.
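
Here's a quick illustration of how that quadratic term grows, assuming fp16 scores and a single attention head:

```python
# Back-of-the-envelope: one fp16 attention matrix per head needs seq_len² scores.
for seq_len in (1_024, 8_192, 32_768):
    bytes_needed = seq_len * seq_len * 2          # 2 bytes per fp16 score
    print(f"{seq_len:>6} tokens -> {bytes_needed / 2**30:.3f} GiB")
# 1024 tokens -> 0.002 GiB; 8192 -> 0.125 GiB; 32768 -> 2.000 GiB
```

Multiply by the number of heads and layers, and naive attention over long documents quickly stops fitting in accelerator memory.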

But for all their flaws, the power-to-flexibility ratio is unmatched.

Final Thoughts

Transformers didn't just solve problems, they redefined what problems we could solve. They scaled with hardware, played well with data, and generalized across tasks and modalities.

In deep learning, it's rare for a single architecture to become the default across disciplines. Transformers did that.

And they're still evolving.

Footnotes

  1. https://arxiv.org/abs/1706.03762

  2. https://arxiv.org/abs/2005.08100

  3. https://arxiv.org/abs/2106.04803

  4. https://arxiv.org/abs/2104.01136

  5. https://arxiv.org/abs/2205.14135

  6. https://arxiv.org/abs/2004.05150

  7. https://arxiv.org/abs/2001.04451

  8. https://arxiv.org/abs/2009.14794
