Transformers — Learn Self-Attention
This version uses a light explanatory model instead of a heavy pseudo-math simulation. It shows how transformers use attention to connect related words across a sentence.
Learning path
Transformer steps
Move through the main ideas in the same order a student would usually learn them.
Current focus
1. Split Into Tokens
Transformers first break a sentence into small pieces called tokens so each part can be processed consistently.
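As a rough illustration of this step, the sketch below splits the example sentence into word-level tokens and gives each distinct token an integer id. The `tokenize` helper and the `vocab` mapping are hypothetical names for this example; production models typically use learned subword tokenizers rather than a word-level rule like this one.

```python
import re

def tokenize(sentence):
    # Teaching simplification: one token per word or punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize("The cat sat on the mat")

# Assign each distinct token an integer id in order of first appearance.
# The match is case-sensitive, so "The" and "the" get different ids.
vocab = {}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))
token_ids = [vocab[tok] for tok in tokens]

print(tokens)
print(token_ids)
```

The ids are what a model would actually consume: each one is later looked up in an embedding table to get a vector.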
Example sentence: The cat sat on the mat
Focus token: sat
Right now we are asking: when the model reads "sat", which other tokens are most useful?
Attention map
Which tokens matter most?
Darker tokens mean stronger attention from "sat". This is a simple teaching model, not a live neural network.
Token   Attention from "sat"
The     4%
cat     34%
sat     16%
on      8%
the     5%
mat     33%
What this shows
When the model looks at "sat", it pays strong attention to "cat" and "mat" because they help explain who acted and where.
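One way to see where percentages like these could come from: a model produces a raw score for each token, and a softmax turns those scores into weights that sum to 100%. The sketch below is an illustration only. The raw scores are an assumption, chosen as the logarithm of each displayed percentage so that the softmax reproduces the attention map exactly; a real model would compute its scores from learned parameters.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; the result is unchanged.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["The", "cat", "sat", "on", "the", "mat"]
shown = [0.04, 0.34, 0.16, 0.08, 0.05, 0.33]  # percentages from the attention map

# Assumption: hypothetical raw scores, picked as log(weight) so that
# softmax(scores) gives back the displayed percentages.
scores = [math.log(w) for w in shown]
weights = softmax(scores)

# Rank tokens from most to least attended.
ranked = sorted(zip(tokens, weights), key=lambda pair: pair[1], reverse=True)
for tok, w in ranked:
    print(f"{tok:>4} {w:.0%}")
```

Because softmax always normalizes, raising one token's score necessarily lowers the share left for the others.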
Multi-head idea
Real transformers do this several times in parallel. Each head can specialize in a different kind of relationship.
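The parallel-heads idea can be sketched in a few lines. In this simplified version, each token vector is sliced into equal parts, every head runs attention on its own slice, and the results are concatenated. The `attend` and `multi_head` helpers and the toy vectors are assumptions made for illustration; real models apply learned projections for each head rather than slicing raw vectors.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(vectors):
    # Simplified self-attention with no learned projections:
    # each vector acts as its own query, key, and value.
    dim = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)])
    return out

def multi_head(vectors, num_heads):
    head_dim = len(vectors[0]) // num_heads
    heads = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        # Each head sees only its own slice of every token vector.
        heads.append(attend([v[lo:hi] for v in vectors]))
    # Concatenate the per-head outputs back into one vector per token.
    return [sum((head[t] for head in heads), []) for t in range(len(vectors))]

# Three toy tokens with 4-dimensional vectors (made-up values).
toy = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
mixed = multi_head(toy, num_heads=2)
print(mixed)
```

Since each head works on different dimensions, the two heads can weight the same tokens differently, which is the sense in which heads can specialize.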
Encoder and decoder
Encoder
Reads the full input and builds context-rich token representations.
Decoder
Generates output step by step while looking back at the encoder output.
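The control flow of the two halves can be sketched as follows. This shows only the shape of the loop, not a working model: `encode`, `next_token`, and `generate` are hypothetical stand-ins, and the toy rule inside `next_token` (copy the input in reverse) simply replaces what a trained decoder would predict.

```python
def encode(tokens):
    # Stand-in for the encoder: it sees the whole input at once.
    # Here each "representation" is just (position, lowercased token).
    return [(i, tok.lower()) for i, tok in enumerate(tokens)]

def next_token(memory, generated):
    # Stand-in for one decoder step. It may consult the full encoder
    # output (memory) and everything generated so far. Toy rule: emit
    # the input tokens in reverse order, then signal the end.
    remaining = len(memory) - len(generated)
    return memory[remaining - 1][1] if remaining > 0 else "<end>"

def generate(tokens, max_steps=20):
    memory = encode(tokens)        # the encoder runs once
    generated = []
    for _ in range(max_steps):     # the decoder runs once per output token
        tok = next_token(memory, generated)
        if tok == "<end>":
            break
        generated.append(tok)
    return generated

result = generate(["The", "cat", "sat"])
print(result)
```

The point to notice is the asymmetry: the input is read in a single pass, while the output grows one token per loop iteration.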
Why transformers work well
They can connect nearby and distant words directly instead of walking through a sentence one step at a time.
What this light model omits
Real systems use learned embeddings, query/key/value projections, many layers, and matrix math at large scale.
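For readers curious about what the omitted math looks like, here is a minimal sketch of scaled dot-product self-attention with query, key, and value projections, written in plain Python. The toy embeddings and the identity projection matrices are assumptions chosen to keep the numbers easy to follow; in a real model these values are learned and far larger.

```python
import math

def matmul(a, b):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    # Project every token embedding into query, key, and value vectors.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # scores[i][j]: how strongly token i attends to token j.
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted average of the value vectors.
    return matmul(weights, v), weights

# Assumed toy data: 3 tokens with 2-dimensional embeddings.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for learned projections

out, weights = self_attention(x, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` plays the same role as the attention map in this tool: one row per focus token, one column per token being attended to.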
Teaching focus
This tool is designed to explain the idea of self-attention clearly before students move on to heavier model details.