Neural Networks · Light model · Self-attention · Deterministic demo · Classroom friendly

Transformers — Learn Self-Attention

This version uses a light explanatory model instead of a heavy pseudo-math simulation. It shows how transformers use attention to connect the important words in a sentence.

Learning path

Transformer steps

Move through the main ideas in the same order a student would usually learn them.

Current focus

1. Split Into Tokens

Transformers first break a sentence into small pieces called tokens so each part can be processed consistently.
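
As a rough illustration of this step, the sketch below splits a sentence into word-level tokens and gives each one an integer id. The function name tokenize and the id scheme are assumptions made for this example. Production models typically use learned subword tokenizers rather than a simple pattern like this.

```python
import re

def tokenize(sentence: str) -> list[str]:
    """Toy tokenizer: runs of word characters, plus single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize("The cat sat on the mat")

# Give each distinct token an integer id in order of first appearance.
# "The" and "the" differ in case, so this toy scheme treats them as different tokens.
vocab: dict[str, int] = {}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))
token_ids = [vocab[tok] for tok in tokens]

print(tokens)     # ['The', 'cat', 'sat', 'on', 'the', 'mat']
print(token_ids)  # [0, 1, 2, 3, 4, 5]
```

The model works on the ids (and the vectors looked up from them), not on the raw text.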

Example sentence

The cat sat on the mat

Focus token

Right now we are asking: when the model reads "sat", which other tokens are most useful?

Attention map

Which tokens matter most?

Darker tokens (higher percentages) mean stronger attention from "sat". This is a simple teaching model, not a live neural network.

Token    Attention from "sat"
The      4%
cat      34%
sat      16%
on       8%
the      5%
mat      33%

What this shows

When the model looks at "sat", it pays strong attention to "cat" and "mat" because they help explain who acted and where.
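
To show where percentages like these could come from, here is a minimal sketch of the softmax step that turns raw scores into attention weights. The raw scores are hypothetical: they are back-computed from the percentages in the map so that the output matches it, and do not come from any trained model.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["The", "cat", "sat", "on", "the", "mat"]
shown = [0.04, 0.34, 0.16, 0.08, 0.05, 0.33]  # percentages from the attention map

# Hypothetical raw scores, chosen so that softmax reproduces the map.
scores = [math.log(p) for p in shown]
weights = softmax(scores)

for tok, w in zip(tokens, weights):
    print(f"{tok:>3}: {w:.0%}")
```

Because softmax normalizes the scores, the weights always add up to 100%, so giving more attention to one token means giving less to the others.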

Multi-head idea

Real transformers do this several times in parallel. Each head can specialize in a different kind of relationship.

Head A: who is acting

Head B: what relates semantically

Head C: what refers to what
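
The multi-head idea can be sketched in a few lines, under simplifying assumptions: there are only two heads, the vectors are made up by hand, and each head simply looks at its own slice of every vector instead of using learned projections. The helper names attend and multi_head are illustrative only.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """One head: scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return mixed, weights

def multi_head(query, vectors, num_heads):
    """Slice every vector into equal parts, attend per slice, concatenate results."""
    head_dim = len(query) // num_heads
    output, per_head_weights = [], []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        sliced = [v[lo:hi] for v in vectors]
        mixed, weights = attend(query[lo:hi], sliced, sliced)
        output.extend(mixed)
        per_head_weights.append(weights)
    return output, per_head_weights

# Hand-picked 4-dimensional vectors (assumption); a real model learns these.
vectors = {
    "cat": [2.0, 0.0, 0.1, 0.0],
    "sat": [1.5, 0.2, 0.0, 1.5],
    "mat": [0.0, 0.1, 0.0, 2.0],
}
names = list(vectors)
output, heads = multi_head(vectors["sat"], list(vectors.values()), num_heads=2)
for h, weights in enumerate(heads):
    print(f"head {h}:", {n: round(w, 2) for n, w in zip(names, weights)})
```

With these hand-picked numbers, the first head puts most of its weight on "cat" and the second on "mat", which mirrors how different heads can pick up different relationships from the same sentence.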

Encoder and decoder

Encoder

Reads the full input and builds context-rich token representations.

Decoder

Generates output step by step while looking back at the encoder output.
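
The division of labor between the two parts can be sketched as a control-flow outline. Everything here is a stand-in: encode, decode_step, and the uppercase "prediction" rule are hypothetical placeholders for learned components, used only to show that the encoder runs once over the whole input while the decoder runs once per output token.

```python
END = "<end>"

def encode(tokens):
    """Stand-in encoder: reads the whole input at once.

    A real encoder returns context-rich vectors; here each "representation"
    is just the token paired with its position.
    """
    return [(pos, tok) for pos, tok in enumerate(tokens)]

def decode_step(memory, generated):
    """Stand-in decoder step: use the encoder output and the tokens so far."""
    step = len(generated)
    if step >= len(memory):
        return END
    _, source_token = memory[step]  # "look back" at the encoder output
    return source_token.upper()     # toy rule in place of a learned prediction

def generate(tokens, max_steps=20):
    memory = encode(tokens)         # encoder: one pass over the full input
    generated = []
    for _ in range(max_steps):      # decoder: one token per step
        next_token = decode_step(memory, generated)
        if next_token == END:
            break
        generated.append(next_token)
    return generated

print(generate(["the", "cat", "sat"]))  # ['THE', 'CAT', 'SAT']
```

The key point is the loop: each decoder step sees both the fixed encoder output and everything generated so far.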

Why transformers work well

They can connect nearby and distant words directly instead of walking through a sentence one step at a time.

What this light model omits

Real systems use learned embeddings, query/key/value projections, many layers, and matrix math at large scale.
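
For readers curious about the omitted pieces, the following is a minimal sketch of one self-attention layer with query/key/value projections, written in plain Python. The input vectors and the projection matrices w_q, w_k, and w_v are made-up numbers for illustration. In a real system they are learned, far larger, and computed with optimized matrix libraries.

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over all tokens at once."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    # One row of weights per token: how much that token attends to every token.
    weights = [
        softmax([sum(a * b for a, b in zip(q_i, k_j)) / scale for k_j in k])
        for q_i in q
    ]
    return matmul(weights, v), weights

# Three tokens with 2-dimensional embeddings (made-up values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Made-up projection matrices; a trained model would learn these.
w_q = [[0.5, 0.1], [0.2, 0.9]]
w_k = [[0.8, 0.3], [0.1, 0.7]]
w_v = [[1.0, 0.0], [0.0, 1.0]]

output, weights = self_attention(x, w_q, w_k, w_v)
for row in weights:
    print([round(w, 2) for w in row])
```

Each row of weights plays the same role as the attention map in this tool, and each output vector is a weighted mix of the value vectors. Stacking many such layers is what the phrase "many layers" refers to.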

Teaching focus

This tool is designed to explain the idea of self-attention clearly before students move on to heavier model details.