Decoder-Only Transformer

Click any component to explore it

Token IDs
["The", "cat", "sat"] → [1423, 812, 991]
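A minimal sketch of this lookup step. The vocabulary here is a hypothetical three-entry table that just mirrors the example above; real tokenizers (BPE, WordPiece) split text into subwords before mapping to IDs.

```python
# Toy word-to-ID table; the IDs match the example above but are
# otherwise arbitrary stand-ins for a real tokenizer's vocabulary.
vocab = {"The": 1423, "cat": 812, "sat": 991}

def encode(tokens):
    """Map each token string to its integer ID."""
    return [vocab[t] for t in tokens]

ids = encode(["The", "cat", "sat"])
print(ids)  # [1423, 812, 991]
```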
Token + Positional Embedding
integer IDs → dense vectors
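A sketch of the embedding step, using randomly initialized tables and toy sizes (`vocab_size`, `d_model`, `max_len` are assumptions, not values from the diagram). In a trained model both tables are learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, max_len = 2048, 16, 32   # toy sizes, not real model dims

tok_emb = rng.normal(size=(vocab_size, d_model))  # one row per vocabulary entry
pos_emb = rng.normal(size=(max_len, d_model))     # one row per position

def embed(ids):
    """Look up each ID's vector and add the vector for its position."""
    x = tok_emb[ids]            # (seq_len, d_model)
    x = x + pos_emb[:len(ids)]  # position info, added elementwise
    return x

x = embed([1423, 812, 991])
print(x.shape)  # (3, 16)
```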
Transformer Block × N
Masked Self-Attention
multi-head, causal
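A numpy sketch of causal multi-head attention. The weight matrices are random stand-ins rather than trained parameters; the important part is the triangular mask, which stops each position from attending to positions after it.

```python
import numpy as np

def causal_self_attention(x, n_heads=4, seed=0):
    """Multi-head self-attention with a causal mask (random stand-in weights)."""
    rng = np.random.default_rng(seed)
    seq, d = x.shape
    dh = d // n_heads
    Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))

    def split_heads(m):
        return m.reshape(seq, n_heads, dh).transpose(1, 0, 2)  # (heads, seq, dh)

    q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)            # (heads, seq, seq)

    # Causal mask: position i may only attend to positions <= i.
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)

    weights = np.exp(scores - scores.max(-1, keepdims=True))   # softmax per row
    weights /= weights.sum(-1, keepdims=True)

    out = (weights @ v).transpose(1, 0, 2).reshape(seq, d)     # merge heads
    return out @ Wo

x = np.random.default_rng(1).normal(size=(3, 16))
y = causal_self_attention(x)
print(y.shape)  # (3, 16)
```

Because of the mask, perturbing a later token cannot change the output at earlier positions, which is what makes next-token training and generation work.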
Add & Norm
residual + layer normalization
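The "Add & Norm" step in a minimal sketch: add the sublayer's output back to its input, then normalize each position's vector. The learnable scale/shift parameters are simplified to scalars here.

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each position's vector to zero mean and unit variance,
    then scale and shift (gamma/beta would be learned vectors in practice)."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer_out)

x = np.random.default_rng(0).normal(size=(3, 16))
y = add_and_norm(x, 0.1 * x)   # stand-in for a sublayer's output
print(y.shape)  # (3, 16)
```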
Feed-Forward Network
Linear → GELU → Linear
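A sketch of the position-wise feed-forward network with random stand-in weights; the hidden width `d_ff` is a toy value (in practice it is commonly 4× the model dimension). The GELU here is the tanh approximation popularized by GPT-2.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    """Linear -> GELU -> Linear, applied independently at each position."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d, d_ff = 16, 64                       # toy sizes; d_ff is typically 4 * d
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)

x = rng.normal(size=(3, d))
print(feed_forward(x, W1, b1, W2, b2).shape)  # (3, 16)
```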
Add & Norm
residual + layer normalization
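Putting the block together: a sketch of the wiring shown in the diagram, with the two sublayers passed in as callables and trivial lambdas standing in for real attention and FFN implementations. This is the post-norm layout (normalize after each residual add), matching the diagram's ordering.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def transformer_block(x, attention, ffn):
    """One block: each sublayer's output is added to its input, then normalized."""
    x = layer_norm(x + attention(x))   # masked self-attention -> Add & Norm
    x = layer_norm(x + ffn(x))         # feed-forward network  -> Add & Norm
    return x

# Trivial stand-ins just to show the wiring; stacking N such blocks
# gives the "Transformer Block x N" in the diagram.
x = np.random.default_rng(0).normal(size=(3, 16))
out = transformer_block(x, attention=lambda h: 0.5 * h, ffn=lambda h: h - 1.0)
print(out.shape)  # (3, 16)
```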
Linear Projection
hidden dim → vocabulary size
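A sketch of the output projection, with toy dimensions and a random stand-in weight matrix (in many models this matrix is tied to the token embedding table). Each position's hidden vector becomes one score, a logit, per vocabulary entry.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 16, 2048                 # toy sizes, assumptions
W_lm = rng.normal(size=(d_model, vocab_size))  # often weight-tied to embeddings

h = rng.normal(size=(3, d_model))  # final hidden states, one per position
logits = h @ W_lm                  # (seq_len, vocab_size)
print(logits.shape)  # (3, 2048)
```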
Softmax + Sample
next-token probabilities
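The final step in a minimal sketch: softmax over the logits at the last position, then sample one token ID from the resulting distribution. The temperature parameter and the toy 4-token logits are illustrative choices, not values from the diagram.

```python
import numpy as np

def sample_next(logits, temperature=1.0, rng=None):
    """Convert logits to probabilities and sample a single token ID."""
    rng = rng or np.random.default_rng()
    z = logits / temperature
    z = z - z.max()                       # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()   # softmax
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.0, 0.5, -1.0, 3.0])  # toy logits over a 4-token vocab
next_id = sample_next(logits, temperature=0.8, rng=np.random.default_rng(0))
print(next_id)
```

Lower temperatures sharpen the distribution toward the highest-logit token; greedy decoding is the `temperature -> 0` limit (just `argmax`). The sampled ID is fed back in as the next input token, and the whole stack runs again.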