Token + Positional Embedding
integer token IDs → dense vectors; learned position embeddings are added so token order is preserved
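The embedding step can be sketched as a pair of lookup tables, one indexed by token ID and one by position, whose rows are summed. This is a minimal NumPy sketch; the toy sizes and names (`vocab_size`, `d_model`, etc.) are illustrative assumptions, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 100, 16, 8   # toy sizes (assumptions)

tok_emb = rng.normal(size=(vocab_size, d_model))  # one row per token ID
pos_emb = rng.normal(size=(max_len, d_model))     # one row per position

def embed(ids):
    # look up each token's vector and add the vector for its position
    ids = np.asarray(ids)
    return tok_emb[ids] + pos_emb[: len(ids)]

x = embed([5, 42, 7])   # shape (3, 8): sequence length x model dimension
```

In a trained model both tables are learned parameters; here they are random placeholders.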
Transformer Block × N
Masked Self-Attention
multi-head; a causal mask blocks each position from attending to future positions
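Masked multi-head self-attention can be sketched as follows. This is a simplified NumPy version under assumed toy shapes (single sequence, no batching, weights passed in explicitly); it shows the head split, the scaled dot-product scores, and the causal mask that sets future positions to a large negative value before the softmax:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(x, wq, wk, wv, wo, n_heads):
    # x: (T, d_model); wq/wk/wv/wo: (d_model, d_model) projections (toy sketch)
    T, d = x.shape
    dh = d // n_heads
    def split(h):
        # (T, d) -> (n_heads, T, dh)
        return h.reshape(T, n_heads, dh).transpose(1, 0, 2)
    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)   # (H, T, T)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -1e9, scores)             # block attention to the future
    att = softmax(scores) @ v                         # (H, T, dh)
    return att.transpose(1, 0, 2).reshape(T, d) @ wo  # merge heads, output proj

rng = np.random.default_rng(0)
T, d, H = 4, 8, 2
x = rng.normal(size=(T, d))
w = [rng.normal(size=(d, d)) * 0.1 for _ in range(4)]
y = causal_self_attention(x, *w, n_heads=H)
```

Because of the mask, output row `i` depends only on input rows `0..i`, which is what makes next-token training and autoregressive generation possible.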
Add & Norm
residual + layer normalization
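The Add & Norm step can be sketched directly from its name: add the sublayer's output back onto its input, then layer-normalize each token vector. A minimal NumPy sketch (the post-norm ordering shown matches this diagram; some GPT variants instead normalize before the sublayer):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # normalize each token vector to zero mean / unit variance, then scale and shift
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer_out, gamma, beta):
    # residual connection followed by layer normalization
    return layer_norm(x + sublayer_out, gamma, beta)

d = 8
x = np.random.default_rng(0).normal(size=(3, d))
y = add_and_norm(x, np.tanh(x), np.ones(d), np.zeros(d))
```

The residual path lets gradients and the original signal flow around each sublayer; the normalization keeps activations in a stable range as blocks stack.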
Feed-Forward Network
Linear → GELU → Linear
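The feed-forward network is applied to each position independently: expand to a wider hidden size (commonly 4× the model dimension), apply GELU, and project back. A minimal NumPy sketch with assumed toy sizes, using the tanh approximation of GELU:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, w1, b1, w2, b2):
    # position-wise MLP: expand, apply nonlinearity, project back down
    return gelu(x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
d, d_ff = 8, 32   # toy sizes; d_ff = 4 * d is a common choice
w1 = rng.normal(size=(d, d_ff)) * 0.1
w2 = rng.normal(size=(d_ff, d)) * 0.1
h = feed_forward(rng.normal(size=(3, d)), w1, np.zeros(d_ff), w2, np.zeros(d))
```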
Add & Norm
residual + layer normalization
Linear Projection
hidden dimension → vocabulary-size logits
Softmax + Sample
next-token probabilities; sample a token (or take the argmax)
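The final two stages can be sketched together: project the last position's hidden state to one logit per vocabulary entry, softmax the logits into a probability distribution, and sample from it. A minimal NumPy sketch; the `temperature` parameter and toy sizes are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sample_next_token(hidden, w_out, temperature=1.0, rng=None):
    # hidden: (d_model,) final hidden state of the last position
    logits = hidden @ w_out                  # (vocab_size,) unnormalized scores
    probs = softmax(logits / temperature)    # temperature sharpens or flattens
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
d, vocab = 8, 100
w_out = rng.normal(size=(d, vocab)) * 0.1
token = sample_next_token(rng.normal(size=d), w_out, rng=rng)
```

Generation loops this whole pipeline: the sampled token ID is appended to the input and fed back through the embedding layer for the next step.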