Token + Positional Embedding
integer token IDs → dense vectors; learned position embeddings are added so token order is preserved
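The embedding step can be sketched as a pair of lookup tables, one indexed by token ID and one by position, whose rows are summed. This is a minimal NumPy sketch; the toy sizes and names (`vocab_size`, `d_model`, etc.) are illustrative assumptions, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 100, 16, 8   # toy sizes (assumptions)

tok_emb = rng.normal(size=(vocab_size, d_model))  # one row per token ID
pos_emb = rng.normal(size=(max_len, d_model))     # one row per position

def embed(ids):
    # look up each token's vector and add the vector for its position
    ids = np.asarray(ids)
    return tok_emb[ids] + pos_emb[: len(ids)]

x = embed([5, 42, 7])   # shape (3, 8): sequence length x model dimension
```

In a trained model both tables are learned parameters; here they are random placeholders.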
Transformer Block × N
Masked Self-Attention
multi-head; a causal mask blocks each position from attending to future positions
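Masked multi-head self-attention can be sketched as follows. This is a simplified NumPy version under assumed toy shapes (single sequence, no batching, weights passed in explicitly); it shows the head split, the scaled dot-product scores, and the causal mask that sets future positions to a large negative value before the softmax:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(x, wq, wk, wv, wo, n_heads):
    # x: (T, d_model); wq/wk/wv/wo: (d_model, d_model) projections (toy sketch)
    T, d = x.shape
    dh = d // n_heads
    def split(h):
        # (T, d) -> (n_heads, T, dh)
        return h.reshape(T, n_heads, dh).transpose(1, 0, 2)
    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)   # (H, T, T)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -1e9, scores)             # block attention to the future
    att = softmax(scores) @ v                         # (H, T, dh)
    return att.transpose(1, 0, 2).reshape(T, d) @ wo  # merge heads, output proj

rng = np.random.default_rng(0)
T, d, H = 4, 8, 2
x = rng.normal(size=(T, d))
w = [rng.normal(size=(d, d)) * 0.1 for _ in range(4)]
y = causal_self_attention(x, *w, n_heads=H)
```

Because of the mask, output row `i` depends only on input rows `0..i`, which is what makes next-token training and autoregressive generation possible.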
Add & Norm
residual + layer normalization
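The Add & Norm step can be sketched directly from its name: add the sublayer's output back onto its input, then layer-normalize each token vector. A minimal NumPy sketch (the post-norm ordering shown matches this diagram; some GPT variants instead normalize before the sublayer):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # normalize each token vector to zero mean / unit variance, then scale and shift
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer_out, gamma, beta):
    # residual connection followed by layer normalization
    return layer_norm(x + sublayer_out, gamma, beta)

d = 8
x = np.random.default_rng(0).normal(size=(3, d))
y = add_and_norm(x, np.tanh(x), np.ones(d), np.zeros(d))
```

The residual path lets gradients and the original signal flow around each sublayer; the normalization keeps activations in a stable range as blocks stack.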
Feed-Forward Network
Linear → GELU → Linear
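The feed-forward network is applied to each position independently: expand to a wider hidden size (commonly 4× the model dimension), apply GELU, and project back. A minimal NumPy sketch with assumed toy sizes, using the tanh approximation of GELU:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, w1, b1, w2, b2):
    # position-wise MLP: expand, apply nonlinearity, project back down
    return gelu(x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
d, d_ff = 8, 32   # toy sizes; d_ff = 4 * d is a common choice
w1 = rng.normal(size=(d, d_ff)) * 0.1
w2 = rng.normal(size=(d_ff, d)) * 0.1
h = feed_forward(rng.normal(size=(3, d)), w1, np.zeros(d_ff), w2, np.zeros(d))
```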
Add & Norm
residual + layer normalization
Linear Projection
hidden dimension → vocabulary-size logits
Softmax + Sample
next-token probabilities; sample a token (or take the argmax)
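The final two stages can be sketched together: project the last position's hidden state to one logit per vocabulary entry, softmax the logits into a probability distribution, and sample from it. A minimal NumPy sketch; the `temperature` parameter and toy sizes are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sample_next_token(hidden, w_out, temperature=1.0, rng=None):
    # hidden: (d_model,) final hidden state of the last position
    logits = hidden @ w_out                  # (vocab_size,) unnormalized scores
    probs = softmax(logits / temperature)    # temperature sharpens or flattens
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
d, vocab = 8, 100
w_out = rng.normal(size=(d, vocab)) * 0.1
token = sample_next_token(rng.normal(size=d), w_out, rng=rng)
```

Generation loops this whole pipeline: the sampled token ID is appended to the input and fed back through the embedding layer for the next step.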