Transformers
Transformer Basics
- Rotary embeddings (review)
- LayerNorm: projecting the latent onto a hypersphere
- MQA, GQA
- SwiGLU
- Prenorm vs. postnorm
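A minimal PyTorch sketch pulling these pieces together into one decoder-style block (assumes PyTorch >= 2.0 for `scaled_dot_product_attention`; all class names, dimensions, and head counts below are illustrative, not from these notes). It shows rotary embeddings applied to Q/K, grouped-query attention (one K/V head shared per group of query heads; a single K/V head recovers MQA), a SwiGLU feed-forward, and prenorm residual placement. The LayerNorm-as-hypersphere point is that normalizing each token vector (mean-subtract, rescale to unit variance) roughly pins it to a sphere before the learned gain/bias.

```python
# Minimal prenorm decoder block sketch (PyTorch); hyperparameters are hypothetical.
# Shape names: B = batch, T = sequence length, D = model dim, Dh = head dim.
import torch
import torch.nn as nn
import torch.nn.functional as F


def rope(x, base=10000.0):
    # x: (B, H, T, Dh). Rotate channel pairs by position-dependent angles
    # (rotary embeddings, "rotate-half" formulation).
    B, H, T, Dh = x.shape
    half = Dh // 2
    freqs = base ** (-torch.arange(0, half, dtype=x.dtype, device=x.device) / half)
    angles = torch.arange(T, dtype=x.dtype, device=x.device)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()          # (T, half), broadcast over B, H
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class GQAttention(nn.Module):
    # Grouped-query attention: n_q query heads share n_kv K/V heads.
    # n_kv == 1 recovers MQA; n_kv == n_q recovers standard multi-head attention.
    def __init__(self, d_model, n_q, n_kv):
        super().__init__()
        assert n_q % n_kv == 0
        self.hd, self.nq, self.nkv = d_model // n_q, n_q, n_kv
        self.wq = nn.Linear(d_model, n_q * self.hd, bias=False)
        self.wk = nn.Linear(d_model, n_kv * self.hd, bias=False)
        self.wv = nn.Linear(d_model, n_kv * self.hd, bias=False)
        self.wo = nn.Linear(n_q * self.hd, d_model, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.nq, self.hd).transpose(1, 2)
        k = self.wk(x).view(B, T, self.nkv, self.hd).transpose(1, 2)
        v = self.wv(x).view(B, T, self.nkv, self.hd).transpose(1, 2)
        q, k = rope(q), rope(k)
        # Broadcast each K/V head across its group of query heads.
        rep = self.nq // self.nkv
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(B, T, -1))


class SwiGLU(nn.Module):
    # SwiGLU feed-forward: silu(x W_gate) * (x W_up), then project back down.
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))


class PrenormBlock(nn.Module):
    # Prenorm: normalize *before* each sublayer, keep the residual path clean.
    # (Postnorm instead applies the norm to the residual sum after each sublayer.)
    # The norm is what (roughly) projects each token's latent onto a hypersphere.
    def __init__(self, d_model=512, n_q=8, n_kv=2):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = GQAttention(d_model, n_q, n_kv)
        self.ffn = SwiGLU(d_model, 4 * d_model)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x
```

The GQA choice here is just to show the spectrum: fewer K/V heads shrink the KV cache at inference, with MQA as the extreme case.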
Transformers are MLPs
- instead of pointwise nonlinearities, we have tokenwise MLPs
- tokens instead of neurons
- instead of fixed weights, attention (the mixing over tokens is data-dependent; see the sketch below)
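To make the analogy concrete, a toy single-head sketch (names, shapes, and the random inputs are illustrative only): a fixed-weight layer mixes positions with a constant matrix, while attention computes its mixing matrix A = softmax(QK^T / sqrt(d)) from the input itself, then a tokenwise MLP would act on each row of the result independently, the way a pointwise nonlinearity acts on each neuron.

```python
# Toy illustration: attention as a data-dependent weight matrix over tokens.
import torch

T, D = 5, 8                       # T tokens play the role of "neurons"
x = torch.randn(T, D)             # one sequence, no batch dim
Wq, Wk, Wv = (torch.randn(D, D) / D**0.5 for _ in range(3))

# Fixed-weight layer: y = W x, with W independent of the input.
W_fixed = torch.randn(T, T) / T**0.5
y_fixed = W_fixed @ x

# Attention: the mixing matrix A is computed from the input itself.
q, k, v = x @ Wq, x @ Wk, x @ Wv
A = torch.softmax(q @ k.T / D**0.5, dim=-1)   # (T, T), each row sums to 1
y_attn = A @ v                                # data-dependent mixing of tokens

print(A.shape, y_attn.shape)      # torch.Size([5, 5]) torch.Size([5, 8])
```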
Last Reviewed: 6/1/24