Mason Wang

Transformers

Transformer Basics
Rotary Embeddings (Review)
LayerNorm, projecting the latent onto a hypersphere
MQA, GQA
SwiGLU
Prenorm vs. postnorm
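
A minimal numpy sketch of the "LayerNorm, projecting the latent onto a hypersphere" point (function name and shapes are mine, just for illustration): after centering and dividing by the standard deviation, and before the learned gain and bias, every vector ends up with the same norm sqrt(d), i.e. on a hypersphere.

import numpy as np

def layer_norm(x, eps=1e-5):
    # Center each vector and scale it to unit variance; learned gain/bias omitted.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

d = 64
x = np.random.randn(8, d) * 3.0 + 1.0   # 8 latent vectors with arbitrary scale and offset
y = layer_norm(x)
print(np.linalg.norm(y, axis=-1))       # all ~8.0 = sqrt(64): points on a hypersphere

(Strictly, the centering also confines the vectors to the hyperplane orthogonal to the all-ones direction, so they lie on a sphere inside that (d-1)-dimensional subspace.)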

Transformers are MLPs

Instead of pointwise scalar nonlinearities, we have tokenwise MLPs.
Tokens play the role of neurons.

Instead of fixed weights, attention: the mixing weights are computed from the input tokens themselves.
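
A minimal numpy sketch of this analogy (variable names are mine; residual connections, multiple heads, and normalization are omitted): an MLP layer mixes neurons with a fixed learned matrix and then applies a pointwise nonlinearity, while an attention block mixes tokens with a matrix A recomputed from the tokens themselves and then applies the same small MLP to every token.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d = 5, 16                                   # 5 tokens, model width 16

# MLP view: a FIXED weight matrix mixes neurons, then a pointwise nonlinearity.
W_fixed = rng.standard_normal((d, d)) / np.sqrt(d)
h = rng.standard_normal(d)
mlp_layer_out = np.maximum(0.0, h @ W_fixed)   # relu(h W), same W for every input

# Attention view: a DATA-DEPENDENT matrix A, computed from the tokens
# themselves, mixes tokens; then a tokenwise MLP acts on each token.
X = rng.standard_normal((T, d))                # token latents
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
A = softmax(Q @ K.T / np.sqrt(d))              # (T, T) mixing weights, re-derived per input
attn_out = A @ V                               # tokens mixed by A

# Tokenwise MLP: the same 2-layer net applied to each token independently,
# playing the role the pointwise nonlinearity plays in an ordinary MLP.
W1 = rng.standard_normal((d, 4 * d)) / np.sqrt(d)
W2 = rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)
block_out = np.maximum(0.0, attn_out @ W1) @ W2   # (T, d)

The contrast the notes point at: W_fixed is the same for every input, whereas A is data-dependent and recomputed on every forward pass.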

Last Reviewed: 6/1/24