Mason Wang

Language Modeling from Scratch

BPE

Few-Shot/Zero-Shot Generalization

Scaling Laws - parameters, data, and training time each give roughly linear curves against loss on a log-log plot
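
A straight line on log-log axes corresponds to a power law: if loss = SCALE * N^(-ALPHA), then log(loss) = log(SCALE) - ALPHA * log(N). The sketch below illustrates this by fitting a line to the logs of some points. The data is synthetic and the constants ALPHA and SCALE are made up for illustration; they are not measured results, and the helper name fit_power_law is hypothetical.

```python
import math


def fit_power_law(xs, losses):
    """Least-squares line through (log x, log loss).

    Returns (slope, intercept) of the fitted line in log-log space.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data only: ALPHA and SCALE are arbitrary illustrative values.
ALPHA = 0.08
SCALE = 10.0
param_counts = [1e6, 1e7, 1e8, 1e9]
synthetic_losses = [SCALE * n ** (-ALPHA) for n in param_counts]

slope, intercept = fit_power_law(param_counts, synthetic_losses)
print(f"slope = {slope:.4f}, prefactor = {math.exp(intercept):.4f}")
```

Because the synthetic points lie exactly on a power law, the fitted slope recovers -ALPHA and exp(intercept) recovers SCALE. With real training runs the points would only approximately follow a line, and the same fit could be repeated with dataset size or training time on the x-axis.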

Last Reviewed: 6/1/24