Mason Wang

LLMs

reinforcement learning - give rewards

Language = special, natural domain for studying AI - it is the modality of the frontier.

supervised learning

Scale.

People are very conservative because training runs are so expensive.

post-training

Now there are many stages:

pretraining - base data mixture -> more code in the mixture -> even more code + synthetic data (illustrative config below)

midtraining - context expansion, reasoning-heavy data

post-training: SFT, then DPO / RL
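A hypothetical way to write those stage mixtures down as config - the dataset names and weights below are made up for illustration, not numbers from the lecture:

```python
# Hypothetical data-mixture config per training stage.
# Names and weights are illustrative only.
STAGES = {
    "pretrain_v1": {"web": 0.85, "code": 0.10, "synthetic": 0.05},  # base mixture
    "pretrain_v2": {"web": 0.70, "code": 0.25, "synthetic": 0.05},  # more code
    "pretrain_v3": {"web": 0.50, "code": 0.30, "synthetic": 0.20},  # even more code + synthetic
    "midtrain":    {"long_context": 0.50, "reasoning": 0.50},       # context expansion, reasoning-heavy
}

# sanity check: each stage's weights sum to 1
assert all(abs(sum(mix.values()) - 1.0) < 1e-9 for mix in STAGES.values())
```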

Degrees of openness: open research, open models, open source vs. open weights (Llama 1-4 are open weights, not open source), open development.

tokenization - word-based: breaks on misspellings, and you will encounter words you haven’t seen before.

characters - sequences get very long, and the representation for “c” isn’t very meaningful.

subword tokens - the middle ground (e.g. BPE).
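A minimal sketch of how BPE-style subword tokens get learned - repeatedly merge the most frequent adjacent symbol pair (toy corpus; skips the symbol-boundary edge cases a real tokenizer handles):

```python
from collections import Counter

def pair_counts(corpus):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge(corpus, pair):
    # fuse every occurrence of the pair into one new symbol
    old, new = " ".join(pair), "".join(pair)
    return {w.replace(old, new): f for w, f in corpus.items()}

# words pre-split into characters, with corpus counts
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(8):
    counts = pair_counts(corpus)
    if not counts:
        break
    best = max(counts, key=counts.get)
    corpus = merge(corpus, best)
    merges.append(best)

print(merges)   # first merges: ('e', 's'), ('es', 't'), ...
print(corpus)   # words now split into learned subwords
```

Misspellings and unseen words still tokenize - they just fall back to smaller pieces.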

Pre-training

Don’t use BERT:

1 - you only get learning signal at the masked words - inefficient.

2 - to make it useful you need to generate text; instead of classifying something, just generate the word for the class.

3 - an encoder re-encodes all tokens from scratch at every generation step.
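A quick illustration of point 1 - how many positions per sequence actually contribute loss under masked vs. causal language modeling (15% is BERT’s published mask rate):

```python
import numpy as np

seq_len = 512
rng = np.random.default_rng(0)

# masked LM (BERT-style): loss only at the ~15% masked positions
mlm_loss_positions = int((rng.random(seq_len) < 0.15).sum())

# causal LM (GPT-style): every token predicts the next one
clm_loss_positions = seq_len - 1

print(mlm_loss_positions, clm_loss_positions)  # roughly 77 vs. 511
```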

web data gives you a lot - factual knowledge, syntax + coreference, sentiment, math, prompting.

choosing the data mixture is how you design structures into your weights

Crawling: start with some seed links, hop to adjacent links. Most of the web is gibberish, so oversample Wikipedia, arXiv, GitHub, Reddit, StackExchange. Reddit is a good signal: people ask questions, get multiple answers with upvotes and links - so take pages linked to from Reddit posts or Wikipedia.
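A minimal sketch of that crawl loop, assuming plain breadth-first hops from seed links (a real pipeline adds politeness, dedup, and quality filtering):

```python
import re
from collections import deque
from urllib.parse import urljoin

import requests

def crawl(seed_urls, max_pages=100):
    # breadth-first: start from seed links, hop to adjacent links
    seen, queue, pages = set(seed_urls), deque(seed_urls), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        pages[url] = html
        for href in re.findall(r'href="(https?://[^"]+)"', html):
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

# oversampling happens downstream: upweight pages from (or linked by)
# Wikipedia, arXiv, GitHub, Reddit, StackExchange
```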

can also ask an LLM to judge page quality
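A sketch of that LLM-based filter, assuming some `complete(prompt) -> str` wrapper around whatever model you use (the prompt wording is invented):

```python
def looks_high_quality(page_text: str, complete) -> bool:
    # `complete` is any prompt -> completion callable (hypothetical interface)
    prompt = (
        "Rate the educational value of this web page from 1 (spam) "
        "to 5 (excellent). Reply with a single digit.\n\n"
        + page_text[:2000]  # truncate to keep the call cheap
    )
    return complete(prompt).strip() in {"4", "5"}
```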

Long Contexts

pretraining = ~4,000-token context window; midtraining = much longer, half a million to a million tokens
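One common way the context expansion is done - position interpolation for RoPE (my example, not necessarily the method from the lecture): squeeze the longer position range back into the range the model was pretrained on.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    # standard RoPE frequencies; position interpolation divides positions
    # by `scale` so long contexts map into the pretrained position range
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions / scale, inv_freq)

# pretrained at 4k context; to run at 512k, interpolate positions 128x
short_ctx = rope_angles(np.arange(4_096), dim=128)
long_ctx = rope_angles(np.arange(524_288), dim=128, scale=128.0)
print(short_ctx.shape, long_ctx.shape)  # (4096, 64) (524288, 64)
```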

100 H100s = enough to train small LLMs

Babysit training runs, look for spikes, track evals.
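A minimal loss-spike check of the kind you’d run while babysitting (window and threshold are arbitrary choices):

```python
def is_loss_spike(losses, window=100, factor=1.5):
    # flag the latest step if its loss jumps well above the
    # running mean of the previous `window` steps
    if len(losses) <= window:
        return False
    recent = losses[-window - 1:-1]
    return losses[-1] > factor * (sum(recent) / len(recent))

# in the training loop:
#   losses.append(step_loss)
#   if is_loss_spike(losses): checkpoint, skip the batch, or cut the LR
```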

Tweaks to the transformer

how to budget compute
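One standard answer (the Chinchilla heuristic - my addition, the lecture may have used something else): compute C ≈ 6ND FLOPs, with roughly 20 training tokens per parameter.

```python
def chinchilla_split(flops_budget):
    # C ~= 6 * N * D and D ~= 20 * N  =>  C = 120 * N^2
    n_params = (flops_budget / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

n, d = chinchilla_split(1e23)
print(f"~{n / 1e9:.0f}B params, ~{d / 1e9:.0f}B tokens")  # ~29B params, ~577B tokens
```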

Last Reviewed: 10/26/2025