LLMs
- Language
- knowledge
- retrieval and tool use
- in-context learning (learn new skills quickly)
- alignment
- capability
- safety
- reasoning
reinforcement learning: give rewards
- how does it know the reason it succeeded? (credit assignment)
- does random stuff if you don't pretrain first
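A minimal sketch of why "how does it know the reason it succeeded?" is hard: in a vanilla REINFORCE-style update, one scalar reward for the whole generated sequence weights every token's log-probability equally, so the update cannot tell which step caused the success. This is an illustration under simple assumptions, not the lecture's algorithm.

```python
def reinforce_token_losses(token_logprobs, reward, baseline=0.0):
    """Vanilla REINFORCE on one generated sequence.

    token_logprobs: log p(token_t | prefix) for each generated token.
    reward: a single scalar for the whole sequence (e.g., 1 if the final
        answer was correct, else 0).
    Every token gets the same (reward - baseline) weight, which is the
    credit-assignment problem.
    """
    advantage = reward - baseline
    # loss to minimize: -(advantage) * log p(token), per token
    return [-advantage * lp for lp in token_logprobs]

# toy usage: 4 generated tokens, correct final answer
losses = reinforce_token_losses([-0.2, -1.3, -0.5, -2.0], reward=1.0)
print(sum(losses))
```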
Language = special, a natural domain for studying AI; this is the modality of the frontier.
supervised learning
- question-answer pairs: SQuAD - QA dataset, 2016
- HotpotQA - multi-hop QA over pairs of Wikipedia pages, 2018
- MS MARCO - ~1,000,000 Bing queries
- Natural Questions
Scale.
- self-supervised learning (large n-gram LMs in machine translation, 2007)
- 1-10 trillion params (GPT-5).
People are very conservative due to cost of training
post-training
- teaching LLMs to be instruction-following assistants, effective at math, coding, using tools.
Now there are many stages:
pretraining - base data mixture, more code data-mixture, even more code + synthetic data mixture
- training on code is hypothesized to improve reasoning in general.
midtraining - context expansion, reasoning-heavy data
post-training: SFT, DPO / RL
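A hedged sketch of the DPO objective from the stages above (my illustration; the logp_* names are made up, and real implementations batch this over many preference pairs):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_* are summed log-probabilities of the chosen/rejected responses
    under the policy being trained; ref_logp_* are the same quantities
    under the frozen reference (SFT) model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written as log1p for stability
    return math.log1p(math.exp(-margin))

# toy usage: policy prefers the chosen response a bit more than the reference does
print(dpo_loss(-12.0, -15.0, -13.0, -14.5))
```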
open research: open models, open source (not just open weights, like Llama 1-4), open development
tokenization
- word-based: misspellings, will encounter words you haven't seen before
- characters: very long sequences; the representation for "c" is not very meaningful
- subword tokens (e.g., BPE) are the usual compromise.
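A tiny sketch of how a subword (BPE) vocabulary gets learned; real tokenizers operate on bytes and far larger corpora, and the toy word counts below are made up:

```python
from collections import Counter

def learn_bpe_merges(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dict.

    Each word starts as a sequence of characters; repeatedly merge the
    most frequent adjacent pair into a new subword token.
    """
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# toy corpus: unseen or misspelled words still decompose into learned subwords
print(learn_bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=5))
```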
Pre-training
Don't use BERT:
1 - only learning about the masked word, inefficient
2 - if you want to make it useful, you need to generate text; instead of classifying something, just generate the word of the class
3 - have to encode all tokens from scratch when generating (no incremental decoding)
web knowledge - factual, syntax + coreference, sentiment, math, prompting.
design structures into your weights
- start with some links, hop to adjacent links - most of the web is gibberish
- oversample Wikipedia, arXiv, GitHub, Reddit, StackExchange
- people ask questions on Reddit: multiple answers, upvotes, links - keep pages linked to from Reddit posts or Wikipedia
- can ask an LLM to judge page quality (model-based filtering)
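A hedged sketch of what "oversample Wikipedia, arXiv, GitHub, ..." can look like as a sampling mixture; the source names and weights here are made up for illustration, not the actual mixture:

```python
import random

# Hypothetical per-source sampling weights; high-quality sources get far
# more weight than their share of the raw crawl.
MIXTURE = {
    "common_crawl":  0.55,
    "wikipedia":     0.10,
    "arxiv":         0.08,
    "github":        0.12,
    "reddit_linked": 0.10,   # pages linked from upvoted Reddit posts
    "stackexchange": 0.05,
}

def sample_source(rng=random):
    """Pick which corpus the next training document is drawn from."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

print([sample_source() for _ in range(5)])
```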
Long Contexts
- pretraining: ~4,000-token context window
- midtraining: longer, half a million to a million tokens
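One common recipe for this kind of context expansion (an assumption on my part, not necessarily what was described) is position interpolation on rotary embeddings: scale positions down so a long sequence reuses the position range seen in pretraining.

```python
import numpy as np

def rope_angles(seq_len, head_dim, base=10000.0, scale=1.0):
    """Rotary-embedding angles for positions 0..seq_len-1.

    scale < 1 is position interpolation: e.g. scale = 4096 / 131072 squeezes
    a 131k-token context into the position range used during pretraining.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    positions = np.arange(seq_len) * scale
    return np.outer(positions, inv_freq)   # shape: (seq_len, head_dim // 2)

angles = rope_angles(seq_len=8, head_dim=64, scale=4096 / 131072)
print(angles.shape)
```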
100 H100s = small LLMs
Babysit training runs, look for spikes, track evals.
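"Look for spikes" can be partly automated; a toy monitor that flags any step whose loss jumps well above its running average (the decay and threshold values below are arbitrary):

```python
def spike_monitor(losses, ema_decay=0.99, threshold=1.5):
    """Yield (step, loss, ema) whenever loss exceeds threshold x the running EMA."""
    ema = None
    for step, loss in enumerate(losses):
        if ema is not None and loss > threshold * ema:
            yield step, loss, ema
        ema = loss if ema is None else ema_decay * ema + (1 - ema_decay) * loss

# toy loss curve with one spike at step 3
print(list(spike_monitor([2.9, 2.8, 2.7, 7.5, 2.6])))
```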
Tweaks to transformer
- lots of small tweaks - mixture of experts, Marin 8B
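A minimal sketch of the mixture-of-experts idea: a router picks the top-k experts per token and mixes their outputs. Shapes and gating are simplified and not tied to any particular model such as Marin 8B.

```python
import numpy as np

def moe_forward(x, gate_w, expert_ws, top_k=2):
    """Tiny top-k mixture-of-experts feed-forward layer.

    x: (n_tokens, d) activations
    gate_w: (d, n_experts) router weights
    expert_ws: list of (d, d) expert weight matrices
    Each token is sent only to its top_k experts; outputs are mixed by a
    softmax over the selected router logits.
    """
    logits = x @ gate_w                          # (n_tokens, n_experts)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = np.argsort(-logits[t])[:top_k]  # indices of the top-k experts
        weights = np.exp(logits[t, chosen])
        weights /= weights.sum()                 # softmax over the chosen experts
        for w, e in zip(weights, chosen):
            out[t] += w * (x[t] @ expert_ws[e])
    return out

rng = np.random.default_rng(0)
d, n_experts = 16, 4
x = rng.normal(size=(3, d))
print(moe_forward(x, rng.normal(size=(d, n_experts)),
                  [rng.normal(size=(d, d)) for _ in range(n_experts)]).shape)
```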
how to budget compute
- train 50 models and pick the best?
- tune hyperparams at small scale? No
- answer: scaling laws.
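A sketch of what "scaling laws" means operationally: fit a power law to losses from small, cheap runs and extrapolate to the target compute budget. The functional form and the toy numbers below are assumptions for illustration.

```python
import numpy as np

# toy measurements from small training runs: (compute in FLOPs, final loss)
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
loss    = np.array([3.10, 2.85, 2.62, 2.47, 2.33])

# assume a pure power law  loss ~= a * compute^(-b); it is linear in log-log space:
#   log(loss) = log(a) - b * log(compute)
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
a, b = np.exp(intercept), -slope
print(f"fit: loss ~= {a:.2f} * C^(-{b:.3f})")

# extrapolate to the (much larger) target compute budget
print("predicted loss at 1e22 FLOPs:", a * 1e22 ** (-b))
```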
Last Reviewed: 10/26/2025