Mason Wang

Direct Preference Optimization

Abstract:

DPO - a new parametrization of the reward model in RLHF; it lets the optimal policy be extracted in closed form, so the policy can be trained directly with a simple classification loss (no RL loop, no sampling during training).
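The classification loss mentioned above can be sketched as follows. This is a minimal, hedged illustration (not the paper's reference implementation): it computes the DPO loss for a single preference pair from sequence log-probabilities under the policy and a frozen reference model, using plain Python math rather than a tensor library. The function name and argument names are my own.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (chosen preferred over rejected).

    Inputs are total sequence log-probabilities log pi(y|x) under the
    trainable policy and the frozen reference policy; beta controls how
    far the policy may drift from the reference.
    """
    # Implicit reward of each completion: beta * log-ratio of policy
    # to reference probability.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Binary classification on the reward margin:
    # loss = -log sigmoid(chosen_reward - rejected_reward).
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When policy and reference agree exactly, the margin is zero and the loss is log 2; as the policy raises the chosen completion's probability relative to the rejected one, the loss decreases toward zero.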

LMs