RLHF
- Train a reward model.
- To collect more data, find trajectories on which an ensemble of reward models has high variance, i.e. where the models disagree most, and label those.
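A minimal sketch of this disagreement-based selection. The ensemble members here are stand-in random linear scorers over trajectory features (real reward models would be learned networks); `disagreement` and `select_for_labeling` are illustrative names, not from any library.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_ensemble(n_models, feat_dim):
    # Stand-in ensemble: each "reward model" is a random linear scorer
    # over trajectory features (placeholder for trained reward models).
    return [rng.normal(size=feat_dim) for _ in range(n_models)]

def disagreement(ensemble, trajectory_feats):
    # Score the trajectory with every ensemble member and return the
    # variance of the scores: high variance = the models disagree.
    scores = [w @ trajectory_feats for w in ensemble]
    return float(np.var(scores))

def select_for_labeling(ensemble, trajectories, k):
    # Pick the k trajectories the ensemble disagrees on most;
    # these are the ones worth sending for human preference labels.
    ranked = sorted(trajectories,
                    key=lambda t: disagreement(ensemble, t),
                    reverse=True)
    return ranked[:k]

ensemble = make_ensemble(n_models=4, feat_dim=8)
trajectories = [rng.normal(size=8) for _ in range(100)]
batch = select_for_labeling(ensemble, trajectories, k=5)
```

The selected `batch` would then be labeled and used to retrain the reward models, shrinking the ensemble's disagreement where it was largest.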