Few-Shot Preference-Based RL
- Start with an existing reward function; every k policy-training steps, the expert is queried for 10 new preference labels, which they provide based on their observations of the agent's behavior (see the loop sketch after this list).
- Update the reward model:
  - At each fine-tuning step, start from the reward model and fine-tune it on ALL human annotations collected so far during policy training (see the first sketch after this list).
  - This reduces the amount of human preference data needed by roughly 20x.
- Meta-World - what do the human preferences look like? If the task is a backflip, the human is asked whether the robot performed a good backflip or not.
- What about training the policy directly on expert-annotated rewards instead? That is much less data-efficient; it needs far more than 10 annotations.
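A minimal sketch of the reward-model update step, assuming the standard Bradley-Terry / cross-entropy preference loss (the notes don't name the loss, so that is an assumption). At every fine-tuning step we restart from the pretrained reward model and fit a copy of it to ALL preference annotations collected so far. `RewardMLP` and `finetune_reward_model` are hypothetical names, not the paper's API.

```python
import copy
import torch
import torch.nn as nn

class RewardMLP(nn.Module):
    """Maps a (state, action) pair to a scalar reward estimate."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def finetune_reward_model(pretrained, annotations, epochs=10, lr=3e-4):
    """Fine-tune a copy of the pretrained reward model on ALL annotations.

    annotations: list of (seg_a, seg_b, label), where seg_* is an
    (obs, act) tensor pair of shape [T, dim] and label is 1.0 if the
    human preferred segment A, 0.0 if they preferred segment B.
    """
    model = copy.deepcopy(pretrained)          # restart from pretrained weights
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for (obs_a, act_a), (obs_b, act_b), label in annotations:
            # Segment "return" under the learned reward = sum of per-step rewards.
            ret_a = model(obs_a, act_a).sum()
            ret_b = model(obs_b, act_b).sum()
            # Bradley-Terry: P(A preferred over B) = sigmoid(ret_a - ret_b).
            loss = nn.functional.binary_cross_entropy_with_logits(
                ret_a - ret_b, torch.tensor(label, dtype=torch.float32))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```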
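And a sketch of the outer human-in-the-loop schedule described in the first bullet: every k policy-training steps, ask the expert for preferences on 10 new segment pairs, grow the annotation set, and re-fine-tune the reward model on everything collected so far before continuing policy training. `sample_segment_pairs`, `ask_expert`, `policy_update`, and `finetune_reward_model` (from the previous sketch) are hypothetical stand-ins, not the paper's actual functions.

```python
def preference_rl_loop(policy, pretrained_reward, replay_buffer,
                       sample_segment_pairs, ask_expert, policy_update,
                       finetune_reward_model,
                       total_steps=1_000_000, k=10_000, queries_per_round=10):
    annotations = []                          # grows over the whole run
    reward_model = pretrained_reward
    for step in range(total_steps):
        if step % k == 0:
            # 10 fresh queries to the expert, labeled from their observations.
            pairs = sample_segment_pairs(replay_buffer, n=queries_per_round)
            annotations += [(a, b, ask_expert(a, b)) for a, b in pairs]
            # Restart from the pretrained reward model; fit to ALL labels so far.
            reward_model = finetune_reward_model(pretrained_reward, annotations)
        # Ordinary RL step, with rewards relabeled by the learned reward model.
        policy_update(policy, replay_buffer, reward_model)
    return policy, reward_model
```

The key design choice here is that the annotation set is cumulative and the reward model is always re-fit from the pretrained weights rather than from its last checkpoint, matching the "fine-tune on ALL human annotations" bullet above.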