【DAILY READING】Deep Reinforcement Learning from Human Preferences
Conclusion By Me
This paper belongs to the RLHF family. Published in 2017, it introduces an RL method in which a pair of short trajectory segments (clips of the agent's behavior) is shown to an ordinary, non-expert human, who is asked which one they prefer. Because such comparisons are easy for humans to provide, it becomes much more practical to train complex models from human feedback (a minimal sketch of the learning objective follows the list below). Specifically, the paper wants an approach that:
- enables us to solve tasks for which we can only recognize the desired behavior, but not necessarily demonstrate it
- allows agents to be taught by non-expert users
- scales to large problems
- is economical with user feedback
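To make the mechanism concrete, here is a minimal sketch of the comparison objective as I understand it from the paper (my own code, not the authors'): a learned reward model r̂ scores every step of each segment, the per-step scores are summed, the two sums are turned into a Bradley-Terry preference probability, and the model is fit by cross-entropy against the human's choice.

```python
# Sketch of the preference-learning loss; this is my own illustration, not
# code from the paper. A reward model r_hat scores each step of a segment;
# here we simply take its per-step outputs as given arrays.
import numpy as np

def preference_loss(rhat_seg1, rhat_seg2, human_label):
    """Cross-entropy loss for a single human comparison.

    rhat_seg1, rhat_seg2: predicted per-step rewards for the two trajectory
        segments (1-D arrays).
    human_label: 1.0 if the human preferred segment 1, 0.0 if segment 2,
        0.5 if the two were judged equally good.
    """
    score1, score2 = rhat_seg1.sum(), rhat_seg2.sum()
    # Bradley-Terry: P(segment 1 preferred) = exp(s1) / (exp(s1) + exp(s2)),
    # written as a numerically stable logistic of the score difference.
    p1 = 1.0 / (1.0 + np.exp(score2 - score1))
    p2 = 1.0 - p1
    eps = 1e-8  # avoid log(0)
    return -(human_label * np.log(p1 + eps) + (1.0 - human_label) * np.log(p2 + eps))

# Hypothetical example: the human prefers segment 1 and the model agrees,
# so the loss is small.
seg1_rewards = np.array([0.4, 0.6, 0.5])
seg2_rewards = np.array([0.1, 0.0, 0.2])
print(preference_loss(seg1_rewards, seg2_rewards, human_label=1.0))
```

In the paper the reward predictor is a small ensemble of neural networks and the labels allow for ties and some chance of human error; the sketch above keeps only the core loss.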
Abstract
For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than 1% of our agent’s interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to SOTA RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any which have been previously learned from human feedback.
Key Points By AI
- Recent success in scaling reinforcement learning (RL) to large problems has been driven in domains that have a well-specified reward function
- The agent learns about the goal of the task only by asking a human which of two trajectory segments is better (a toy sketch of this loop follows the list below)
- Feedback is provided by contractors who are given a 1-2 sentence description of each task before being asked to compare several hundred to several thousand pairs of trajectory segments for that task
- While there is a large literature on preference elicitation and RL from unknown reward functions, this work provides the first evidence that these techniques can be economically scaled up to SOTA RL systems
- Future work may be able to improve the efficiency of learning from human preferences, and expand the range of tasks to which it can be applied
- In the long run it would be desirable to make learning a task from human preferences no more difficult than learning it from a programmatic reward signal, ensuring that powerful RL systems can be applied in the service of complex human values rather than low-complexity goals
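Putting the key points together, here is a toy end-to-end sketch of the loop (again my own simplification, not the authors' implementation): collect pairs of segments, query preferences on a small fraction of them, fit the reward model to the comparisons, and let an RL algorithm optimize the predicted reward. The human is replaced by a synthetic oracle that prefers the segment with the higher hidden true return, which mirrors the paper's synthetic-preference experiments; the policy-improvement step is only indicated by a comment.

```python
# Toy version of the preference-based training loop; my own simplification,
# not the authors' implementation. The reward model is linear and trained
# online with the Bradley-Terry cross-entropy from the earlier sketch.
import numpy as np

rng = np.random.default_rng(0)
DIM, SEG_LEN = 4, 10
true_w = rng.normal(size=DIM)      # hidden "true" reward; never shown to the learner
w_hat = np.zeros(DIM)              # parameters of the learned linear reward model

def sample_segment():
    # Stand-in for "roll out the current policy and cut out a short clip".
    return rng.normal(size=(SEG_LEN, DIM))

def true_return(seg):
    return float((seg @ true_w).sum())

def predicted_return(seg):
    return float((seg @ w_hat).sum())

def synthetic_human(seg1, seg2):
    # Oracle label: 1.0 if segment 1 has the higher true return, else 0.0.
    return 1.0 if true_return(seg1) > true_return(seg2) else 0.0

for step in range(2000):
    # 1. Collect a pair of trajectory segments.
    seg1, seg2 = sample_segment(), sample_segment()
    # 2. Ask the (here: synthetic) human which segment looks better.
    label = synthetic_human(seg1, seg2)
    # 3. One SGD step on the Bradley-Terry cross-entropy for this comparison.
    diff = (seg1 - seg2).sum(axis=0)            # difference of summed step features
    p1 = 1.0 / (1.0 + np.exp(-(w_hat @ diff)))  # P(model prefers segment 1)
    w_hat -= 0.05 * (p1 - label) * diff         # gradient of the cross-entropy loss
    # 4. In the full method an RL algorithm (A2C / TRPO in the paper) would now
    #    improve the policy against predicted_return(); omitted in this toy.

# Sanity check: the learned reward ranks fresh segment pairs like the hidden one.
pairs = [(sample_segment(), sample_segment()) for _ in range(200)]
agreement = np.mean([(predicted_return(a) > predicted_return(b)) ==
                     (true_return(a) > true_return(b)) for a, b in pairs])
print(f"ranking agreement with the hidden true reward: {agreement:.2f}")
```

In the real system these parts run asynchronously: the agent interacts with the environment, human comparisons are collected, and the reward predictor is refit, all in parallel, so that feedback is needed on well under 1% of the agent's interactions.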
File
[1706.03741] Deep reinforcement learning from human preferences (arxiv.org)