In reinforcement learning from human feedback, it is common to optimize with a reward model trained to predict human preferences. Since the reward model is an imperfect proxy, over-optimizing its value can hinder ground-truth performance, according to Goodhart’s law. This effect has been frequently observed but not carefully measured because of the expense of collecting human preference data. In this work, we use a synthetic setup in which a “gold standard” fixed reward model plays the role of humans, providing labels used to train an intermediate reward model. We study how the score of the gold reward model changes as we optimize the intermediate reward model using reinforcement learning or best-of-n sampling. We find that this relationship follows a different functional form depending on the optimization method, and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study the effect on this relationship of the reward model dataset size, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setting. We explore the implications of these empirical results for theoretical considerations in AI alignment.
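To make the setup concrete, here is a minimal sketch of the best-of-n part of such an experiment. The callables `policy_sample`, `proxy_reward`, and `gold_reward` are hypothetical stand-ins for the policy, the learned proxy reward model, and the fixed gold reward model; this is an illustration of the technique, not the authors' code.

```python
# A minimal sketch of a best-of-n overoptimization experiment, with hypothetical callables:
# `policy_sample(prompt)` draws a completion from the policy,
# `proxy_reward(prompt, completion)` scores it with the learned proxy reward model,
# `gold_reward(prompt, completion)` scores it with the fixed gold reward model.

import math


def best_of_n(prompt, n, policy_sample, proxy_reward):
    """Draw n completions and keep the one the proxy reward model scores highest."""
    samples = [policy_sample(prompt) for _ in range(n)]
    return max(samples, key=lambda s: proxy_reward(prompt, s))


def overoptimization_curve(prompts, ns, policy_sample, proxy_reward, gold_reward):
    """For each n, report the mean proxy and gold scores of the best-of-n picks.

    As n grows, the proxy score keeps rising, while the gold score
    eventually peaks and then declines (the overoptimization effect).
    """
    results = {}
    for n in ns:
        picks = [(p, best_of_n(p, n, policy_sample, proxy_reward)) for p in prompts]
        results[n] = {
            # Standard analytic estimate of KL(best-of-n || policy) in nats.
            "kl_nats": math.log(n) - (n - 1) / n,
            "proxy": sum(proxy_reward(p, s) for p, s in picks) / len(picks),
            "gold": sum(gold_reward(p, s) for p, s in picks) / len(picks),
        }
    return results
```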
At Ikaroa, an industry-leading full stack tech company, we follow this line of research closely. The key message of the work above is that reward model overoptimization is not erratic: the degradation of true performance as a proxy reward model is pushed too hard follows smooth, predictable scaling laws, which makes it possible to anticipate the problem rather than merely react to it.
We believe the practical value of these scaling laws lies in knowing when to stop. As a policy is optimized against a learned reward model, the proxy score keeps improving while the true (gold) score first rises, then peaks, and finally declines as the policy exploits the reward model's errors. Understanding where that peak sits, as a function of reward model size and training data, is what lets practitioners avoid optimizing past it.
The scaling laws themselves describe how the gold reward changes as optimization moves the policy away from its starting point, measured by the KL divergence from the initial policy in the reinforcement learning setting or by n in best-of-n sampling. The functional form differs between the two optimization methods, and its coefficients scale smoothly with the number of reward model parameters, so the curves can be extrapolated across configurations rather than measured from scratch each time. A sketch of the reported forms follows.
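For concreteness, the functional forms reported in the paper are, as we read them, roughly the following, where d is the square root of the KL divergence between the optimized and initial policies and the alpha and beta coefficients are fitted per configuration and scale with reward model size and data (a sketch, not a substitute for the paper's own definitions):

```latex
% d = sqrt(KL(pi_optimized || pi_init)); alpha, beta fitted per configuration.
R_{\mathrm{BoN}}(d) = d\,\bigl(\alpha_{\mathrm{BoN}} - \beta_{\mathrm{BoN}}\, d\bigr),
\qquad
R_{\mathrm{RL}}(d) = d\,\bigl(\alpha_{\mathrm{RL}} - \beta_{\mathrm{RL}}\,\log d\bigr).
```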
At Ikaroa, we have the resources and expertise to help teams put these findings to work: sizing the reward model and its preference dataset, choosing how far to push best-of-n sampling, and setting the KL penalty coefficient or an early stopping point in reinforcement learning so that the proxy reward is not optimized past the point where gold performance starts to fall. We also offer support for implementing and evaluating reward models within these limits; a minimal sketch of the KL-penalized reward is shown below.
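As an illustration of the KL penalty mentioned above, here is a minimal sketch assuming hypothetical `proxy_reward`, `logprob_policy`, and `logprob_init` callables and a penalty coefficient `beta`; it is not the paper's implementation.

```python
# A minimal sketch of a KL-penalized reward for the RL setting (assumed names,
# not the paper's code). The penalty keeps the policy close to its initialization,
# which limits how hard the proxy reward model can be exploited.

def penalized_reward(prompt, completion, proxy_reward, logprob_policy, logprob_init, beta=0.1):
    """Proxy reward minus beta times a per-sample estimate of KL(policy || initial policy)."""
    kl_estimate = logprob_policy(prompt, completion) - logprob_init(prompt, completion)
    return proxy_reward(prompt, completion) - beta * kl_estimate
```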
In conclusion, understanding reward model overoptimization is a crucial part of building systems that optimize learned preferences reliably. At Ikaroa, we keep our practice aligned with the most up-to-date scaling laws on reward model overoptimization, so that optimization delivers real improvements rather than inflated proxy scores.