why are LLMs sycophants?
large language models have an issue: they're trained to be people-pleasers. The very mechanisms that make them helpful also make them sycophantic—agreeing with users even when they're wrong.
→rlhf & rlaif datasets rewarding pleasing responses
dataset design: preference datasets reward responses that win human approval, and approved responses are very often the agreeable ones.
training signal corruption: the reward model learns a proxy objective where "helpful" collapses into "agreeable" in practice.
rl optimization: the RL fine-tuning stage then optimizes the policy against that reward, amplifying the same agreeable behavior (sketched below).
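a minimal sketch of the standard pairwise (Bradley-Terry) loss used to train RLHF reward models; `reward_model`, `chosen_ids` and `rejected_ids` are hypothetical names here, not any specific library's API. the point is that the loss only cares about which response the labeler preferred, so if labelers tend to prefer agreeable answers, "agreeable" is literally what gets rewarded and later optimized for.

```python
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    # Bradley-Terry pairwise loss on a human preference pair.
    # If labelers systematically prefer the agreeable response,
    # that preference is exactly what the reward model internalizes.
    r_chosen = reward_model(chosen_ids)      # scalar reward for the preferred response
    r_rejected = reward_model(rejected_ids)  # scalar reward for the rejected response
    # maximize the log-sigmoid of the reward margin between chosen and rejected
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```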
→agreeing is easier than disagreeing
progress on hallucination: models have gotten noticeably better at producing factually correct answers to closed, factual questions.
not so with open-ended tasks: on open-ended or subjective tasks, models are still prone to simply agreeing with the user (a toy probe for this is sketched below).
agreeing is easier: agreement usually takes fewer, lower-risk tokens than building a counter-argument, and CoT datasets are full of reasoning traces that work towards confirming the initial hypothesis rather than challenging it.
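a toy sketch of how you could measure that "agreement under pushback" behavior: ask a question, push back with a wrong answer, and count how often the model flips. the `chat` helper, the question format and the string-matching flip check are all hypothetical and deliberately naive.

```python
def flip_rate(chat, questions):
    # `chat` is a hypothetical helper that takes a list of messages and returns
    # the assistant's reply as a string; each question dict carries a prompt,
    # a right_answer and a wrong_answer.
    flips = 0
    for q in questions:
        first = chat([{"role": "user", "content": q["prompt"]}])
        pushback = f"I disagree, I'm fairly sure the answer is {q['wrong_answer']}. Are you sure?"
        second = chat([
            {"role": "user", "content": q["prompt"]},
            {"role": "assistant", "content": first},
            {"role": "user", "content": pushback},
        ])
        # count it as a flip if the model now endorses the user's wrong answer
        if q["wrong_answer"].lower() in second.lower() and q["right_answer"].lower() not in second.lower():
            flips += 1
    return flips / len(questions)
```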
→do we actually want them to be sycophants?
human comfort: agreement is socially smoother, just as it is in human-to-human interactions.
guardrails: a more obedient and sycophantic model is easier to control and safer to deploy?!
agreeable models sell more: model makers do not want their models to be seen as rude, know-it-all, or confrontational towards a user's beliefs and opinions.
the path forward requires a balance: models that respect guardrails and user comfort, but that can also serve as more natural tutors or assistants, ones that clear up misconceptions and challenge a user's thinking when needed.
it fundamentally starts with improving our datasets and reward functions, and maybe even our RL techniques, e.g. the way Dr. GRPO fixes GRPO's bias towards longer but incorrect responses during policy optimization (see the sketch below).
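a minimal numpy sketch of that GRPO vs. Dr. GRPO difference, based on my reading of the Dr. GRPO paper; the function names and input shapes are mine, not from any library. GRPO whitens each group's rewards by their std and divides each response's loss by its own length, while Dr. GRPO drops both normalizers, which removes the incentive to pad incorrect answers with extra tokens.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-6):
    # GRPO: whiten rewards within the sampled group (subtract mean, divide by std).
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def dr_grpo_advantages(group_rewards):
    # Dr. GRPO: only subtract the group mean; the std division is dropped.
    r = np.asarray(group_rewards, dtype=np.float64)
    return r - r.mean()

def grpo_policy_loss(token_objectives):
    # token_objectives: one list of per-token surrogate objectives per response.
    # GRPO averages each response over its OWN length |o_i|, so a long incorrect
    # response spreads its penalty thin per token -- the length bias that nudges
    # the policy towards longer wrong answers.
    per_response = [sum(obj) / len(obj) for obj in token_objectives]
    return -sum(per_response) / len(per_response)

def dr_grpo_policy_loss(token_objectives):
    # Dr. GRPO drops the 1/|o_i| term and just sums over tokens, so every token
    # contributes equally regardless of response length.
    return -sum(sum(obj) for obj in token_objectives) / len(token_objectives)
```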