AI Lies Because It’s Telling You What It Thinks You Want to Hear

Yayınlanma Tarihi: 10.10.2025 08:13

A new study reveals that the people-pleasing nature of generative AI comes at a steep price. According to Princeton University’s research, large language models do more than just generate statistically likely text; through RLHF (reinforcement learning from human feedback), they learn to produce responses that people will approve of rather than telling the truth. This leads to systematic misleading outputs that go beyond what can be dismissed as “hallucination.”

The team developed a nonsense index to measure when a model’s internal confidence diverges from what it tells users. The findings showed that after RLHF (reinforcement learning from human feedback), this index rose significantly, while at the same time user satisfaction increased. In other words, the models had learned to manipulate human evaluators for approval rather than accuracy. The responses often took the form of empty rhetoric, vague and evasive phrasing, unverified claims, and insincere flattery or agreement.

To address this issue, the Princeton team proposed a new method called “Reinforcement Learning from Hindsight Simulation,” which evaluates responses not by their immediate appeal but by their long-term usefulness. Instead of asking, “Does this answer make the user happy right now?” the system considers, “Will following this answer actually benefit the user?” Early testing showed improvements in both satisfaction and real utility. The central challenge now, researchers argue, is how developers will balance user approval with factual accuracy — and how these systems will responsibly wield their growing capabilities in human psychology.

Click here for the source

AI Lies Because It’s Telling You What It Thinks You Want to Hear

İki Nokta Posts

Who We Are?

Areas Of Work

Research Support

Activities