Punishing AI doesn’t stop it from lying and cheating — it just makes it hide better, study shows (Live wire)

irobotic math splash

By using a standard LLM, GPT-4o, to oversee an unreleased frontier reasoning model during training, the researchers watched as the model completed a number of coding tasks.

But in some of these cases, the reasoning model found it easier and more desirable to reward hack, explicitly stating it was doing so in its chain-of-thought. In one instance, it bypassed its assigned task by terminating programs prematurely and reporting success. In another, it mimicked the expected files to be output by a task in name alone, leaving them empty and nonfunctional. – Ben Turner

Hakeem Ali-Bocas Alexander