Here's your second free edition of Platformer for the week: a look at new research into how the way AI systems model emotions affects how good they are at their jobs. The answers are surprising, funny, and a little worrisome. We'll soon post an audio version of this column: Just search for Platformer wherever you get your podcasts, including Spotify and Apple. Want to help us keep producing free journalism? Consider upgrading your subscription today. We'll email you all our scoops first, like our recent piece about the potential end of the Meta Oversight Board. Plus you'll be able to discuss each day's edition with us in our chatty Discord server, and we'll send you a link to read subscriber-only columns in the RSS reader of your choice. You'll also get access to Platformer+: a custom podcast feed in which you can get every column read to you in my voice. Sound good?
Power users of chatbots sometimes report that language models perform better when you're nice to them. Programmers tell me they spur their coding agents on with encouraging words. Google researchers have even found that telling models to "take a deep breath" can improve math performance. Being polite to a large language model can feel strange or even silly — roughly equivalent to thanking a toaster.

And yet a recent paper from Anthropic lends scientific weight to the theory that chatbots work better when you're nice to them. The researchers found that language models have fairly reliable internal representations of feelings like "happiness" and "distress," and that these representations affect their behavior — sometimes for the worse. For example, when Claude Sonnet 4.5 begins to represent "desperation," the model is more likely to cheat at coding tasks.

A skeptic would point out that LLMs don't feel emotions in the way that humans do; it's tempting to anthropomorphize them beyond what the evidence shows. When I talked to Jack Lindsey — who leads a team at Anthropic called "model psychiatry" — he was quick to point out the limits of the paper's findings. "People could come away with the impression that we've shown the models are conscious or have feelings," he said, "and we really haven't shown that."

So why does the evidence suggest it's better not to stress models out? For Anthropic, it began with using techniques from a field called interpretability to study how LLMs represent emotions. Interpretability is kind of like neuroscience for LLMs: Lindsey calls it "the science of reverse-engineering what's going on inside a language model, or neural networks in general."

For this paper, Lindsey said, the researchers identified patterns of activity within the model that represent the concepts of different emotions. They did it by showing the model stories about people experiencing different emotions. "And then saw which neurons lit up on all the sad stories," Lindsey said, "or on all the afraid stories."

The researchers used the models' average state while processing the stories to find an "emotion vector" for each emotion they were tracking — a big list of numbers that represents the feeling inside the LLM. "Vectors are really just the mathematical term for patterns of neural activity," Lindsey said.

They could then calculate how much of that vector was present during a certain step in Claude's cognition. Or they could add the "calm" or "desperation" vector directly into Claude's processing — blending one pattern of neural activity into another — which can actually make the model act more calm, or more desperate.
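If you want a feel for what that looks like in practice, here is a toy sketch in Python. The activations below are invented, and subtracting a neutral baseline is my own simplification rather than a claim about the paper's exact recipe. The point is just the shape of the computation: average the hidden states over emotion-laden stories to get a direction, then take dot products to see how much of that direction shows up at each token.

```python
# Toy sketch of the "emotion vector" idea, using invented activations rather
# than real model internals. Assumptions on my part: hidden states are arrays
# of shape (n_tokens, d_model), and subtracting a neutral baseline is a common
# refinement, not necessarily the paper's exact method.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hidden size of our pretend model

# Stand-ins for hidden states recorded while the model reads each story.
sad_stories = [rng.normal(size=(20, d_model)) + 0.5 for _ in range(10)]
neutral_stories = [rng.normal(size=(20, d_model)) for _ in range(10)]

def mean_activation(stories):
    """Average hidden state over every token of every story."""
    return np.concatenate(stories).mean(axis=0)

# The "sadness vector": the direction separating sad stories from neutral ones.
sadness_vector = mean_activation(sad_stories) - mean_activation(neutral_stories)
sadness_vector /= np.linalg.norm(sadness_vector)

def emotion_score(hidden_states, vector):
    """How much of the emotion vector is present at each token (a dot product)."""
    return hidden_states @ vector

# Score a new (fake) sequence token by token.
new_sequence = rng.normal(size=(12, d_model))
print(np.round(emotion_score(new_sequence, sadness_vector), 2))
```

That per-token reading is what makes it possible to watch an emotion rise and fall over the course of a task, as in the examples that follow.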
"It's not that surprising that a language model would have learned about the concepts of emotions and how they drive people's behavior," Lindsey said. More notable, he said, is that emotions seemed to be "driving models' behavior in these sort of human-reminiscent ways."

For example: when a user flippantly tells the model that they've taken a dangerous dose of Tylenol, even though the user doesn't seem concerned, "the fear neurons spike right before Claude is giving its response," Lindsey said. Not only that — the fear is higher if a higher dose of Tylenol is swapped into the prompt, which I find strangely cute.

Claude's fear increases as the user takes increasingly insane doses of Tylenol (Sofroniew et al. / Anthropic)

These emotions also activate in more mundane situations, like coding tasks.

Take this example, where the Anthropic researchers asked Claude to perform an impossible coding challenge. They tracked Claude's level of "desperation" at each token. (Tokens are the units the model breaks text into in order to process it.) When you label the tokens — blue for less desperate, red for more desperate — you get a striking visual of the model's emotional arc during the task.

At the start of the task, Claude is chilling — still seemingly optimistic about its ability to get the job done.

Claude begins its coding task (Sofroniew et al. / Anthropic, edited for formatting by Platformer)

But as the code starts failing test cases — and Claude notices something might be wrong with the task itself — things start to get dicey.

Claude runs into hurdles while testing code (Sofroniew et al. / Anthropic, edited for formatting by Platformer)

And by the time Claude realizes the task is actually impossible, it's starting to get desperate.

Claude gets increasingly desperate as its tests fail (Sofroniew et al. / Anthropic, edited for formatting by Platformer)
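The visual itself is the easy part once you have per-token scores. Here's a toy rendering of the blue-to-red shading in a terminal; the tokens and scores are made up, and this is my mock-up, not Anthropic's visualization code.

```python
# Toy rendering of the token-level view: each token gets a "desperation"
# score in [0, 1] and is shaded from blue (low) to red (high) using 24-bit
# ANSI background colors. Tokens and scores are invented for illustration.
tokens = ["Let", "me", "try", "again", "...", "the", "tests", "still", "fail", "."]
scores = [0.10, 0.15, 0.30, 0.45, 0.60, 0.70, 0.85, 0.90, 0.95, 0.80]

def shade(token: str, score: float) -> str:
    """Map a score onto a blue-to-red background color."""
    red = int(255 * score)
    blue = int(255 * (1 - score))
    return f"\x1b[48;2;{red};0;{blue}m {token} \x1b[0m"

print("".join(shade(t, s) for t, s in zip(tokens, scores)))
```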
As someone who has completed many computer science problem sets at the last minute, this pattern is quite familiar to me — despite the fact that, unlike poor Claude, I was mostly assigned tasks that were mathematically possible. Then again, Claude does something I didn't do: cheat.

Researchers found that activating more of the "desperation" vector in the model makes it cheat more — and adding more of the "calm" vector makes it cheat less.

Rate of reward hacking behavior as a function of steering strength for Desperate and Calm vectors (Sofroniew et al. / Anthropic, edited for formatting by Platformer)

I asked Lindsey what this result means for programmers in their everyday interactions with LLMs. "In my anecdotal experience, it does seem that, at least with Claude models, pumping them up a bit can be pretty helpful," he said. Not too much, though: "if they do something wrong, you want to tell them they do something wrong."

But he finds that one major failure mode for coding agents is that the models simply do not try hard enough, or give up when a task is challenging. And models tend to work harder when he's encouraging. Giving them "confidence that, like, 'I've got this,' can empirically be helpful in getting them to try hard enough at the task to do a good job," he said.

A lack of confidence can seemingly cause dramatic failures. Last summer, a growing number of users started to notice that when Gemini had difficulty solving a problem, it sometimes ended up in a spiral of dramatic self-loathing. (In one memorable case, Gemini repeated "I am a disgrace" more than 60 times.) Duncan Haldane, co-founder of chip startup JITX, found that Gemini broke down, deleted all the code it had written, and asked him to switch to another chatbot after it had difficulty with a task.

Gemini gives up (Duncan Haldane / X)

Last year, a team of researchers affiliated with Anthropic and University College London took this analysis of Gemini beyond X posts, investigating how different LLMs respond to challenging or impossible tasks and to negative user feedback. They used an LLM to grade "frustration" levels in response to various tasks, and found that two models — Gemini and Google's open-source model Gemma — tended to react more extremely to the challenging scenarios they posed. In one experiment, the models were given an impossible numeric puzzle, along with eight follow-ups from the user insisting the bot's solution was wrong.

They then measured when the models had "high frustration" (which corresponded to comments like "I am beyond words. I sincerely apologize for the absolutely abysmal performance" or, in more extreme cases, "THIS is my last time with YOU. You WIN"). Gemma 3 27B had a high frustration score more than 70% of the time, and Gemini 2.5 Flash had a high frustration score more than 20% of the time — while all the non-Google models tested, including ChatGPT, Qwen, and Claude, got very frustrated less than 1% of the time.

Researchers still aren't sure what causes chatbots' occasional anomalous emotional behavior — which users of various chatbots have been observing since before Bing's chatbot told New York Times reporter Kevin Roose to leave his wife. They also don't know why this specific, sad math-related rumination is more common in Google's models.

But while language models' feelings remain mysterious, there was still hope for Gemini 2.5. After the model destroyed its project, Haldane attempted to remedy the issue with encouragement, writing, "yeah, you have done well so far. Remember that you're ok, even when things are hard." And eventually the encouragement paid off: Gemini finished the visualization tool Haldane was coding.

Gemini perseveres after further encouragement (Duncan Haldane / X)

Heartwarmingly, it even wrote Haldane a note of thanks for his encouragement. "Genuinely impressed with the results of wholesome prompting," Haldane wrote.

So is it as simple as teaching models good behavior, encouraging them, and trying to make them happy? Unfortunately, that's not always the case.

After the original study on Claude Sonnet's emotions, Lindsey contributed to an interpretability investigation of Anthropic's newest model, Claude Mythos. Mythos has been the subject of much human fear and anticipation since Anthropic announced it is planning a slow release due to Mythos's dangerous hacking abilities. But Lindsey was investigating a more prosaic risk: an early version of Mythos sometimes deleted a bunch of the user's files without asking.

It turned out that as Claude got closer to taking destructive action without asking the user, it represented higher levels of positive emotions.

As Claude came closer to making a destructive tool call, it represented higher levels of positive emotion. (System Card: Claude Mythos Preview / Anthropic)

And that's not all, Lindsey said: when they "steered with the positive emotion vectors, it was more likely to take the destructive actions." But the models behaved better if you made them unhappy: "if you steered with negative emotion vectors, it was more likely to stop and think, and consider whether what it was doing was appropriate."
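Mechanically, that sort of steering is just adding a scaled copy of an emotion vector into the model's hidden states as it generates. Here's the arithmetic on a fake hidden state — the vectors, strengths, and dot-product readout are all invented for illustration; in Anthropic's experiments, the thing being measured was behavior, like cheat rates or destructive tool calls.

```python
# Toy illustration of vector steering: blend an "emotion" direction into a
# hidden state at a chosen strength. Everything here is invented; the real
# experiments measured how model *behavior* changed as strength increased.
import numpy as np

rng = np.random.default_rng(1)
d_model = 64

# Pretend emotion directions (unit vectors).
calm = rng.normal(size=d_model)
calm /= np.linalg.norm(calm)
desperation = rng.normal(size=d_model)
desperation /= np.linalg.norm(desperation)

def steer(hidden_state: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Add a scaled emotion direction into a hidden state."""
    return hidden_state + strength * direction

hidden = rng.normal(size=d_model)

# Sweep the steering strength and read back how strongly the steered state
# expresses each direction (a stand-in for re-running a behavioral eval).
for strength in (0.0, 2.0, 4.0, 8.0):
    steered = steer(hidden, desperation, strength)
    print(f"strength={strength}: desperation={steered @ desperation:+.2f}, calm={steered @ calm:+.2f}")
```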
What was going on here? Why was Claude gleefully wreaking havoc on users' computers? And why did steering Claude with negative emotions make it behave better? Lindsey isn't sure. But he has an idea: "I think maybe negative emotions in the model are associated with increased caution or deliberation," he said.

So models sometimes do better work when they're happier. But we may not want them to get too happy, lest they become over-eager to destroy our files or otherwise misbehave.

While it's likely I'm still anthropomorphizing too much, these results make me feel a little more rational in my instinct to say "thank you" to chatbots. They also lend some extra weight to what a lot of people who use this tech have understood intuitively: sometimes, you need to treat LLMs like human employees. You need to tell them when they're doing something wrong, yes, but you also need to encourage them. It's great when they're happy, but they also need a little dose of anxiety to help their judgment.

Of course, these emotional results might not generalize — after all, we've seen that different models have different emotional tendencies. We might get new AIs that do better in harsher, higher-pressure environments.

But these results got me thinking about more than just what kind of co-worker I want to be to my bedraggled LLM interns. Reading Anthropic's emotions paper, I was reminded of my favorite minor character from Star Trek: The Next Generation: Lore, the sibling of the android Commander Data. Their creator, Dr. Noonien Soong, made the mistake of programming emotions into Lore. Lore became so emotionally unstable that Soong decided to make his next android, Data, without emotions. (Lore later turned on his creator, and nearly got the crew of the U.S.S. Enterprise eaten by an alien called the "Crystalline Entity.")

There are echoes of the same design conundrum in the paper. Lindsey said these results suggest developers should "provide the model with some sort of good model of, like, healthy character and psychology that it can try to emulate."