At NeurIPS, Melanie Mitchell Says AI Needs Better Tests

When people want a clear-eyed take on the state of artificial intelligence and what it all means, they tend to turn to Melanie Mitchell, a computer scientist and a professor at the Santa Fe Institute. Her 2019 book, Artificial Intelligence: A Guide for Thinking Humans, helped define the modern conversation about what today’s AI systems can and can’t do.

Today at NeurIPS, the year’s biggest gathering of AI professionals, she gave a keynote titled “On the Science of ‘Alien Intelligences’: Evaluating Cognitive Capabilities in Babies, Animals, and AI.” Ahead of the talk, she spoke with IEEE Spectrum about its themes: why today’s AI systems should be studied more like nonverbal minds, what developmental and comparative psychology can teach AI researchers, and how better experimental methods could reshape the way we measure machine cognition.

You use the phrase “alien intelligences” for both AI and biological minds like babies and animals. What do you mean by that?

Melanie Mitchell: Hopefully you noticed the quotation marks around “alien intelligences.” I’m quoting from a paper by [the neural network pioneer] Terrence Sejnowski where he talks about ChatGPT as being like a space alien that can communicate with us and seems intelligent. And then there’s another paper by the developmental psychologist Michael Frank who plays on that theme and says, we in developmental psychology study alien intelligences, namely babies. And we have some methods that we think may be helpful in analyzing AI intelligence. So that’s what I’m playing on.

When people talk about evaluating intelligence in AI, what kind of intelligence are they trying to measure? Reasoning or abstraction or world modeling or something else?

Mitchell: All of the above. People mean different things when they use the word intelligence, and intelligence itself has all these different dimensions, as you say. So, I used the term cognitive capabilities, which is a little bit more specific. I’m looking at how different cognitive capabilities are evaluated in developmental and comparative psychology and trying to apply some principles from those fields to AI.

Current Challenges in Evaluating AI Cognition

You say that the field of AI lacks good experimental protocols for evaluating cognition. What does AI evaluation look like today?

Mitchell: The typical way to evaluate an AI system is to have some set of benchmarks, and to run your system on those benchmark tasks and report the accuracy. But often it turns out that even though these AI systems we have now are just killing it on benchmarks, they’re surpassing humans, that performance doesn’t often translate to performance in the real world. If an AI system aces the bar exam, that doesn’t mean it’s going to be a good lawyer in the real world. Often the machines are doing well on those particular questions but can’t generalize very well. Also, tests that are designed to assess humans make assumptions that aren’t necessarily relevant or correct for AI systems, about things like how well a system is able to memorize.

As a computer scientist, I didn’t get any training in experimental methodology. Doing experiments on AI systems has become a core part of evaluating systems, and most people who came up through computer science haven’t had that training.

What do developmental and comparative psychologists know about probing cognition that AI researchers should know too?

Mitchell: There’s all kinds of experimental methodology that you learn as a student of psychology, especially in fields like developmental and comparative psychology because those are nonverbal agents. You have to really think creatively to figure out ways to probe them. So they have all kinds of methodologies that involve very careful control experiments, and making lots of variations on stimuli to check for robustness. They look carefully at failure modes, why the system [being tested] might fail, since those failures can give more insight into what’s going on than success.

Can you give me a concrete example of what these experimental methods look like in developmental or comparative psychology?

Mitchell: One classic example is Clever Hans. There was this horse, Clever Hans, who seemed to be able to do all kinds of arithmetic and counting and other numerical tasks. And the horse would tap out its answer with its hoof. For years, people studied it and said, “I think it’s real. It’s not a hoax.” But then a psychologist came around and said, “I’m going to think really hard about what’s going on and do some control experiments.” And his control experiments were: first, put a blindfold on the horse, and second, put a screen between the horse and the question asker. Turns out if the horse couldn’t see the question asker, it couldn’t do the task. What he found was that the horse was actually perceiving very subtle facial expression cues in the asker to know when to stop tapping. So it’s important to come up with alternative explanations for what’s going on. To be skeptical not only of other people’s research, but maybe even of your own research, your own favorite hypothesis. I don’t think that happens enough in AI.

Do you have any case studies from research on babies?

Mitchell: I have one case study where babies were claimed to have an innate moral sense. The experiment showed them videos where there was a cartoon character trying to climb up a hill. In one case there was another character that helped them go up the hill, and in the other case there was a character that pushed them down the hill. So there was the helper and the hinderer. And the babies were assessed as to which character they liked better—and they had a couple of ways of doing that—and overwhelmingly they liked the helper character better. [Editor’s note: The babies were 6 to 10 months old, and assessment techniques included seeing whether the babies reached for the helper or the hinderer.]

But another research group looked very carefully at these videos and found that in all of the helper videos, the climber who was being helped was excited to get to the top of the hill and bounced up and down. And so they said, “Well, what if in the hinderer case we have the climber bounce up and down at the bottom of the hill?” And that completely turned around the results. The babies always chose the one that bounced.

Again, coming up with alternatives, even if you have your favorite hypothesis, is the way that we do science. One thing that I’m always a little shocked by in AI is that people use the word skeptic as a negative: “You’re an LLM skeptic.” But our job is to be skeptics, and that should be a compliment.

Importance of Replication in AI Studies

Both those examples illustrate the theme of looking for counter explanations. Are there other big lessons that you think AI researchers should draw from psychology?

Mitchell: Well, in science in general the idea of replicating experiments is really important, and also building on other people’s work. But that’s sadly a little bit frowned on in the AI world. If you submit a paper to NeurIPS, for example, where you replicated someone’s work and then you do some incremental thing to understand it, the reviewers will say, “This lacks novelty and it’s incremental.” That’s the kiss of death for your paper. I feel like that should be appreciated more because that’s the way that good science gets done.

Going back to measuring cognitive capabilities of AI, there’s lots of talk about how we can measure progress towards AGI. Is that a whole other batch of questions?

Mitchell: Well, the term AGI is a little bit nebulous. People define it in different ways. I think it’s hard to measure progress for something that’s not that well defined. And our conception of it keeps changing, partially in response to things that happen in AI. In the old days of AI, people would talk about human-level intelligence and robots being able to do all the physical things that humans do. But people have looked at robotics and said, “Well, okay, it’s not going to get there soon. Let’s just talk about what people call the cognitive side of intelligence,” which I don’t think is really so separable. So I am a bit of an AGI skeptic, if you will, in the best way.
