AI’s most important benchmark in 2026? Trust

In 2026 (and beyond) the best benchmark for large language models won’t be MMLU or AgentBench or GAIA. It will be trust—something AI will have to rebuild before it can be broadly useful and valuable to both consumers and businesses.

Researchers identify several different kinds of AI trust. In people who use chatbots as companions or confidants, they measure a feeling that the AI is benevolent or has integrity. In people who use AI for productivity or business, they measure something called “competence trust,” or the belief that the AI is accurate and doesn’t hallucinate facts. I’ll focus on that second kind.

Competence trust can grow or shrink. An AI tool user, quite rationally, begins by giving the AI simple tasks—perhaps looking up facts or summarizing long documents. If the AI does a good job of these things, the user naturally thinks “what else can I do with this?” They may give the AI a slightly harder task. If the AI continues to get things right, trust grows. If the AI fails or provides a low-quality answer, the user will think twice about trying to automate the task next time.

Steps forward, steps back

Today’s AI chatbots, which are powered by large generative AI models, are far better than the ones we had in 2023 and 2024. But AI tools are just beginning to build trust with most users, and with the C-suite executives who hope the tools will streamline business functions. My own trust in chatbots grew in 2025. But it has also diminished.

Example: I entered a long conversation with one of the popular chatbots about the contents of a long document. The AI made some interesting observations about the work, and suggested some sensible ways of filling in gaps. Then it made an observation that seemed to contradict something I knew was in the document.

When I pointed out the missing data, it immediately admitted its mistake. When I asked it (again) if it had digested the full document, it again insisted it had. Another AI chatbot returned a research report that it said was based on 20 sources. But there were no citations in the text connecting specific statements to specific sources. After it added the citations within the text, I noted that in two places the AI had relied on a single, not-very-trustworthy source for a key fact.

I learned that AI models still struggle with long chats involving large amounts of information, and that they’re not good at telling the user when they’re in over their heads. The experience adjusted my trust in the tools.

Grappling with ambiguity

As we enter 2026, generative AI’s story is still in its early chapters. The story started with AI labs developing models that could converse, write, and summarize. Now the big AI labs seem confident that AI agents can autonomously work through complex tasks, calling on tools and checking their work against expert data. They seem convinced that the agents will soon manage ambiguity with humanlike judgment.

If large companies begin to trust that these agents can reliably do such jobs, it would mean enormous revenues for the AI companies that developed them. Based on their current investments of hundreds of billions into AI infrastructure, the AI companies and their backers seem to believe this outcome is close at hand.

Even if the AI could bring human-level intellect to business scenarios tomorrow, it may still take time to build trust among decision-makers and workers. Today, trust in AI isn’t high. The consulting firm KPMG surveyed 48,000 people in 47 countries (two-thirds of whom use AI regularly) and found that while 83% believe AI will be beneficial, only 46% actually trust the output of AI tools. Some may have a false trust in the technology: two-thirds of the respondents say they sometimes rely on AI output without evaluating its accuracy.

But I doubt that AI agents are ready to complete complex tasks and manage ambiguity like human experts might. As the agents are used by more people and businesses, they will encounter a universe of unique problems, in contexts they’ve never seen before. I doubt that current AI agents understand the ways of humans and the world well enough to improvise their way through such situations. Not yet anyway.

The limitations of the models

The fact is that AI companies are using the same kind of (transformer-based) AI models to underpin reasoning agents that they used for early chatbots that were essentially word generators. The core function of such models, and the objective of all their training, is predicting the next word (or pixel or audio bit) in a sequence, Microsoft AI CEO (and Google DeepMind cofounder) Mustafa Suleyman explained in a recent podcast. “It is using that very simple likelihood-of-word prediction function to simulate what it’s like to have a great conversation or to answer complex questions,” he said.

Suleyman and others doubt that such prediction alone can produce humanlike judgment. Suleyman believes that current models don’t account for some of the key drivers of the things humans say and do. “Naturally, we would expect that something that has the hallmarks of intelligence also has the underlying synthetic physiology that we do, but it doesn’t,” Suleyman said. “There is no pain network. There is no emotional system. There is no inner will or drive or desire.”

AI pioneer (and Turing Award winner) Yann LeCun says the LLMs of today are useful enough to be applied in some valuable ways, but thinks they’ll never achieve the general or human-level intelligence needed to do the really high-value work the AI companies hope they will. In order to learn to intuit paths through real-world complexity, AI systems would need a much higher-bandwidth training regimen than just words, images, and computer code, LeCun says. They may need to learn the world via something more like the multisensory experience babies have, and possess the uncanny ability to process and store all that information quickly, as babies can, he says.

Suleyman and LeCun may be wrong. Companies like OpenAI and Anthropic may achieve human-level intelligence using models whose origin is in language.

AI governance matters

Meanwhile, competence is just one factor in AI trust among business users. Enterprises use governance platforms to monitor whether and how AI systems might be creating regulatory compliance issues or exposing the company to risk of cyberattack, for example. “When it comes to AI, large enterprise companies . . . want to be trusted by customers, investors, and regulators,” says Navrina Singh, founder and CEO of the governance platform Credo AI. “AI governance isn’t slowing us down, it’s the only thing that allows measurable trust and lets intelligence scale without breaking the world.”

In the meantime the pace at which humans delegate tasks to AI will be moderated by trust. AI tools should be used for tasks they’re good at, so that confidence in the results grows. That’ll take time, and it’s a moving target because the AI is continually improving. Discovering and delegating new tasks for AI, monitoring the results, and adjusting expectations will very likely become a routine part of work in the 21st century.

No, AI won’t reinvent business all at once next year. 2026 won’t be the “year of the agent.” It’ll take a decade for AI tools to prove out and become battle-hardened. Trust is the hardening agent.
