Google’s Gemini 3 model keeps the AI hype train going – for now

Gemini 3 is Google’s latest AI model

VCG via Getty Images

Google’s latest chatbot, Gemini 3, has made significant leaps on a raft of benchmarks designed to measure AI progress, according to the company. These achievements may be enough to allay fears of an AI bubble bursting for the moment, but it is unclear how well these scores translate to real-world capabilities.

What’s more, persistent factual inaccuracies and hallucinations that have become a hallmark of all large language models show no signs of being ironed out, which could prove problematic for any uses where reliability is vital.

In a blog post announcing the new model, Google bosses Sundar Pichai, Demis Hassabis and Koray Kavukcuoglu write that Gemini 3 has “PhD-level reasoning”, a phrase that competitor OpenAI also used when it announced its GPT-5 model. As evidence for this, they list scores on several tests designed to assess “graduate-level” knowledge, such as Humanity’s Last Exam, a set of 2500 research-level questions from maths, science and the humanities. Gemini 3 scored 37.5 per cent on this test, outclassing the previous record holder, a version of OpenAI’s GPT-5, which scored 26.5 per cent.

Jumps like this can indicate that a model has become more capable in certain respects, says Luc Rocher at the University of Oxford, but we need to be careful about how we interpret these results. “If a model goes from 80 per cent to 90 per cent on a benchmark, what does it mean? Does it mean that a model was 80 per cent PhD level and now is 90 per cent PhD level? I think it’s quite difficult to understand,” they say. “There is no number that we can put on whether an AI model has reasoning, because this is a very subjective notion.”

Benchmark tests have many limitations, such as requiring a single answer or multiple-choice answers for which models don’t need to show their working. “It’s very easy to use multiple choice questions to grade [the models],” says Rocher, “but if you go to a doctor, the doctor will not assess you with a multiple choice. If you ask a lawyer, a lawyer will not give you legal advice with multiple choice answers.” There is also a risk that the answers to such tests were hoovered up in the training data of the AI models being tested, effectively letting them cheat.

The real test for Gemini 3 and the most advanced AI models – and whether their performance will be enough to justify the trillions of dollars that companies like Google and OpenAI are spending on AI data centres – will be in how people use the model and how reliable they find it, says Rocher.

Google says the model’s improved capabilities will make it better at producing software, organising email and analysing documents. The firm also says it will improve Google search by supplementing AI-generated results with graphics and simulations.

It is likely that the real improvements will be for people who use AI tools to autonomously write code, a process called agentic coding, says Adam Mahdi at the University of Oxford. “I think we’re hitting the upper limit of what a typical chatbot can do, and the real benefits of Gemini 3 Pro [the standard version of Gemini 3] will probably be in more complex, potentially agentic workflows, rather than everyday chatting,” he says.

Initial reactions online have included people praising Gemini’s coding capabilities and ability to reason, but as with all new model releases, there have also been posts highlighting failures to do apparently simple tasks, such as tracing hand-drawn arrows pointing to different people, or simple visual reasoning tests.

Google admits, in Gemini 3’s technical specifications, that the model will continue to hallucinate and produce factual inaccuracies some of the time, at a rate that is roughly comparable with other leading AI models. The lack of improvement in this area is a big concern, says Artur d’Avila Garcez at City St George’s, University of London. “The problem is that all AI companies have been trying to reduce hallucinations for more than two years, but you only need one very bad hallucination to destroy trust in the system for good,” he says.
