By OpenAI's own testing, its newest reasoning models, o3 and o4-mini, hallucinate significantly more often than o1.
First reported by TechCrunch, OpenAI's system card detailed the results of the PersonQA evaluation, which is designed to test for hallucinations. According to those results, o3 hallucinated 33 percent of the time, and o4-mini 48 percent of the time — almost half the time. By comparison, o1's hallucination rate is 16 percent, meaning o3 hallucinated about twice as often.
The system card noted that o3 "tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims." But OpenAI doesn't know the underlying cause, saying only, "More research is needed to understand the cause of this result."
OpenAI's reasoning models are billed as more accurate than its non-reasoning models like GPT-4o and GPT-4.5 because they use more computation to "spend more time thinking before they respond," as described in the o1 announcement. Rather than largely relying on stochastic methods to provide an answer, the o-series models are trained to "refine their thinking process, try different strategies, and recognize their mistakes."
However, the system card for GPT-4.5, which was released in February, shows a 19 percent hallucination rate on the PersonQA evaluation. The same card also compares it to GPT-4o, which had a 30 percent hallucination rate.
In a statement to Mashable, an OpenAI spokesperson said, “Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability.”
Evaluation benchmarks are tricky. They can be subjective, especially if developed in-house, and research has found flaws in their datasets and even how they evaluate models.
Plus, some rely on different benchmarks and methods to test accuracy and hallucinations. HuggingFace's hallucination benchmark evaluates models on the "occurrence of hallucinations in generated summaries" from around 1,000 public documents, and it found much lower hallucination rates across the board for major models on the market than OpenAI's evaluations did. GPT-4o scored 1.5 percent, GPT-4.5 preview 1.2 percent, and o3-mini-high with reasoning scored 0.8 percent. It's worth noting that o3 and o4-mini weren't included in the current leaderboard.
That's all to say: even industry-standard benchmarks make it difficult to assess hallucination rates.
Then there's the added complexity that models tend to be more accurate when tapping into web search to source their answers. But ChatGPT search requires OpenAI to share data with third-party search providers, and enterprise customers using OpenAI models internally might not be willing to expose their prompts to that.
Regardless, if OpenAI itself is saying its brand-new o3 and o4-mini models hallucinate more often than its non-reasoning models, that could be a problem for its users.
UPDATE: Apr. 21, 2025, 1:16 p.m. EDT This story has been updated with a statement from OpenAI.