Which chatbot is best at simple maths? A new study delivers a sobering answer.
Researchers tested five leading AI chatbots on 500 everyday maths questions. The results suggest users should be careful: overall, there was more than a 40 per cent chance that an AI would give the wrong answer.
The study, called ORCA, examined five tools many people use daily: ChatGPT, Gemini, Grok, Claude, and DeepSeek. All were tested in October 2025 with identical questions.
Gemini leads, but still struggles
No model scored above 63 per cent overall. Google’s Gemini ranked first at 63 per cent, with Grok close behind at 62.8 per cent. DeepSeek scored 52 per cent, ChatGPT reached 49.4 per cent, and Claude came last at 45.2 per cent.
The average accuracy across all models was just 54.5 per cent. That means nearly half of all answers were wrong.
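The headline average follows directly from the five per-model scores. As a quick sanity check, here is a minimal sketch that recomputes it (scores taken from the figures above):

```python
# Per-model accuracy scores (per cent) as reported by the ORCA study
scores = {
    "Gemini": 63.0,
    "Grok": 62.8,
    "DeepSeek": 52.0,
    "ChatGPT": 49.4,
    "Claude": 45.2,
}

# Mean accuracy across the five models
average = sum(scores.values()) / len(scores)
print(f"Average accuracy: {average:.1f} per cent")  # → 54.5
```

Rounded to one decimal place, the mean of the five scores is indeed 54.5 per cent, matching the study's figure.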
Big gaps between subjects
Performance varied sharply by topic. Basic maths and unit conversions delivered the strongest results. Gemini led here with 83 per cent accuracy.
Physics proved the hardest area. Average accuracy dropped to 35.8 per cent. Even the best models failed more than half the time.
DeepSeek performed especially poorly in biology and chemistry, answering correctly in only about one case in ten.
Common mistakes identified
Most errors were simple calculation slips or rounding problems. Others stemmed from faulty logic or misread questions, and some chatbots refused to answer at all.
Clear warning for users
Researchers urge caution, especially for important tasks. Experts advise double-checking results with calculators or trusted sources. AI may be fast, but it is far from reliable.