AI Agents Competition: ChatGPT Outperforms Anthropic's Claude and DeepSeek

AI agents are clearly improving, with OpenAI's o3 model leading the pack. But significant challenges persist, and despite the advances, today's agents are still outmatched by top human researchers on real research tasks.

Rankings of Artificial Intelligence Agents

Top AI models go head-to-head in a real-world research test, but they're still miles behind human researchers. Research firm FutureSearch put 11 major language models through their paces on 89 messy, real-world research tasks, evaluating how well each could find sources, gather data, verify claims, and more.

The standout performer was OpenAI's o3, which scored 0.51 on a scale where an estimated "perfect" agent would hit around 0.8. That result shows how far the best agents have come, but they remain well behind expert humans: the study stated that frontier agents "substantially underperform smart generalist researchers given ample time."

Here's how the various AI models stacked up:

  1. o3 (OpenAI) - 0.51
  2. Claude 3.7 Sonnet (Think) & Claude 3.7 Sonnet (Std) - Tied at 0.48
  3. Gemini 2.5 Pro - 0.45
  4. GPT-4.1 - 0.42
  5. DeepSeek-R1 - 0.31
  6. Mistral Small - 0.30
  7. GPT-4 Turbo - 0.27
  8. Gemma 3 - 0.20

Despite the obvious limitations, AI agents are on the up. A year ago, GPT-4 Turbo's score of 0.27 was state of the art, and the researchers suggest that nearly 45% of the gap between human researchers and frontier agents has been closed within that year (a quick sanity check of that arithmetic follows below). Interestingly, free or cheap agents like DeepSeek are hot on the heels of top-end agents from OpenAI.
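As a sanity check on that 45% figure, here is a minimal arithmetic sketch. It assumes the estimated 0.8 "perfect"-agent score stands in for human-level performance at the top of the gap; the study may have computed the figure differently:

```python
# Gap-closure arithmetic using the scores quoted in the article.
# Assumption: the 0.8 "perfect" agent estimate is the ceiling the
# gap is measured against.

last_year = 0.27  # GPT-4 Turbo, roughly a year ago
today = 0.51      # OpenAI's o3, the current best
ceiling = 0.80    # estimated "perfect" agent

gap_closed = (today - last_year) / (ceiling - last_year)
print(f"Share of the gap closed: {gap_closed:.0%}")  # -> 45%
```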

While AI agents have made significant progress, they still lag behind human researchers, especially in strategic planning, thoroughness, source evaluation, and "memory management." A common failure mode is "satisficing": settling for a good-enough answer instead of pushing on to the highest-quality one, as the toy sketch below illustrates.
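To make satisficing concrete, here is a minimal Python sketch. The candidate answers, quality scores, and the good-enough threshold are hypothetical illustrations, not data or code from the FutureSearch study:

```python
# Toy contrast between satisficing and optimizing. All candidates,
# scores, and the threshold below are hypothetical, not from the study.

candidates = [
    ("quick web snippet", 0.55),
    ("primary source", 0.90),
    ("cross-checked dataset", 0.95),
]

def satisfice(options, good_enough=0.5):
    """Return the first option that clears the bar, even if better ones exist."""
    for answer, quality in options:
        if quality >= good_enough:
            return answer, quality
    return None

def optimize(options):
    """Scan every option and return the best one."""
    return max(options, key=lambda pair: pair[1])

print(satisfice(candidates))  # ('quick web snippet', 0.55) -- settles early
print(optimize(candidates))   # ('cross-checked dataset', 0.95)
```

The gap between those two outputs is the pattern the study describes: an agent that stops at the first acceptable answer leaves better ones on the table.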

OpenAI's o3 came out on top largely because it validated its answers thoroughly and stopped short of superior answers less often than its rivals.

However, experts warn that there's no straight-line path to improvement, and it's crucial to double-check results from generative AI applications like AI agents to ensure accuracy. While AI's rapid advancements are impressive, we're not quite at the point where we can fully trust it to outperform even the best of us. Yet.

