AI Agents Competition: ChatGPT Outperforms Anthropic's Claude and DeepSeek

AI agents are clearly improving, with OpenAI's o3 model leading the pack. But significant challenges persist, and despite the advances, today's agents are still outmatched by top human researchers on real research tasks.

Rankings of Artificial Intelligence Agents

Top AI models go head-to-head in a real-world research test, but they're still miles behind human researchers. Research firm FutureSearch put 11 major language models through their paces on 89 messy, real-world research tasks, evaluating how well each could find sources, gather data, verify claims, and more.

The standout performer was OpenAI's o3, which scored 0.51 on a scale where an estimated "perfect" agent would hit around 0.8. That result shows how far the best agents have come, but they remain well behind expert humans: the study stated that frontier agents "substantially underperform smart generalist researchers given ample time."

Here's how the various AI models stacked up:

  1. o3 (OpenAI) - 0.51
  2. Claude 3.7 Sonnet (Think) & Claude 3.7 Sonnet (Std) - Tied at 0.48
  3. Gemini 2.5 Pro - 0.45
  4. GPT-4.1 - 0.42
  5. DeepSeek-R1 - 0.31
  6. Mistral Small - 0.30
  7. GPT-4 Turbo - 0.27
  8. Gemma 3 - 0.20

Despite the obvious limitations, AI agents are on the up. A year ago, GPT-4 Turbo's score of 0.27 was state of the art, and the researchers suggest that nearly 45% of the gap between human researchers and frontier agents has been closed within that year (a quick sanity check of that arithmetic follows below). Interestingly, free or cheap agents like DeepSeek are hot on the heels of top-end agents from OpenAI.
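As a sanity check on that 45% figure, here is a minimal arithmetic sketch. It assumes the estimated 0.8 "perfect"-agent score stands in for human-level performance at the top of the gap; the study may have computed the figure differently:

```python
# Gap-closure arithmetic using the scores quoted in the article.
# Assumption: the 0.8 "perfect" agent estimate is the ceiling the
# gap is measured against.

last_year = 0.27  # GPT-4 Turbo, roughly a year ago
today = 0.51      # OpenAI's o3, the current best
ceiling = 0.80    # estimated "perfect" agent

gap_closed = (today - last_year) / (ceiling - last_year)
print(f"Share of the gap closed: {gap_closed:.0%}")  # -> 45%
```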

While AI agents have made significant progress, they still lag behind human researchers, especially in strategic planning, thoroughness, source evaluation, and "memory management." A common failure mode is "satisficing": settling for a good-enough answer instead of pushing on to the highest-quality one, as the toy sketch below illustrates.
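To make satisficing concrete, here is a minimal Python sketch. The candidate answers, quality scores, and the good-enough threshold are hypothetical illustrations, not data or code from the FutureSearch study:

```python
# Toy contrast between satisficing and optimizing. All candidates,
# scores, and the threshold below are hypothetical, not from the study.

candidates = [
    ("quick web snippet", 0.55),
    ("primary source", 0.90),
    ("cross-checked dataset", 0.95),
]

def satisfice(options, good_enough=0.5):
    """Return the first option that clears the bar, even if better ones exist."""
    for answer, quality in options:
        if quality >= good_enough:
            return answer, quality
    return None

def optimize(options):
    """Scan every option and return the best one."""
    return max(options, key=lambda pair: pair[1])

print(satisfice(candidates))  # ('quick web snippet', 0.55) -- settles early
print(optimize(candidates))   # ('cross-checked dataset', 0.95)
```

The gap between those two outputs is the pattern the study describes: an agent that stops at the first acceptable answer leaves better ones on the table.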

OpenAI's o3 came out on top largely because it validated its answers thoroughly and stopped short of superior answers less often than its rivals.

However, experts warn that there's no straight-line path to improvement, and it's crucial to double-check results from generative AI applications like AI agents to ensure accuracy. While AI's rapid advancements are impressive, we're not quite at the point where we can fully trust it to outperform even the best of us. Yet.

