Tag: agents

All the articles with the tag "agents".

DeepMind's Aletheia Just Cracked Open Math Research – And It's Only Level 2

16 Feb, 2026
• 1 min read

DeepMind's new agent autonomously wrote a math paper and solved Erdős conjectures – is this the dawn of AI mathematicians?

Gaia2 Benchmark Exposes Why Your Coding Agents Crumble in Real Dynamic Worlds

16 Feb, 2026
• 1 min read

GPT-5 hits 42% on Gaia2 but flops on time-sensitive tasks – the agent benchmark that breaks sacred cows.

How2Everything: 351K Web Procedures to Finally Fix Your LLM's How-To Hallucinations

16 Feb, 2026
• 1 min read

Allen AI mined 351K real how-tos from the web – now your LLM instructions won't suck anymore.

Kimi's Agent Swarm: 100 AI Workers Crushing Complex Tasks (And It's Live)

15 Feb, 2026
• 1 min read

Kimi K2.5 unleashes ~100 sub-agents per task—your single-model bottlenecks are over.

GLM-5 Just Dropped: The Open Model Crushing Gemini at Half the Price

15 Feb, 2026
• 1 min read

744B params, tops every open benchmark, and costs just $0.80/M tokens—did Z.ai finally crack frontier performance for devs?

Z.ai's GLM-5 Just Dethroned Every Open Weights LLM (And It's Actually Usable)

13 Feb, 2026
• 1 min read

Open-source just hit a new high: GLM-5 crushes benchmarks with the lowest hallucinations ever—your next production model?

Anthropic's Claude Agents Hit Real Science Labs – TB-Scale Analysis in Hours

9 Feb, 2026
• 1 min read

Claude-powered multi-agent systems just deployed to Allen Institute: compressing months of genomics analysis into hours.

Alibaba's Qwen3-Coder-Next Just Made Coding Agents Free and Open Source

8 Feb, 2026
• 1 min read

What if your next coding agent ran locally, fixed bugs autonomously, and cost pennies to deploy? Alibaba just dropped it open-weight.

Anthropic's New Tool-Use API Lets Claude Build Your Entire App Stack - Game Changer

3 Feb, 2026
• 1 min read

Claude's tool-use API dropped today - it now autonomously calls GitHub, Vercel, Postgres, and Stripe to ship full apps from one prompt.

LLM Evaluations Just Hit 90% Accuracy - Finally Trust Your Model Benchmarks

2 Feb, 2026
• 1 min read

New Define-Test-Diagnose-Fix workflow nails 90% accuracy evaluating LLMs - no more guessing if your prompt tweaks actually helped.

Moonshot AI Just Dropped the World's Most Advanced Open-Source LLM - And It's Built for Agents

2 Feb, 2026
• 1 min read

This new open-source beast from Moonshot crushes reasoning benchmarks while sipping hardware - time to ditch your bloated closed models?

Moonshot's Kimi K2.5: The Agent That Generates Video *and* Thinks Autonomously

1 Feb, 2026
• 1 min read

Forget text-only LLMs—Kimi K2.5 builds videos from prompts and handles tasks solo, outpacing U.S. benchmarks.
