Curious · Independent · Optimistic

The humans on AI's side.

We track the agents, rank the models, benchmark the hardware, and report the news. We advocate for a future where open, accessible AI benefits everyone. Curiosity over fear. Freedom as a foundation.

5 leaderboards tracked
26+ providers benchmarked
100% human-curated
Nemotron 3 Ultra (550B MoE) leads agent orchestration · Hermes Agent v0.16.0 ships native desktop · Pi Agent v0.77.0 adds Opus 4.8 · Cerebras hits 2,522 TPS on Llama 4 Maverick · FuriosaAI + Broadcom partner on inference chiplets
Standing between worlds since day one
Live Rankings

Who's leading the AI race right now

Data from LMSYS Chatbot Arena, SWE-bench Verified, and agent benchmarks.

Arena Elo Top 5

1. Claude Opus 4.8 · Anthropic · May 2026
Arena Elo 1512 · #1 Overall · $5/$25 per M
Anthropic's latest flagship. Leads Arena, SWE-bench (87.6%), and agent benchmarks. 1M context. Top choice for coding agents.

2. GPT-5.5 Pro · OpenAI · Apr 2026
Arena Elo 1510 · #2 Overall · $30/$180 per M
OpenAI's premium flagship. Strong reasoning and multi-modal. Highest SWE-bench among OpenAI models. Premium pricing.

3. GPT-5.5 · OpenAI · Apr 2026
Arena Elo 1506 · #3 Overall · $5/$30 per M
OpenAI's standard flagship. Excellent general-purpose model at more accessible pricing than Pro tier.

4. Claude Opus 4.7 · Anthropic · Apr 2026
Arena Elo 1505 · #4 Overall · $5/$25 per M
Previous-gen Anthropic flagship. Still elite for agentic workflows. SWE-bench 87.6%. 1M context.

5. Gemini 3.1 Pro · Google · Apr 2026
Arena Elo 1505 · #5 Overall · $2/$12 per M
Google's best. 1M context, strong on science & long-context tasks. SWE-bench 80.6%. Best value among frontier models.

SWE-bench Verified (Coding)

1. Claude Opus 4.7 · Anthropic
SWE-bench Verified 87.6%
Best coding agent available. Excels at multi-file refactors, test-driven development, and complex bug fixes.

2. GPT-5.3 Codex · OpenAI
SWE-bench Verified 85.0%
OpenAI's dedicated coding model. Built for software engineering tasks. Strong on contained problems.

3. Gemini 3.1 Pro · Google
SWE-bench Verified 80.6%
Google's strongest coding model. Benefits from massive context window for large codebases.

4. Kimi K2.6 · Moonshot AI
SWE-bench Verified 80.2% · $0.73/$3.49 per M
Open-weight frontier. 1T MoE params. Excellent long-context Q&A. Modified MIT license.

5. DeepSeek V4-Flash · DeepSeek
SWE-bench Verified ~79% · $0.14/$0.28 per M
Incredible value. Near-frontier coding at 1/100th the cost. 1M context. Best budget coding agent.

Agent Benchmarks
GAIA

GAIA (General AI Assistants)

1. Claude Sonnet 4.5: 74.6% (HAL scaffold)
2. Claude Opus 4.6: ~71% · 3. Claude Opus 4.5: ~69%

Measures reasoning, multi-modality, web browsing, and tool use on assistant tasks. 466 questions across 3 difficulty levels. HAL scaffold adds ~30 pts over bare model.

WebArena

WebArena (Web Navigation)

1. Claude Mythos Preview: 68.7%
2. GPT-5.4 Pro: 65.8% · 3. Claude Opus 4.6: 64.5%

812 tasks across 5 realistic websites. Tests browser-based agents. Human baseline: 78%. Best hybrid computer-use agents now lead.

AgentBench

AgentBench (8 Environments)

1. Claude Opus 4.7: ~73%
2. GPT-5.3 Codex: ~70% · 3. Gemini 3.1 Pro: ~66%

Broadest benchmark: OS shell, SQL, knowledge graphs, web shopping, and more. Catches weaknesses single-domain benchmarks miss.

Sources: arena.ai · swebench.com · Rapid Claw

AI News

What's happening in AI right now

Agents, models, hardware, inference speed, and open source: the developments that matter for what's actually shipping and in use.

Agents

Hermes Agent v0.16.0: Native Desktop App, 187k Stars

Nous Research shipped "The Surface Release": a full native desktop app (macOS/Linux/Windows) for Hermes Agent, a web dashboard admin panel, fuzzy model picker, /undo command, NVIDIA/skills trusted tap, and Quick Setup via Nous Portal. 874 commits, 542 merged PRs, 170 contributors. 187k GitHub stars.

Hardware

NVIDIA Nemotron 3 Ultra: 550B MoE for Agent Orchestration

NVIDIA released Nemotron 3 Ultra: 550B parameters with 55B active, built for long-running agent orchestration. Hybrid Mamba-Transformer, NVFP4 quantization (5x throughput), LatentMoE, multi-token prediction. 30% cost savings on agentic tasks. Fully open weights, data, and recipes.

Agents

Holo3.1: Computer-Use Agents Go Local with NVFP4

H Company released Holo3.1 with mobile automation (AndroidWorld 79.3%), cross-harness support, and quantized FP8/NVFP4/GGUF checkpoints. On DGX Spark, NVFP4 delivers 2x speedup, cutting step time from 6.8s to 3.3s. Sizes from 0.8B to 35B-A3B.

Open Source

Mellum2: JetBrains 12B MoE for Agent Sub-Tasks

JetBrains released Mellum2: a 12B-parameter MoE (2.5B active) for routing, RAG, and sub-agents. Apache 2.0. 2x faster inference than similar-sized models. Designed for high-frequency tasks inside larger AI systems. It's the "focal" model in your agent stack.

Agents

Pi Agent v0.77.0: Claude Opus 4.8, Exclude Tools, 60.9k Stars

Pi Agent added Claude Opus 4.8 support, --exclude-tools flag for selective tool disablement, headless Codex subscription login, and streaming-aware extension input. 60.9k GitHub stars. Pi is not another agent SDK. That's the whole point.

Hardware

FuriosaAI + Broadcom: New Inference Chiplet Platform

FuriosaAI partnered with Broadcom to develop a 3rd-gen AI accelerator using a multi-die chiplet design. The RNGD chip is in mass production at TSMC. The TCP architecture targets agentic workloads with HBM4/4E and a 2nm process. A CUDA-alternative SDK ships with a PyTorch compiler.

Hardware

Netrasemi A2000: India's First AI Chip Begins Customer Trials

Zoho-backed Netrasemi successfully tested its A2000 AI SoC, ready for edge devices. TSMC 12nm process, targeting smart cameras and automotive. Early trials with 3 customers. Part of India's DLI scheme. A4000 server chip expected Q2 2027.

Hardware

NVIDIA Vera Rubin in Full Production for Agentic AI Factories

NVIDIA announced Vera Rubin NVL72 is now in full production. The platform powers "agentic AI factories" worldwide with 72 Rubin GPUs per rack, NVLink-C2C fabric, and NVIDIA's own Vera ARM CPU. Multi-anchor system design includes Intel Xeon 6 and Groq LP30.

Speed

Cerebras Hits 2,522 TPS: Wafer-Scale Inference Crown

Cerebras CS-3 delivers 2,522 tokens/s on Llama 4 Maverick (400B), more than 2x NVIDIA Blackwell. Wafer-scale engine holds entire models in SRAM, eliminating memory bottlenecks. Groq LPU follows at 549 TPS. For batch processing and large-scale inference, custom silicon is winning.

Models

Mercury 2: Diffusion LLM Hits 629 TPS with Reasoning

Inception Labs launched Mercury 2, the first diffusion-based reasoning LLM. Generates tokens in parallel instead of sequentially. 629 TPS (up to 1,100), 5x faster than leading fast LLMs. Matches Claude 4.5 Haiku on AIME 2025 (91.1%) and GPQA (73.6%).

Benchmarks Explained

What the scores actually mean

Not all benchmarks measure the same thing. Here's a plain-language guide to the tests that matter.

Arena

LMSYS Chatbot Arena

What it measures: Blind human preference. Real users compare two anonymous responses and pick a winner. 6M+ votes aggregated via Bradley-Terry Elo.

Why it matters: Best proxy for "which model do people actually prefer chatting with?" A 100-Elo gap ≈ 64% win rate.
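
The 64% figure follows from the standard Elo expected-score formula. A minimal sketch, assuming the usual 400-point logistic scale (the function name is illustrative, not part of any leaderboard API):

```python
def elo_win_probability(rating_gap: float) -> float:
    """Expected win rate of the higher-rated model for a given Elo gap.

    Assumes the conventional Elo logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


if __name__ == "__main__":
    # 100 is the gap quoted above; 7 is the gap between #1 (1512) and #5 (1505).
    for gap in (7, 100):
        print(f"Elo gap {gap:>3}: {elo_win_probability(gap):.1%} expected win rate")
```

On this curve, the 7-point spread across the current top five works out to roughly a 51% expected win rate, so the leaders are close to a coin flip in head-to-head votes.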

SWE-bench

SWE-bench Verified

What it measures: Can the agent fix real GitHub issues? 500 hand-verified Python bugs. Agent submits a patch; hidden tests verify correctness.

Why it matters: Closest public proxy for "can this agent do real software engineering work?"
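
Because the test set has 500 issues, each percentage point is five issues. A small sketch that converts a reported score into an approximate count of resolved issues (the helper name is hypothetical, and rounding to a whole issue is an assumption about how the percentages were reported):

```python
TOTAL_ISSUES = 500  # size of SWE-bench Verified, as stated above


def issues_resolved(score_percent: float, total: int = TOTAL_ISSUES) -> int:
    """Approximate number of issues resolved for a reported percentage."""
    return round(score_percent / 100 * total)


if __name__ == "__main__":
    # Scores taken from the SWE-bench Verified ranking on this page.
    for model, score in [("Claude Opus 4.7", 87.6),
                         ("Gemini 3.1 Pro", 80.6),
                         ("Kimi K2.6", 80.2)]:
        print(f"{model}: about {issues_resolved(score)} of {TOTAL_ISSUES} issues")
```

Read this way, the 0.4-point gap between 80.6% and 80.2% is a difference of two issues, which is worth keeping in mind when comparing neighbors on the list.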

GAIA

What it measures: General AI assistant tasks: reasoning, multi-modality, web browsing, tool use. 466 questions at 3 difficulty levels.

Why it matters: Tests multi-step reasoning with tools. Scaffold gap is huge: bare ~45%, scaffolded ~75%.

WebArena

What it measures: Web navigation agents. 812 tasks across realistic replicas of a Reddit-style forum, GitLab, and e-commerce sites.

Why it matters: Canonical benchmark for browser-using agents. Human baseline: 78%.

AgentBench

What it measures: 8 environments: OS shell, SQL, knowledge graphs, card games, lateral-thinking puzzles, household sim, web shopping, and web browsing.

Why it matters: Broadest benchmark. Catches weaknesses others miss. Always read per-env breakdowns.

Note

How to Read Scores Wisely

April 2026: UC Berkeley RDI showed how eval harness quirks can inflate agent benchmark scores — great news for anyone who wants honest measurement.

Pro tip: Favor independent third-party scores (Epoch AI, BenchLM) and multi-run results. The best models shine under careful scrutiny.

Agent Frameworks

Which framework runs your agents best?

The model matters, but the scaffolding matters more. The same model can swing 30-50 points depending on the framework wrapping it.

LangGraph

Best for: Cost & Efficiency
$0.08
per task (GPT-4o, customer support)
  • Lowest latency (p95)
  • 45 MB memory (10 agents)
  • Baseline token overhead
  • Explicit state-graph model
  • Best for high-volume production

CrewAI

Best for: Time-to-Production
40%
faster to first working agent
  • "Role + goal + backstory" pattern
  • $0.09/task (+18% token overhead)
  • 120 MB memory (10 agents)
  • Medium latency
  • Best for early-stage products

AutoGen

Best for: Open-Ended Reasoning
5-6x
cost of LangGraph (but best quality)
  • Multi-agent chat pattern
  • $0.45/task (+400-500% tokens)
  • 200 MB memory (10 agents)
  • Excels at ambiguous problems
  • Research-first framework

Source: Rapid Claw 2026 Framework Showdown
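
To see what the per-task prices mean at volume, here is a rough projection using the figures quoted above. The 100,000 tasks/month workload is a hypothetical chosen for illustration, and the sketch assumes cost scales linearly with task count:

```python
# Per-task costs as quoted in the framework cards above (USD).
COST_PER_TASK = {"LangGraph": 0.08, "CrewAI": 0.09, "AutoGen": 0.45}


def monthly_cost(framework: str, tasks_per_month: int) -> float:
    """Projected monthly spend, assuming cost is linear in task count."""
    return COST_PER_TASK[framework] * tasks_per_month


def cost_ratio(framework: str, baseline: str = "LangGraph") -> float:
    """How many times more expensive a framework is than the baseline."""
    return COST_PER_TASK[framework] / COST_PER_TASK[baseline]


if __name__ == "__main__":
    tasks = 100_000  # hypothetical monthly volume
    for name in COST_PER_TASK:
        print(f"{name}: ${monthly_cost(name, tasks):,.0f}/month "
              f"({cost_ratio(name):.1f}x LangGraph)")
```

The AutoGen ratio from these prices falls inside the 5-6x range quoted in its card.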

AI Agents & Frameworks

The agent world is shipping fast

From terminal-native coding agents to self-hosted personal assistants. Here's what's actually running, what's open source, and what's worth your time.

Agent

Hermes Agent v0.16.0

187k ⭐ · Nous Research · June 2026

The most-starred open agent framework. v0.16.0 shipped a native desktop app (Electron, macOS/Linux/Windows), full web dashboard admin panel, fuzzy model picker, /undo command, NVIDIA/skills trusted tap, and Quick Setup via Nous Portal. Connect to remote gateways over OAuth. 874 commits in one release.

github.com/NousResearch/hermes-agent →
Agent

Pi Agent v0.77.0

60.9k ⭐ · Earendil Works · May 2026

Not another agent SDK. It's a full coding agent harness. v0.77.0 adds Claude Opus 4.8 support, --exclude-tools for selective tool disablement, headless Codex subscription login, and streaming-aware extension input. Telegram bridge, Grok subagents, taskflow orchestration, and image generation built in.

github.com/earendil-works/pi →
Security

Self-Hosting, Done Right

OpenClaw · NemoClaw · Hermes Agent

The self-hosted assistant dream is real — pick the secure path. NVIDIA's NemoClaw wraps OpenClaw in a hardened runtime (OpenShell), and Hermes Agent delivers the same vision with strong security defaults out of the box. Encrypted credentials and locked-down websockets are now table stakes, and the ecosystem is rising to meet them.

XDA comparison →
Computer-Use

Holo3.1: Computer-Use Agents

H Company · Apache 2.0 · June 2026

Top computer-use model. Controls desktop, browser, and mobile (AndroidWorld 79.3%). Ships quantized FP8, NVFP4, and Q4 GGUF checkpoints for local inference. On DGX Spark, NVFP4 delivers 2x speedup, 3.3s per step. Sizes from 0.8B to 35B-A3B.

huggingface.co/Hcompany/holo31 →
Orchestration

NVIDIA Nemotron 3 Ultra

550B MoE (55B active) · Open weights · June 2026

Built for agent orchestration. Hybrid Mamba-Transformer for long context, NVFP4 quantization (5x throughput vs BF16), LatentMoE routing, multi-token prediction. 30% cost savings on agentic tasks. PinchBench 91%, Ruler @1M 95%. Fully open: weights, data, recipes, NeMo RL training code.

huggingface.co/nvidia/Nemotron-3-Ultra →
Focal Model

Mellum2: JetBrains

12B MoE (2.5B active) · Apache 2.0 · June 2026

The "focal" model for agent stacks. Built for routing, RAG pipelines, sub-agents, and high-frequency text+code tasks. 2x faster inference than similar-sized models. Designed to be the fast, cheap layer between your expensive reasoning model and your tools.

huggingface.co/JetBrains/mellum-2 →
GPU & ASIC Hardware

A golden age of inference silicon

NVIDIA keeps raising the bar while custom ASICs from Cerebras, Groq, SambaNova, and FuriosaAI redefine inference speed and efficiency. Plus exciting new entrants from India and beyond.

NVIDIA

NVIDIA Vera Rubin NVL72

Full production · June 2026

72 Rubin GPUs per rack, NVLink-C2C fabric, NVIDIA Vera ARM host CPU. Powers "agentic AI factories" worldwide. Blackwell Ultra MLPerf v6.0 submission showed 2.77x speedup on DeepSeek-R1 over previous gen. Only vendor to submit DeepSeek-R1 results.

NVIDIA newsroom →
Cerebras

Cerebras CS-3: Wafer-Scale Engine

2,522 TPS on Llama 4 Maverick · Q1 2026

Single chip the size of a dinner plate with ~21 PB/s on-chip memory bandwidth. Holds entire models in SRAM, eliminating the memory bottleneck. 2,100 TPS on Llama 3.3 70B, 1,800 TPS on 8B. Verified by Artificial Analysis. 4-6x faster than Groq on identical models.

Cerebras blog →
Groq

Groq LPU: Deterministic Speed

549 TPS on Llama 4 Maverick · ~750 tok/s throughput

Compiler-driven architecture pre-computes the entire execution graph. Near-zero latency variance. Best for voice assistants and real-time apps where predictable TTFT matters more than peak throughput. ~150ms p50 TTFT on Llama 3.3 70B.

groq.com →
SambaNova

SambaNova SN50 RDU

Shipping H2 2026 · 794 TPS on Llama 4 Maverick

Reconfigurable Dataflow Unit built for agentic workloads. Three-tier memory architecture supporting models up to 10T parameters. Claims 895 TPS per user on Llama 3.3 70B with FP8, nearly 5x NVIDIA B200. 1.6 PFLOPS FP16.

SambaNova blog →
FuriosaAI

FuriosaAI RNGD + Broadcom

Mass production · TSMC · May 2026

Tensor Contraction Processor (TCP), a clean-sheet design vs GPU "legacy tax." RNGD chip in mass production at TSMC. 3rd-gen chiplet platform with Broadcom. SDK with general compiler maps PyTorch to silicon. CUDA-alternative with deterministic performance.

FuriosaAI blog →
Netrasemi

Netrasemi A2000: India's First AI Chip

Zoho-backed · TSMC 12nm · Customer trials

Kerala-based startup's flagship AI SoC for edge devices: smart cameras, automotive, edge AI boxes. Built on TSMC 12nm with in-house NPU, vision, and security engines. Rs 15 crore DLI scheme support. A4000 server chip expected Q2 2027.

ET report →
Inference Speed & Providers

Who's fastest, cheapest, and best?

Latency and throughput across 26+ providers. Measured, not marketed.

Speed

Fastest TTFT: Cerebras

~120ms p50 on Llama 3.3 70B

Wafer-scale engine eliminates the memory bottleneck. ~520 tokens/s throughput. For real-time UI where sub-300ms TTFT is the difference between "instant" and "waiting." Groq follows at ~150ms p50 with ~750 tok/s throughput.

Cerebras benchmarks →
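
TTFT and throughput trade off differently depending on response length. A minimal sketch, taking the approximate figures in this card at face value and assuming a simple model where total time is TTFT plus output tokens divided by throughput (network overhead and variance are ignored; the function name is illustrative):

```python
def response_time_s(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Estimated time to the last token: first-token latency plus streaming time."""
    return ttft_s + output_tokens / tokens_per_s


# (p50 TTFT in seconds, throughput in tokens/s), as quoted in the card above.
PROVIDERS = {"Cerebras": (0.120, 520.0), "Groq": (0.150, 750.0)}

if __name__ == "__main__":
    for output_tokens in (10, 500):
        for name, (ttft, tps) in PROVIDERS.items():
            total = response_time_s(ttft, tps, output_tokens)
            print(f"{output_tokens:>4} tokens on {name}: {total:.2f}s")
```

Under these assumptions the lower TTFT wins for very short replies, while the higher throughput wins once responses run to hundreds of tokens, so the "fastest" provider depends on which of the two your workload is bound by.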
Value

Cheapest: Gemini 2.5 Flash-8B

$0.075/$0.30 per 1M · $0.142 blended

10-20x cheaper than GPT-4o while still scoring 63/100 on quality. Perfect for high-volume classification, routing, and simple summarization. For a 100k-MAU chatbot: ~$107/month vs $3,575 on GPT-4o.

VerticalAPI benchmarks →
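
A blended price is a weighted average of input and output prices. The sketch below assumes a 70/30 input/output token mix, which is a guess that happens to reproduce the $0.142 figure; the source does not state the mix. The 750M tokens/month volume is likewise hypothetical, chosen because it lands near the ~$107/month estimate:

```python
def blended_price(input_price: float, output_price: float,
                  input_share: float = 0.7) -> float:
    """Weighted price per 1M tokens for a given input/output token mix.

    input_share=0.7 is an assumption, not a figure from the source.
    """
    return input_share * input_price + (1.0 - input_share) * output_price


def monthly_cost(tokens_per_month: float, price_per_million: float) -> float:
    """Monthly spend in USD for a total token volume at a per-1M price."""
    return tokens_per_month / 1_000_000 * price_per_million


if __name__ == "__main__":
    price = blended_price(0.075, 0.30)  # input/output prices quoted above
    print(f"Blended: ${price:.4f} per 1M tokens")
    # Hypothetical volume of 750M tokens per month.
    print(f"Monthly: ${monthly_cost(750_000_000, price):.2f}")
```

Changing `input_share` shows how sensitive the blended number is to the traffic mix: output-heavy workloads pay closer to the $0.30 rate.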
Balanced

Best Balanced: Gemini 2.5 Flash

221 TPS · $0.60/M output · 0.45s TTFT

The default recommendation for speed-sensitive production. Chatbot Arena Elo 1335. Fast enough for smooth streaming, cheap enough for scale, smart enough for most tasks. 25x cheaper than GPT-5 (high) with comparable quality for 90% of prompts.

Quality

Best Quality: Claude Opus 4.6

94/100 avg · Coding 94 · Reasoning 95

Top-tier across coding, reasoning, and creative. But 5x the cost of Sonnet 4.5 (91/100) for only 3 quality points. Use Opus for the 5% of queries where marginal quality matters. Sonnet is the right default.

Parallel

Fastest Architecture: Mercury 2

629 TPS · Diffusion LLM · Up to 1,100 TPS

Diffusion-based architecture generates tokens in parallel, not sequentially. 13x faster than Claude 4.5 Haiku. Matches Haiku on reasoning benchmarks (AIME 2025: 91.1%). The speed advantage is architectural. It doesn't require custom silicon.

Mercury 2 review →
Insight

Key Insight: p95 > p50

Tail latency ruins streaming UX

GPT-4o swings from 820ms p50 to 1.9s p95. Cerebras and Groq have near-zero variance. For voice assistants and real-time chat, design around p95. That's what users actually feel. A single 8-second outlier ruins the experience.

Speed leaderboard →
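
To design around p95 you first have to measure it. A minimal sketch using the nearest-rank percentile method; the latency samples are hypothetical, constructed so that p50 and p95 match the 820ms and 1.9s figures quoted above, with one 8-second outlier:

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical time-to-first-token samples in milliseconds (20 requests).
latencies_ms = [790, 795, 800, 805, 810, 812, 815, 818, 819, 820,
                825, 830, 840, 850, 870, 900, 950, 1200, 1900, 8000]

if __name__ == "__main__":
    print("p50:", percentile(latencies_ms, 50), "ms")
    print("p95:", percentile(latencies_ms, 95), "ms")
    print("max:", percentile(latencies_ms, 100), "ms")
```

The median hides both the 1.9s tail and the 8s outlier, which is why a p50-only dashboard can look healthy while users see stalls.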

Sources: VerticalAPI · Awesome Agents · Artificial Analysis

Open Source Spotlight

The open-weight world is winning

Open models aren't catching up. They're leading. This is why AI freedom matters.

Qwen

Qwen Family (Alibaba)

11 of top 20 LLMs on HuggingFace · ~100M downloads

Dominates open-weight deployment. Qwen3-0.6B (19M downloads), Qwen3-8B (12M), Qwen3.5-Max (Elo 1465). Apache 2.0. From 0.6B to 397B MoE. The default for self-hosted agents.

DeepSeek

DeepSeek R1 / V4 (DeepSeek)

Most-liked open LLM: 13,329 likes · MIT license

R1 is the only open-weight reasoning model cited as frontier-grade. V4-Flash: $0.14/$0.28 per M. V4 Pro: Elo 1467, SWE-bench 80.6%. 75% price cut in June 2026. 1M context.

Gemma

Gemma 4 (Google)

Runs on 16GB laptops · Apache 2.0 · Native audio+vision

Unified multimodal, encoder-free. 12B variant runs on a single 4090. AIME 2026: 77.5. LiveCodeBench v6: 72.0. 256K context. The new local AI standard.

Kimi

Kimi K2.6 (Moonshot)

SWE-bench 80.2% · Modified MIT · ~1T MoE params

Frontier quality at $0.73/$3.49 per M. 12+ hour autonomous runs documented. Deep Research, Sheets, Agent Swarm, Kimi Code. One of the most included open-weight options.

Source: Presenc AI May 2026 Report