Stories by bisonbear

The Opus 4.7 reasoning curve - Medium is the best default?

GPT-5.5 low vs. medium vs. high vs. xhigh: the reasoning curve on 26 real tasks

GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repos

I ran Opus 4.7 vs. Old Opus 4.6 vs. New Opus 4.6 on 28 Zod tasks

Coding evals are broken. CI is green while AI code quality goes unmeasured

Agents.md is the highest-leverage code you're not testing

Your AI coding benchmark is hiding a 2x quality gap

Things I Learned at the Claude Code NYC Meetup

Claude vs. Codex in the Messy Middle

Spacetime as a Neural Network

One agent isn't enough

Context Engineering: The New Skill for Working with AI Agents

The New Math of Building with AI