Stories by mustaphah

The cost YAGNI was never about

Writing code vs. shipping code [pdf]

Trust Factory

LLMs pass a standard three-party Turing test

The small sample trap in A/B testing

Tell HN: Claude two rate limits don't know about each other

Enhancing gut-brain communication reversed cognitive decline in aging mice

Many SWE-bench-Passing PRs would not be merged

AGI is an unscientific myth

Web Verbs

OpenAI's 5-month experiment: building a product with no human-written code

SkillsBench: Benchmarking how well agent skills work across diverse tasks

Evaluating AGENTS.md: are they helpful for coding agents?

Cursor: Expanding our long-running agents research preview

Measuring Time Horizon Using Claude Code and Codex

SWE-ContextBench: context learning benchmark in coding

SWE-AGI: benchmarking spec-driven software construction

Code Formatting Silently Consumes Your LLM Budget

Agent Trace by Cursor: open spec for tracking AI-generated code

METR releases Time Horizon 1.1 with 34% more tasks
