Stories by colinfly

What broke when I tried to evaluate an AI agent in production

1 point
colinfly
2026-03-17T17:58:49Z
news.ycombinator.com

Open-source LLM-as-judge eval suite with root cause analysis and failure mining

2 points
colinfly
2026-03-13T21:53:44Z
github.com

HN Paper by Régis Gaidot - Statistics