HN
Paper
Show
Ask
Jobs
Top
Today
Last 7 days
Last months
This year
Statistics
Stories by
colinfly
What broke when I tried to evaluate an AI agent in production
1 point
colinfly
2026-03-17T17:58:49Z
news.ycombinator.com
Open-source LLM-as-judge eval suite with root cause analysis and failure mining
2 points
colinfly
2026-03-13T21:53:44Z
github.com