HN
Paper
Stories by zone411
LLM Persuasion Benchmark: Multi-Turn Persuasion Between Models
8 points
zone411
2026-03-27T17:02:35Z
github.com
Show HN: LLM Debate Benchmark
9 points
zone411
2026-03-23T20:49:45Z
github.com
Show HN: LLM Sycophancy Benchmark: Opposite-Narrator Contradictions
3 points
zone411
2026-03-10T04:51:18Z
github.com
Show HN: LLM Round‑Trip Translation Benchmark
6 points
zone411
2025-09-15T16:20:04Z
github.com
Show HN: LLM Creative Story‑Writing Benchmark V3
8 points
zone411
2025-09-10T11:04:27Z
github.com
Show HN: Mapping LLM Style and Range in Flash Fiction
7 points
zone411
2025-09-03T11:43:50Z
github.com
Pact: Head-to-head negotiation benchmark for LLMs
6 points
zone411
2025-08-21T16:37:36Z
github.com
Show HN: Bazaar – a new LLM benchmark for economic reasoning under uncertainty
8 points
zone411
2025-07-22T17:55:18Z
github.com
AI Comes Up with Physics Experiments. But They Work
4 points
zone411
2025-07-21T14:40:32Z
www.quantamagazine.org
Emergent Price-Fixing by LLM Auction Agents
7 points
zone411
2025-07-15T11:37:28Z
github.com
Public Goods Game Benchmark: Contribute and Punish, a Multi-Agent Benchmark
7 points
zone411
2025-03-20T15:32:52Z
github.com
Elimination Game: Multi-Agent LLM Social Reasoning, Strategy, and Deception
5 points
zone411
2025-02-25T19:55:17Z
github.com
SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork
110 points
zone411
2025-02-18T05:25:05Z
arxiv.org
LLM Hallucination Benchmark: R1, o1, o3-mini, Gemini 2.0 Flash Think Exp 01-21
17 points
zone411
2025-02-10T19:09:20Z
github.com
Multi-Agent Step Race Benchmark: LLM Collaboration and Deception Under Pressure
7 points
zone411
2025-01-22T16:17:47Z
github.com
Show HN: LLM Thematic Generalization Benchmark
6 points
zone411
2025-01-14T17:32:45Z
github.com
Show HN: LLM Creative Story-Writing Benchmark
5 points
zone411
2025-01-06T16:36:03Z
github.com
Show HN: LLM Divergent Thinking Creativity Benchmark
8 points
zone411
2024-12-30T18:29:30Z
github.com
Show HN: LLM Deceptiveness and Gullibility Benchmark
7 points
zone411
2024-10-22T16:27:14Z
github.com
LLM Confabulation (Hallucination) Leaderboard
6 points
zone411
2024-10-10T15:34:33Z
github.com