Stories by zone411

LLM Persuasion Benchmark: Multi-Turn Persuasion Between Models

Show HN: LLM Debate Benchmark

Show HN: LLM Sycophancy Benchmark: Opposite-Narrator Contradictions

Show HN: LLM Round‑Trip Translation Benchmark

Show HN: LLM Creative Story‑Writing Benchmark V3

Show HN: Mapping LLM Style and Range in Flash Fiction

Pact: Head-to-head negotiation benchmark for LLMs

Show HN: Bazaar – a new LLM benchmark for economic reasoning under uncertainty

AI Comes Up with Physics Experiments. But They Work

Emergent Price-Fixing by LLM Auction Agents

Public Goods Game Benchmark: Contribute and Punish, a Multi-Agent Benchmark

Elimination Game: Multi-Agent LLM Social Reasoning, Strategy, and Deception

SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork

LLM Hallucination Benchmark: R1, o1, o3-mini, Gemini 2.0 Flash Think Exp 01-21

Multi-Agent Step Race Benchmark: LLM Collaboration and Deception Under Pressure

Show HN: LLM Thematic Generalization Benchmark

Show HN: LLM Creative Story-Writing Benchmark

Show HN: LLM Divergent Thinking Creativity Benchmark

Show HN: LLM Deceptiveness and Gullibility Benchmark

LLM Confabulation (Hallucination) Leaderboard
