Stories by aranguri

Predicting Rare LLM Failures with 30× Fewer Rollouts

Safety benchmarks are inflated because models know they're being tested

Probes trace an emergent jailbreak in OLMo 2 to mislabeled training data

Seeking mentees: new techniques for model diffing and data attribution

Seeking mentees: richer evals to address reward hacking and eval awareness

Tied Crosscoders: Tracing How Chat LLM Behavior Emerges from Base Model

Hacker group house in Palo Alto

Learning community Zoom to teach each other new things 1-on-1 (sign in!)