Stories by thw20
Simple, zero-overhead way to compress model and KV cache via Low-Rank Decomposition
1 point by thw20 on 2026-05-13T10:19:53Z (jeffreywong20.github.io)
Towards understanding multiple attention sinks in LLMs
1 point by thw20 on 2026-03-14T11:15:40Z (github.com)
The Existence and Behavior of Secondary Attention Sinks
1 point by thw20 on 2026-02-20T12:00:11Z (arxiv.org)