Stories by hynky
FinePDFs: 3T token dataset made from internet PDFs
3 points
hynky
2025-09-07T07:19:03Z
news.ycombinator.com
FineWeb2: Adapting Pre-Training Data Processing to Every Language
7 points
hynky
2025-06-27T22:52:03Z
arxiv.org
FineWeb2 dataset: A sparkling update with 1000s of languages
2 points
hynky
2024-12-08T10:55:37Z
huggingface.co