LLM360/MegaMath
We introduce MegaMath, an open math pretraining dataset curated from diverse, math-focused sources, with over 300B tokens. MegaMath is curated via the following three efforts:
Revisiting web data: We re-extracted mathematical documents from Common Crawl using math-oriented HTML optimizations, fastText-based filtering, and deduplication, all to acquire higher-quality data from the Internet.
Recalling math-related code data: We identified high-quality math-related code from the large code training corpus Stack-V2, further enhancing data diversity.
Exploring Synthetic data: We synthesized QA-style text, math-related code, and interleaved text-code blocks from web data or code data.
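The web pipeline above combines classifier-based filtering with deduplication. As a minimal sketch of how such a stage might look, the snippet below pairs exact deduplication (content hashing) with a scoring gate; the `math_score` function here is a hypothetical keyword heuristic standing in for the actual trained fastText classifier, which is not reproduced in this card.

```python
import hashlib

def math_score(doc: str) -> float:
    # Hypothetical stand-in for a trained fastText math classifier;
    # NOT the actual MegaMath model. Returns a rough "math-ness" score.
    math_markers = ("\\frac", "equation", "theorem", "=", "integral")
    hits = sum(marker in doc for marker in math_markers)
    return min(1.0, hits / 3)

def filter_and_dedup(docs, threshold=0.5):
    """Keep documents scoring above threshold, dropping exact duplicates."""
    seen_hashes = set()
    kept = []
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate: skip
        seen_hashes.add(digest)
        if math_score(doc) >= threshold:
            kept.append(doc)
    return kept

docs = [
    "We prove the theorem using the equation x = y.",
    "We prove the theorem using the equation x = y.",  # exact duplicate
    "Cooking tips for a weeknight dinner.",            # non-math page
]
print(filter_and_dedup(docs))
```

In a production pipeline the exact-hash step would typically be complemented by near-duplicate detection (e.g. MinHash), and the scoring gate would call the trained classifier rather than a keyword heuristic.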
MegaMath Compared to Existing Datasets
MegaMath is the largest open math pre-training dataset to date, surpassing DeepSeekMath (120B tokens).