LLM360/MegaMath
We introduce MegaMath, an open math pretraining dataset curated from diverse, math-focused sources, with over 300B tokens. MegaMath is curated via the following three efforts:
Revisiting web data: We re-extracted mathematical documents from Common Crawl using math-oriented HTML optimizations, fastText-based filtering, and deduplication, all to acquire higher-quality data from the Internet.
Recalling math-related code data: We identified high-quality math-related code from the large code training corpus Stack-V2, further enhancing data diversity.
Exploring Synthetic data: We synthesized QA-style text, math-related code, and interleaved text-code blocks from web data or code data.
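The web pipeline above combines classifier-based filtering with deduplication. As a minimal sketch of how such a stage might look, the snippet below pairs exact deduplication (content hashing) with a scoring gate; the `math_score` function here is a hypothetical keyword heuristic standing in for the actual trained fastText classifier, which is not reproduced in this card.

```python
import hashlib

def math_score(doc: str) -> float:
    # Hypothetical stand-in for a trained fastText math classifier;
    # NOT the actual MegaMath model. Returns a rough "math-ness" score.
    math_markers = ("\\frac", "equation", "theorem", "=", "integral")
    hits = sum(marker in doc for marker in math_markers)
    return min(1.0, hits / 3)

def filter_and_dedup(docs, threshold=0.5):
    """Keep documents scoring above threshold, dropping exact duplicates."""
    seen_hashes = set()
    kept = []
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate: skip
        seen_hashes.add(digest)
        if math_score(doc) >= threshold:
            kept.append(doc)
    return kept

docs = [
    "We prove the theorem using the equation x = y.",
    "We prove the theorem using the equation x = y.",  # exact duplicate
    "Cooking tips for a weeknight dinner.",            # non-math page
]
print(filter_and_dedup(docs))
```

In a production pipeline the exact-hash step would typically be complemented by near-duplicate detection (e.g. MinHash), and the scoring gate would call the trained classifier rather than a keyword heuristic.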
MegaMath Compared to Existing Datasets
MegaMath is the largest open math pre-training dataset to date, surpassing DeepSeekMath (120B tokens).