OpenCodeReasoning: Advancing Data Distillation for Competitive Coding
Data Overview
OpenCodeReasoning is the largest reasoning-based synthetic dataset to date for coding, comprising 735,255 Python samples across 28,319 unique competitive programming questions. It is designed for supervised fine-tuning (SFT).
Technical Report - Discover the methodology and technical details behind OpenCodeReasoning.
GitHub Repo - Access the complete pipeline used to perform SFT.
This dataset is ready for commercial/non-commercial use.
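The two figures in the overview imply an average number of samples per question. The short sketch below only works through that arithmetic; the constant and function names are illustrative and are not part of the dataset or its tooling.

```python
# Figures quoted in the overview above.
TOTAL_SAMPLES = 735_255
UNIQUE_QUESTIONS = 28_319


def mean_samples_per_question(total_samples: int, unique_questions: int) -> float:
    """Return the average number of samples per unique question."""
    return total_samples / unique_questions


avg = mean_samples_per_question(TOTAL_SAMPLES, UNIQUE_QUESTIONS)
print(f"{avg:.2f} samples per question on average")
```

This works out to about 26 samples per question. It is only a mean: the overview does not say how samples are distributed across individual questions.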
Data distribution
The CodeForces problems are sourced from http://codeforces.com.
The question collections are gathered from TACO (https://huggingface.co/datasets/BAAI/TACO), APPS (https://huggingface.co/datasets/codeparrot/apps), CodeContests (https://huggingface.co/datasets/deepmind/code_contests), and open-r1/codeforces (https://huggingface.co/datasets/open-r1/codeforces).
We do not include the test splits of CodeContests and open-r1/codeforces.
The output responses are generated by R1.
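The exclusion of test splits described above can be sketched as a simple filter over question records. This is a minimal illustration, not the authors' pipeline: the record layout (the `source`, `split`, and `id` fields) and the helper names are assumptions made for the example.

```python
from typing import Iterable

# Sources whose test split is left out, per the text above.
EXCLUDED_TEST_SOURCES = {"CodeContests", "open-r1/codeforces"}


def keep_question(record: dict) -> bool:
    """Return False for questions from the test split of an excluded source.

    The "source" and "split" keys are hypothetical field names.
    """
    is_test = record["split"] == "test"
    return not (is_test and record["source"] in EXCLUDED_TEST_SOURCES)


def filter_questions(records: Iterable[dict]) -> list[dict]:
    """Keep only the questions that pass keep_question."""
    return [r for r in records if keep_question(r)]


# Made-up records, for illustration only.
sample = [
    {"source": "TACO", "split": "train", "id": "q1"},
    {"source": "CodeContests", "split": "test", "id": "q2"},
    {"source": "CodeContests", "split": "train", "id": "q3"},
    {"source": "open-r1/codeforces", "split": "test", "id": "q4"},
    {"source": "APPS", "split": "train", "id": "q5"},
]

kept = filter_questions(sample)
print([r["id"] for r in kept])
```

Filtering on the (source, split) pair rather than on the source alone keeps the non-test questions from CodeContests and open-r1/codeforces, which matches the wording above.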