DeepSeek-Coder
Code to generate data
Thank you for the best code model to date!
Would it be possible to share the pre-training data generation code?
Data Creation
Step 1: Collect code data from GitHub and filter it with the same rules as StarCoder Data.
Step 2: Parse the dependencies of files within the same repository and rearrange the file positions based on those dependencies.
Step 3: Concatenate dependent files to form a single example and apply repo-level MinHash for deduplication.
Step 4: Further filter out low-quality code, such as code with syntax errors or poor readability.
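Since the actual pipeline is not public, here is only a rough sketch of what Steps 2 to 4 might look like for a Python-only repository. Everything in it is an assumption rather than DeepSeek's implementation: the helper names (`order_by_dependency`, `build_example`, `has_valid_syntax`) are hypothetical, imports are detected with a simple regex, the syntax check uses Python's `ast` module, and the repo-level MinHash deduplication is left out entirely.

```python
import ast
import re
from graphlib import CycleError, TopologicalSorter

# Assumption: a top-level "import foo" or "from foo import bar" marks a dependency.
# Relative imports and dotted package paths are ignored in this sketch.
IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([A-Za-z_]\w*)", re.MULTILINE)


def order_by_dependency(files: dict[str, str]) -> list[str]:
    """Order file names so each file comes after the local files it imports."""
    # Map module name ("utils") to file name ("utils.py").
    modules = {name.removesuffix(".py"): name for name in files}
    graph = {}
    for name, source in files.items():
        deps = {modules[m] for m in IMPORT_RE.findall(source) if m in modules}
        deps.discard(name)  # a file does not depend on itself
        graph[name] = deps
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        # Circular imports have no valid order; fall back to sorted names.
        return sorted(files)


def build_example(files: dict[str, str]) -> str:
    """Concatenate a repository's files, dependencies first, into one example."""
    ordered = order_by_dependency(files)
    return "\n".join(f"# file: {name}\n{files[name]}" for name in ordered)


def has_valid_syntax(source: str) -> bool:
    """Return True if the source parses as Python (a crude quality filter)."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


repo = {
    "main.py": "import utils\nprint(utils.f())\n",
    "utils.py": "def f():\n    return 1\n",
}
example = build_example(repo)
keep = has_valid_syntax(example)
```

A real pipeline would need a parser per language, and "poor readability" would need heuristics beyond a syntax check.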
Hello, there are currently no plans to open-source the pre-training code.