Code to generate data

Open: tbressers opened this issue 2 years ago • 1 comment

Thank you for the best code model to date!

Would it be possible to share the pre-training data generation code?

Data Creation

Step 1: Collect code data from GitHub and filter it with the same rules as StarCoder Data.
Step 2: Parse the dependencies between files in the same repository and reorder the files according to those dependencies.
Step 3: Concatenate dependent files into a single example and apply repo-level MinHash for deduplication.
Step 4: Filter out remaining low-quality code, such as code with syntax errors or poor readability.
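
For readers who want a feel for Steps 2 and 3, here is a minimal sketch in Python. It is not the DeepSeek-Coder pipeline, which has not been released. Everything in it is an assumption made for illustration: the regex-based import detection, the helper names (`order_by_dependencies`, `build_example`, `minhash_signature`, `estimated_jaccard`), the whitespace shingling, and the parameter defaults.

```python
import hashlib
import re
from graphlib import TopologicalSorter

# Assumption: dependencies are found with a naive regex over Python imports.
# A real pipeline would need per-language parsing.
IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([A-Za-z_]\w*)", re.MULTILINE)


def order_by_dependencies(files: dict[str, str]) -> list[str]:
    """Return file names so that each file follows the files it imports."""
    modules = {name.removesuffix(".py"): name for name in files}
    graph = {}
    for name, source in files.items():
        deps = {modules[m] for m in IMPORT_RE.findall(source) if m in modules}
        deps.discard(name)  # ignore self-references
        graph[name] = deps
    # static_order() yields dependencies before dependents.
    # It raises graphlib.CycleError on circular imports, which this sketch
    # does not handle.
    return list(TopologicalSorter(graph).static_order())


def build_example(files: dict[str, str]) -> str:
    """Concatenate a repository's files in dependency order."""
    return "\n".join(files[name] for name in order_by_dependencies(files))


def minhash_signature(text: str, num_perm: int = 64, shingle_size: int = 5) -> list[int]:
    """Compute a MinHash signature over whitespace-token shingles."""
    tokens = text.split()
    count = max(1, len(tokens) - shingle_size + 1)
    shingles = {" ".join(tokens[i:i + shingle_size]) for i in range(count)}
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # one salted hash per "permutation"
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingles
        ))
    return signature


def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


repo = {
    "a.py": "import b\nx = 1",
    "b.py": "import c\ny = 2",
    "c.py": "z = 3",
}
example = build_example(repo)
signature = minhash_signature(example)
```

Two repository-level examples whose signatures give a high `estimated_jaccard` score would be treated as near-duplicates and one of them dropped. The threshold for "high" is a tuning choice that the steps above do not specify.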

tbressers, Mar 01 '24 20:03

Hello, there are currently no plans to open-source the pre-training code.

guoday, Mar 12 '24 02:03