Code-Pile
GitHub Diffs
A scraper for GitHub diffs. It takes a JSONL file that contains, for each commit, the hash, the commit message, and the repository name as a string.
This uses PyArrow via Dask to save to Parquet, which makes it easy to parallelise and keeps memory usage low.
See #31
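As a rough illustration of the input format described above, here is a minimal sketch that reads a JSONL stream with one commit per line. The key names (`commit`, `message`, `repo_name`), the `CommitRecord` type, and the `read_commits` helper are assumptions made for this example and may not match what the scraper actually expects.

```python
import io
import json
from typing import Iterator, NamedTuple, TextIO


class CommitRecord(NamedTuple):
    # Hypothetical field names; the real keys in the input file may differ.
    commit_hash: str
    message: str
    repo_name: str


def read_commits(lines: TextIO) -> Iterator[CommitRecord]:
    """Yield one CommitRecord per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        row = json.loads(line)
        yield CommitRecord(row["commit"], row["message"], row["repo_name"])


if __name__ == "__main__":
    # A made-up record, used only to show the assumed shape of one line.
    sample = io.StringIO(
        '{"commit": "0123abc", "message": "Fix typo", '
        '"repo_name": "octocat/hello-world"}\n'
    )
    for record in read_commits(sample):
        print(record)
```

Reading lazily, one line at a time, keeps memory use flat regardless of file size, which fits the low-memory goal stated above.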