
GitHub Diffs

Open herbiebradley opened this issue 2 years ago • 5 comments

A scraper for GitHub diffs. It takes as input a JSONL file containing, for each commit, the hash, the commit message, and the repository name as a string.
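As a rough illustration of that input format, here is a minimal sketch of reading such a JSONL file and deriving a diff URL per commit. The field names (`commit`, `message`, `repo_name`) and the helper names are assumptions made for this example, not necessarily what the scraper uses. The URL relies on GitHub serving a commit's diff when `.diff` is appended to the commit page URL.

```python
import json
from typing import Iterator


def iter_commits(jsonl_text: str) -> Iterator[dict]:
    """Yield one commit record per non-empty JSONL line."""
    for line in jsonl_text.splitlines():
        line = line.strip()
        if line:
            yield json.loads(line)


def diff_url(record: dict) -> str:
    """Build the URL of the plain-text diff for one commit.

    Assumes hypothetical field names: 'repo_name' is 'owner/repo'
    and 'commit' is the commit hash.
    """
    return f"https://github.com/{record['repo_name']}/commit/{record['commit']}.diff"


# Two made-up records standing in for the real input file.
sample = "\n".join(
    json.dumps(rec)
    for rec in [
        {"commit": "abc123", "message": "Fix typo", "repo_name": "octocat/Hello-World"},
        {"commit": "def456", "message": "Add tests", "repo_name": "octocat/Spoon-Knife"},
    ]
)

commits = list(iter_commits(sample))
urls = [diff_url(c) for c in commits]
```

Each resulting URL could then be fetched and the diff text stored alongside the commit message.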

This uses PyArrow via dask to save to parquet, which makes it easily parallelisable and gives low memory usage.

See #31

herbiebradley, Oct 07 '22 12:10