
Extending this codebase

Open chris-ha458 opened this issue 1 year ago • 0 comments

I was looking at this codebase and encountered this bit: https://github.com/bigscience-workshop/data-preparation/tree/main/sourcing/code_dataset#code-dataset-sourcing

The query to create the dataset can be found in query.sql. After creation the dataset was preprocessed with processing.py. Note that there is a bug in the script that filters only for GPL licenses instead of filtering them out. There are instructions to remove the bug but it is left there for reproducibility.
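To make the described bug concrete, here is a minimal sketch of what an inverted license filter could look like. This is not the actual code in processing.py: the function names, the `license` field, and the sample rows are all hypothetical, and I am assuming a simple substring match on the license name.

```python
# Hypothetical sketch of the bug described in the README.
# None of these names are taken from processing.py.

def is_gpl(license_name: str) -> bool:
    """Assumed check: treat any license whose name contains 'gpl' as GPL."""
    return "gpl" in license_name.lower()


def keep_only_gpl(rows):
    """The behavior the README calls a bug: only GPL-licensed rows survive."""
    return [row for row in rows if is_gpl(row["license"])]


def drop_gpl(rows):
    """The presumably intended behavior: GPL-licensed rows are filtered out."""
    return [row for row in rows if not is_gpl(row["license"])]


# Made-up sample rows, for illustration only.
rows = [
    {"path": "a.py", "license": "mit"},
    {"path": "b.py", "license": "gpl-3.0"},
    {"path": "c.py", "license": "apache-2.0"},
]

buggy_paths = [row["path"] for row in keep_only_gpl(rows)]
fixed_paths = [row["path"] for row in drop_gpl(rows)]
```

The two functions differ only by a `not`, which is how I picture the bug. As far as I can tell from the README, the first behavior was kept in the script on purpose.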

This leads me to believe that the code here is meant to be used and investigated "as is" and without modification. Is this repo primarily meant for reproducibility?

If I wanted to improve and extend it for an independent dataset-building project, should I fork it or work from a branch?

chris-ha458 • Jun 18 '23 04:06