nanoGPT

Generalize encode/decode for datasets

Open GMNGeoffrey opened this issue 1 year ago • 1 comment

This fixes a TODO to allow arbitrary encoding/decoding schemes for different datasets. To do so, I switched from pickle to dill, which extends pickle to enable things like pickling functions, including their referenced globals. dill is already a dependency of datasets, so this doesn't add any new dependencies.
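To illustrate the motivation, here is a minimal sketch and not the code from this PR. It assumes a hypothetical character-level vocabulary and a hypothetical `meta` dict holding `encode`/`decode` callables; the names `make_codec`, `meta`, and the dict keys are made up for the example. Plain pickle refuses to serialize the closures, while dill (a third-party package, so the sketch skips that step if it is not installed) should round-trip them.

```python
import pickle

# Hypothetical character-level vocabulary, for illustration only.
chars = sorted(set("hello world"))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}


def make_codec(stoi, itos):
    """Build encode/decode closures over the vocabulary tables."""
    def encode(text):
        return [stoi[ch] for ch in text]

    def decode(ids):
        return "".join(itos[i] for i in ids)

    return encode, decode


encode, decode = make_codec(stoi, itos)

# Hypothetical metadata layout: the callables are stored alongside the data.
meta = {"vocab_size": len(chars), "encode": encode, "decode": decode}

# Plain pickle stores functions by reference (module plus qualified name),
# so functions defined inside another function cannot be serialized.
try:
    pickle.dumps(meta)
    pickle_ok = True
except (pickle.PicklingError, AttributeError):
    pickle_ok = False

# dill serializes the function itself, including its captured variables.
try:
    import dill
except ImportError:
    dill = None  # dill not installed; skip the round-trip demonstration

if dill is not None:
    restored = dill.loads(dill.dumps(meta))
    roundtrip = restored["decode"](restored["encode"]("hello"))
```

With this layout, a loader only needs to call `restored["encode"]` and `restored["decode"]` and does not have to know how a given dataset tokenizes its text. As with pickle, loading a dill file executes code from that file, so it should only be done with files from a trusted source.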

This PR also includes some gitignore additions that I found necessary for my usage. I can alter the entries, remove them from this PR, or split them into a separate PR, as you prefer. Probably the most controversial addition is data/*/samples/*, since that's not a directory layout currently referenced anywhere in this codebase. I was using directories like that to save sample prompts for datasets. Happy to drop it if it's not wanted.

GMNGeoffrey, Jan 05 '24 22:01

I did something similar in my own work; recommended.

AutomaticHourglass, Jan 26 '24 18:01