nanoGPT
Generalize encode/decode for datasets
This fixes a TODO by allowing each dataset to supply its own encoding/decoding scheme. To do so, I switched from pickle to dill, which extends pickle so that functions can be pickled together with the globals they reference. dill is already a dependency of the datasets library, so this doesn't add any new dependencies.
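For illustration, here is a minimal sketch of the idea rather than the code in this PR. The helper `make_char_codec` and the `encode`/`decode` keys in the meta dict are hypothetical names chosen for the example; the only dill calls used are `dill.dumps` and `dill.loads`.

```python
import dill  # third-party package; assumed to be installed

def make_char_codec(text):
    """Hypothetical helper: build a character-level encode/decode pair."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for i, ch in enumerate(chars)}

    def encode(s):
        return [stoi[c] for c in s]

    def decode(ids):
        return "".join(itos[i] for i in ids)

    return encode, decode, len(chars)

encode, decode, vocab_size = make_char_codec("hello world")

# Assumed layout: the dataset's metadata carries the functions themselves.
meta = {"vocab_size": vocab_size, "encode": encode, "decode": decode}

# Plain pickle cannot serialize these nested functions; dill stores them
# by value, including the stoi/itos tables captured in their closures.
payload = dill.dumps(meta, recurse=True)
loaded = dill.loads(payload)

ids = loaded["encode"]("hello")
roundtrip = loaded["decode"](ids)
```

With something like this, a sampling script would only need to load the metadata and call whatever `encode`/`decode` the dataset provided, without knowing which tokenizer is behind them. As with pickle, loading a dill file executes arbitrary code, so it should only be done with files you trust.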
This PR also includes some gitignore additions that I found
necessary for my usage. I can alter the entries, remove them from this
PR, or split them into a separate PR, as you prefer. Probably the most
controversial addition is data/*/samples/*, since nothing in this
codebase currently references that directory layout. I was using
directories like that to save sample prompts for datasets. Happy to
drop it if you'd rather not include it.
I did something similar in my own work and would recommend it.