gpt-2-output-dataset
Complete improvements of download script.
I modified the script to use data classes, JSON serialization, and the tqdm library, so the downloaded data is structured and the download shows its progress. It also adds options to specify data sizes, splits, and a target example count. (cool, cool!)
Little list of changes:
- Added a data class (ChatData) for structuring GPT-related data.
- Implemented a JSON encoder (ChatDataEncoder) for custom serialization.
- Created a class (GPTData) to manage data download, processing, and saving.
- Introduced methods for validating data sizes and splits.
- Utilized tqdm for a progress bar during data download.
- Provided options for truncating data based on a target example count.
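For reviewers who want a feel for the first two items, here is a minimal sketch of how a data class and a custom JSON encoder can fit together. The class names match the list above, but the fields (id, text, length) are hypothetical placeholders, and the encoder body is an assumption about one reasonable approach, not necessarily what the script does.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass


@dataclass
class ChatData:
    # Hypothetical fields for illustration; the real class may differ.
    id: int
    text: str
    length: int


class ChatDataEncoder(json.JSONEncoder):
    """Serialize dataclass instances as plain JSON objects."""

    def default(self, o):
        # asdict() only accepts instances, so exclude dataclass types.
        if is_dataclass(o) and not isinstance(o, type):
            return asdict(o)
        # Fall back to the base class, which raises TypeError.
        return super().default(o)


record = ChatData(id=0, text="hello", length=5)
line = json.dumps(record, cls=ChatDataEncoder)
```

Passing the encoder through `cls=` keeps the serialization logic out of the data class itself, so the same records can be written out one per line (JSONL style) without extra conversion code.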
Usage (I thought this was necessary, soooo):
gpt_data = GPTData(target_examples=None)
gpt_data.download_and_save_data(data_size_fn='webtext', split_fn='train')
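To make the arguments in the call above a bit more concrete, here is a rough sketch of what the size/split validation and the target-example truncation could look like. Everything in it is an assumption for illustration: the helper names (build_filename, truncate_examples), the lists of accepted sizes and splits, and the `<size>.<split>.jsonl` naming are not taken from this PR.

```python
from itertools import islice

# Assumed lists of accepted values; check the script for the real ones.
VALID_SIZES = {
    "webtext",
    "small-117M", "small-117M-k40",
    "medium-345M", "medium-345M-k40",
    "large-762M", "large-762M-k40",
    "xl-1542M", "xl-1542M-k40",
}
VALID_SPLITS = {"train", "valid", "test"}


def build_filename(data_size, split):
    """Validate both arguments and return the assumed dataset file name."""
    if data_size not in VALID_SIZES:
        raise ValueError(f"unknown data size: {data_size!r}")
    if split not in VALID_SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return f"{data_size}.{split}.jsonl"


def truncate_examples(examples, target_examples=None):
    """Keep at most target_examples items; None keeps everything."""
    if target_examples is None:
        return list(examples)
    if target_examples < 0:
        raise ValueError("target_examples must be non-negative")
    # islice stops early, so a large iterable is not fully consumed.
    return list(islice(examples, target_examples))
```

Under these assumptions, `build_filename('webtext', 'train')` gives `webtext.train.jsonl`, and `target_examples=None` (as in the usage above) means no truncation.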
Testing:
It works well in my testing: I ran all sizes and splits, plus a range of target example counts, and everything completed without errors on my local machine (Linux).