
Got stuck at dataset = load_dataset("openwebtext")

Open hanfluid opened this issue 2 years ago • 1 comments

The last lines of output are:

Downloading builder script: 2.86kB [00:00, 3.19MB/s]
Downloading builder script: 2.86kB [00:00, 3.01MB/s]
Downloading builder script: 2.86kB [00:00, 3.07MB/s]
Downloading builder script: 2.86kB [00:00, 2.45MB/s]
Downloading builder script: 2.86kB [00:00, 2.98MB/s]
Downloading metadata: 1.15kB [00:00, 1.24MB/s]
Downloading builder script: 2.86kB [00:00, 3.08MB/s]
Downloading metadata: 1.15kB [00:00, 1.47MB/s]
Downloading metadata: 1.15kB [00:00, 1.21MB/s]
Downloading metadata: 1.15kB [00:00, 1.45MB/s]
Downloading metadata: 1.15kB [00:00, 1.49MB/s]
Downloading metadata: 1.15kB [00:00, 1.11MB/s]

and after that the script just hangs indefinitely.

hanfluid avatar Jan 12 '23 22:01 hanfluid

It might be a better idea to remove the datasets dependency, and just download/parse the files directly.

I downloaded it from here: https://skylion007.github.io/OpenWebTextCorpus/

I'm still working on building an IterableDataset for training :)

Here is a draft of my code to read the files:

import tarfile

import tqdm


def read_all_files(filename_tar: str):
    """Yield one record per .txt document in the nested OpenWebText archives.

    The corpus ships as a single tar whose members are .xz files; each .xz
    member is itself a tar archive of plain-text documents.
    """
    with tarfile.open(filename_tar, mode="r", encoding="utf-8") as inside_tar:
        # getnames() scans the whole archive up front so tqdm can show a total
        for xz_file in tqdm.tqdm(inside_tar, total=len(inside_tar.getnames())):
            with inside_tar.extractfile(xz_file) as inside_xz:
                with tarfile.open(
                    fileobj=inside_xz, mode="r:xz", encoding="utf-8"
                ) as txt_directory:
                    for txt_file in txt_directory:
                        yield {
                            "xz_file": xz_file.name,
                            "filename": txt_file.name,
                            "data": txt_directory.extractfile(txt_file)
                            .read()
                            .decode("utf-8"),
                        }
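Since the reader is a plain generator, you can sanity-check the nested-archive walk on a tiny synthetic file before pointing it at the full multi-GB corpus. The sketch below is my own test harness, not part of nanoGPT: `make_fake_corpus` and `read_records` are hypothetical names, and `read_records` repeats the same tar-inside-tar.xz pattern as `read_all_files` (minus the tqdm bar) so the snippet is self-contained. The same generator could then be wrapped in a `torch.utils.data.IterableDataset` for training.

```python
import io
import tarfile
import tempfile


def make_fake_corpus(path: str) -> None:
    """Write a miniature OpenWebText-style archive: a plain outer tar whose
    single member is an .xz-compressed tar containing one text document."""
    inner_buf = io.BytesIO()
    with tarfile.open(fileobj=inner_buf, mode="w:xz") as inner:
        body = b"hello openwebtext"
        info = tarfile.TarInfo(name="0001.txt")
        info.size = len(body)
        inner.addfile(info, io.BytesIO(body))
    inner_bytes = inner_buf.getvalue()

    with tarfile.open(path, mode="w") as outer:
        info = tarfile.TarInfo(name="subset00.xz")
        info.size = len(inner_bytes)
        outer.addfile(info, io.BytesIO(inner_bytes))


def read_records(filename_tar: str):
    """Same nested-archive walk as read_all_files, without the progress bar."""
    with tarfile.open(filename_tar, mode="r") as outer:
        for xz_file in outer:
            with outer.extractfile(xz_file) as inside_xz:
                with tarfile.open(fileobj=inside_xz, mode="r:xz") as inner:
                    for txt_file in inner:
                        yield {
                            "xz_file": xz_file.name,
                            "filename": txt_file.name,
                            "data": inner.extractfile(txt_file)
                            .read()
                            .decode("utf-8"),
                        }


# Round-trip: build the fake corpus in a temp file, then stream it back.
with tempfile.NamedTemporaryFile(suffix=".tar") as tmp:
    make_fake_corpus(tmp.name)
    records = list(read_records(tmp.name))
```

One caveat worth noting: the extracted member file object is only seekable because the outer tar is opened in regular (non-streaming) mode; with a `"r|"` stream mode the inner `"r:xz"` open would fail.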

vgoklani avatar Jan 13 '23 01:01 vgoklani