nanoGPT
Got stuck at the dataset = load_dataset("openwebtext") line.
The last outputs are:
Downloading builder script: 2.86kB [00:00, 3.19MB/s]
Downloading builder script: 2.86kB [00:00, 3.01MB/s]
Downloading builder script: 2.86kB [00:00, 3.07MB/s]
Downloading builder script: 2.86kB [00:00, 2.45MB/s]
Downloading builder script: 2.86kB [00:00, 2.98MB/s]
Downloading metadata: 1.15kB [00:00, 1.24MB/s]
Downloading builder script: 2.86kB [00:00, 3.08MB/s]
Downloading metadata: 1.15kB [00:00, 1.47MB/s]
Downloading metadata: 1.15kB [00:00, 1.21MB/s]
Downloading metadata: 1.15kB [00:00, 1.45MB/s]
Downloading metadata: 1.15kB [00:00, 1.49MB/s]
Downloading metadata: 1.15kB [00:00, 1.11MB/s]
and then the script just hangs, pending indefinitely.
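In case anyone wants to keep the datasets route: the repeated "Downloading builder script" lines look like several worker processes hitting the Hugging Face cache at once, so a first thing to try (just a guess on my side) might be forcing a single process:

from datasets import load_dataset

# Guessed workaround: a single download process, so parallel workers
# can't trip over the shared cache. num_proc is a standard
# load_dataset argument in datasets >= 2.7.
dataset = load_dataset("openwebtext", num_proc=1)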
It might be a better idea to remove the datasets
dependency, and just download/parse the files directly.
I downloaded it from here: https://skylion007.github.io/OpenWebTextCorpus/
I'm still working on building an IterableDataset for training :)
Here is a draft of my code to read the files:
import tarfile

import tqdm


def read_all_files(filename_tar: str):
    # openwebtext.tar.xz is a tar of *.xz members, each of which is itself
    # an xz-compressed tar archive of plain-text documents.
    with tarfile.open(filename_tar, mode="r", encoding="utf-8") as inside_tar:
        # getnames() scans the archive once up front so tqdm knows the total.
        for xz_file in tqdm.tqdm(inside_tar, total=len(inside_tar.getnames())):
            if not xz_file.isfile():
                continue  # skip directory entries
            with inside_tar.extractfile(xz_file) as inside_xz:
                with tarfile.open(
                    fileobj=inside_xz, mode="r:xz", encoding="utf-8"
                ) as txt_directory:
                    for txt_file in txt_directory:
                        if not txt_file.isfile():
                            continue
                        yield {
                            "xz_file": xz_file.name,
                            "filename": txt_file.name,
                            "data": txt_directory.extractfile(txt_file)
                            .read()
                            .decode("utf-8"),
                        }
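And a minimal sketch of the IterableDataset I have in mind (the class name and the batch_size=None usage are my own choices, not anything from nanoGPT): it wraps read_all_files so documents stream straight out of the archive without loading the corpus into memory.

from torch.utils.data import DataLoader, IterableDataset


class OpenWebTextStream(IterableDataset):
    # Streams raw document strings from the tar archive via read_all_files,
    # so nothing beyond the current document is held in memory.
    def __init__(self, filename_tar: str):
        self.filename_tar = filename_tar

    def __iter__(self):
        # Note: with num_workers > 0 every worker would replay the whole
        # archive, so sharding by worker id would be needed there.
        for record in read_all_files(self.filename_tar):
            yield record["data"]


# batch_size=None disables automatic batching, handing one document
# (a string) through at a time; tokenization/chunking happens downstream.
loader = DataLoader(OpenWebTextStream("openwebtext.tar.xz"), batch_size=None)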