lit-llama
about "PackedDataset"
Hi everyone,
Do you have resources that could help me understand "PackedDataset"? I'm trying to implement two things: (1) A multiprocessing script for tokenization, which is done. I implemented the script with the help of this code: https://github.com/togethercomputer/RedPajama-Data/blob/main/data_prep/c4/c4_reformat.py I used a multiprocessing Pool and, for every process, changed the prefix name of the builder object to DatasetPrefix_ProcessID_Counter (example: c4_42424_00000001).
Here are some snippets from the code:
```python
import logging
from functools import partial
from math import ceil
from multiprocessing import Pool

from lit_llama import packed_dataset

builder = packed_dataset.PackedDatasetBuilder(
    outdir=destination_path,
    prefix=set_name,
    chunk_size=chunk_size,
    sep_token=tokenizer.eos_id,
    dtype="auto",
    vocab_size=tokenizer.vocab_size,
)

# builder and tokenizer are pickled into each worker process by the Pool
func = partial(prepare_one_file, builder, tokenizer)
n_tasks_per_chunk = ceil(len(filenames) / num_cpus)
logging.info(f"Using {n_tasks_per_chunk} iterations")
with Pool(num_cpus) as p:
    tokens_length = sum(p.map(func, filenames, chunksize=n_tasks_per_chunk))
```
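If I read `packed_dataset.py` correctly, each worker's copy of the builder writes chunk files named from its prefix plus a running counter, so the per-process prefixes should keep workers from overwriting each other's files. A rough sketch of the filenames I expect on disk (the exact counter padding and extension are my assumption from reading the code):

```python
import os

# Assumption: PackedDatasetBuilder writes "<prefix>_<counter>.bin" files into outdir,
# so two workers with pid-based prefixes would produce something like:
#   c4_42424_0000000000.bin, c4_42424_0000000001.bin, ...
#   c4_42425_0000000000.bin, c4_42425_0000000001.bin, ...
prefix = "c4_" + str(os.getpid())  # set inside each worker
expected = [f"{prefix}_{i:010d}.bin" for i in range(3)]
print(expected)
```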
And the prepare_one_file function:
```python
import json
import os

import numpy as np


def prepare_one_file(builder, tokenizer, filepath):
    # give this worker's copy of the builder its own prefix so chunk files don't collide
    builder._prefix = "c4_" + str(os.getpid())
    text_length = 0
    tokens_length = 0
    line = 0
    try:
        with open(filepath, encoding="utf-8") as f:
            for row in f:
                try:
                    line += 1
                    text = json.loads(row)["text"]
                    text_length += len(text.split(" "))
                    text_ids = tokenizer.encode(text, bos=True)
                    tokens_length += len(text_ids)
                    builder.add_array(np.array(text_ids, dtype=builder.dtype))
                except Exception:
                    logging.info("Error in file: {} line NO: {}".format(filepath, line))
    except Exception:
        logging.info("Error in opening file: {}".format(filepath))
    logging.info("Done writing file: {}, text_size: {}, tokens_size: {}, lines: {}".format(filepath, text_length, tokens_length, line))
    return tokens_length
```
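One thing I'm still unsure about: since every worker holds its own pickled copy of the builder, I suspect each worker's last, partially filled chunk has to be flushed explicitly before the function returns. A sketch of what I mean, assuming `write_reminder()` is the right flush method on `PackedDatasetBuilder` (as far as I can tell it pads the remainder with the sep token and writes it out):

```python
    # Assumption: flush this worker's partially filled last chunk before returning,
    # otherwise the tokens still buffered in this process may never reach disk.
    builder.write_reminder()
    return tokens_length
```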
What do you think about the implementation? Could it store the data incorrectly in memory? (From my understanding, the data should end up laid out sequentially.)
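To check this, my plan is to read one worker's chunk files back with `PackedDataset` and spot-check that the blocks decode to sensible text. Something like the sketch below; the constructor arguments and import path are from my reading of `packed_dataset.py`, so they may need adjusting:

```python
from pathlib import Path

from lit_llama import packed_dataset  # adjust the import to your checkout layout

# Assumption: PackedDataset takes the list of chunk files and yields block_size-token blocks.
filenames = sorted(str(p) for p in Path(destination_path).glob("c4_*.bin"))
dataset = packed_dataset.PackedDataset(
    filenames, n_chunks=1, block_size=2048, shuffle=False
)
for i, block in enumerate(dataset):
    print(tokenizer.decode(block)[:200])  # spot-check the first few blocks
    if i >= 2:
        break
```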
(2) Saving the last file seen by the model so that training can be resumed: I'm not sure this is possible, since the data is read in parallel. Also, from my understanding of the code, the data is shuffled at the beginning of training, so it is not read sequentially. Is this correct?
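For now, the fallback I'm considering for (2) is the generic approach of checkpointing the number of training steps (plus the dataloader seed) and fast-forwarding the iterator on resume, rather than tracking the exact file. A rough sketch (not lit-llama specific; it assumes the data pipeline is deterministic for a fixed seed):

```python
import itertools


def resume_dataloader(make_dataloader, seed, resume_step):
    # Rebuild the dataloader with the same seed so the shuffled order is reproduced,
    # then lazily skip the batches the model has already seen.
    loader = make_dataloader(seed=seed)
    return itertools.islice(iter(loader), resume_step, None)
```

Would that be a reasonable workaround, or is there a built-in way to resume the PackedDataset iterator?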