
about "PackedDataset"


Hi everyone,

Do you have resources that could help me understand "PackedDataset"? I'm trying to implement two things.

(1) A multiprocessing script for tokenization, which is done. I implemented the script with the help of this code: https://github.com/togethercomputer/RedPajama-Data/blob/main/data_prep/c4/c4_reformat.py. I used the multiprocessing Pool package, and for every process I changed the prefix name of the builder object to DatasetPrefix_ProcessID_Counter (example: c4_42424_00000001).

Here are snippets from the code:

    import logging
    from functools import partial
    from math import ceil
    from multiprocessing import Pool

    # packed_dataset, tokenizer, destination_path, set_name, chunk_size,
    # filenames and num_cpus come from the surrounding script
    builder = packed_dataset.PackedDatasetBuilder(
        outdir=destination_path,
        prefix=set_name,
        chunk_size=chunk_size,
        sep_token=tokenizer.eos_id,
        dtype="auto",
        vocab_size=tokenizer.vocab_size,
    )

    # Bind builder and tokenizer so the pool only has to map over filenames
    func = partial(prepare_one_file, builder, tokenizer)
    n_tasks_per_chunk = ceil(len(filenames) / num_cpus)
    logging.info(f"Using {n_tasks_per_chunk} iterations")
    with Pool(num_cpus) as p:
        tokens_length = sum(p.map(func, filenames, chunksize=n_tasks_per_chunk))

The prepare_one_file function:

    import json
    import logging
    import os

    import numpy as np

    def prepare_one_file(builder, tokenizer, filepath):
        # Give each worker its own prefix so chunk files from different
        # processes don't overwrite each other
        builder._prefix = "c4_" + str(os.getpid())
        text_length = 0
        tokens_length = 0
        line = 0

        try:
            with open(filepath, encoding="utf-8") as f:
                for row in f:
                    try:
                        line += 1
                        text = json.loads(row)["text"]
                        text_length += len(text.split(" "))
                        text_ids = tokenizer.encode(text, bos=True)
                        tokens_length += len(text_ids)
                        builder.add_array(np.array(text_ids, dtype=builder.dtype))
                    except Exception:
                        logging.info(f"Error in file: {filepath} line NO: {line}")
        except Exception:
            logging.info(f"Error in opening file: {filepath}")

        logging.info(
            f"Done writing file: {filepath}, text_size: {text_length}, "
            f"tokens_size: {tokens_length}, lines: {line}"
        )

        return tokens_length

What do you think about the implementation? Does it store the data incorrectly in memory? (From my understanding, the data should be laid out sequentially in memory(?))
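For context, here is a minimal sketch of how I currently understand PackedDatasetBuilder to lay out tokens, based on my reading of packed_dataset.py. It is simplified (as far as I can tell, the real class also writes a small binary header with a magic string, version, dtype code and chunk size before the token bytes), so treat the names and details as approximations:

    import os
    import numpy as np

    class MiniPackedBuilder:
        """Simplified sketch: tokens accumulate in one contiguous, pre-allocated buffer per chunk."""

        def __init__(self, outdir, prefix, chunk_size, sep_token, dtype=np.uint16):
            self._outdir = outdir
            self._prefix = prefix
            self._chunk_size = chunk_size
            self._sep_token = sep_token
            self._counter = 0
            self._idx = 0
            # Flat buffer; any unfilled tail keeps the sep_token padding
            self._arr = np.full(chunk_size, sep_token, dtype=dtype)

        def _write_chunk(self):
            # Every full buffer becomes its own numbered .bin file
            # (zero-padded counter; the exact width is approximate here)
            name = f"{self._prefix}_{self._counter:08d}.bin"
            with open(os.path.join(self._outdir, name), "wb") as f:
                f.write(self._arr.tobytes(order="C"))  # one contiguous run of bytes
            self._counter += 1
            self._arr.fill(self._sep_token)
            self._idx = 0

        def add_array(self, arr):
            # Copy tokens into the buffer, flushing whenever it fills up;
            # a long document simply spills over into the next chunk
            while self._idx + len(arr) > self._chunk_size:
                part_len = self._chunk_size - self._idx
                self._arr[self._idx : self._idx + part_len] = arr[:part_len]
                self._write_chunk()
                arr = arr[part_len:]
            self._arr[self._idx : self._idx + len(arr)] = arr
            self._idx += len(arr)

So each builder writes sequentially into its own buffer; my worry is what happens when several worker processes share one builder object.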

(2) Saving the last file seen by the model, for resuming training: I'm not sure whether this is possible, since the data is read in parallel. Besides, from my understanding of the code, the data is shuffled at the beginning of training, so it is not read sequentially. Is this correct?
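If the shuffle is really driven by a fixed seed (PackedDataset takes a seed argument), one workaround I'm considering is to not track filenames at all: rebuild the dataloader with the same seed and rank settings on restart, so it replays the same order, and then fast-forward past the batches already consumed. A rough sketch; batches_already_seen is a hypothetical counter that I would save in the training checkpoint:

    import itertools

    def fast_forward(dataloader, batches_already_seen):
        # Assumes the dataloader was rebuilt with the same seed, num_processes
        # and process_rank, so it reproduces the exact same shuffled order.
        # islice discards the first `batches_already_seen` batches, then yields the rest.
        return itertools.islice(iter(dataloader), batches_already_seen, None)

    # Usage: save batches_already_seen alongside the model weights, then
    # train_iter = fast_forward(train_dataloader, batches_already_seen)

The obvious cost is that the skipped chunks are still read and decoded at startup, so this trades restart time for simplicity.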

LamOne1 · Jun 13 '23 10:06