
How to optimize a dataset for pretraining from HuggingFace

Open TheLukaDragar opened this issue 10 months ago • 7 comments

Description

I'm trying to optimize a dataset from HuggingFace using LitData for LLM pretraining. The code attempts to tokenize text data and create optimized chunks, but I'm encountering issues with the process.

Current Code

from pathlib import Path
from litdata import optimize, StreamingDataset, StreamingDataLoader
from litgpt.litgpt.tokenizer import Tokenizer
from functools import partial
import pyarrow.parquet as pq
from datasets import load_dataset
from litdata.streaming.item_loader import ParquetLoader

def tokenize_fn(text, tokenizer=None):
    yield tokenizer.encode(text[0][0], bos=False, eos=True)

if __name__ == "__main__":
    hf_dataset = StreamingDataset("hf://datasets/skadooah2/cultura_pretrain/data", 
                                item_loader=ParquetLoader)
    loader = StreamingDataLoader(hf_dataset, batch_size=1, num_workers=1)
    
    training_seq_len = 8192
    chunk_size = training_seq_len + 1
    
    outputs = optimize(
        fn=partial(tokenize_fn, 
                  tokenizer=Tokenizer("./checkpoints/meta-llama/Llama-3.2-3B")),
        inputs=loader,
        output_dir="/home/jakob/llara/pretrain2/",
        chunk_size=(chunk_size * 2048),
        reorder_files=True,
        num_workers=32
    )

I get a "File datasets is not a valid chunk file. It will be ignored." warning.

Is there a more best-practice way of doing this? Thanks!

TheLukaDragar avatar Feb 21 '25 13:02 TheLukaDragar

Hi! Thanks for your contribution, great first issue!

github-actions[bot] avatar Feb 21 '25 13:02 github-actions[bot]

data_processor.py

def _get_folder_size(path: str, config: ChunksConfig) -> int:
    """Collect the size of each files within a folder.

    This method is robust to file deletion races

    """
    size = 0
    for filename in os.listdir(path):
        print(f"Checking file: {filename}")
        if filename in config.filename_to_size_map:
            with contextlib.suppress(FileNotFoundError):
                size += config.filename_to_size_map[filename]
        elif not filename.endswith((".cnt", ".lock", ".json", ".zstd.bin")):
            # ignore .cnt, .lock, .json and .zstd files for warning
            logger.warning(f"File {filename} is not a valid chunk file. It will be ignored.")
    return size

This function is causing the warnings.

This is the log:

File datasets is not a valid chunk file. It will be ignored.
Checking file: index.json
Checking file: datasets

And I get a failure at:

litdata/src/litdata/processing/data_processor.py", line 719, in _handle_data_chunk_recipe
raise RuntimeError(f"Failed processing {self.items[index]=}; {index=}") from e
RuntimeError: Failed processing self.items[index]=4596541; index=306436

I'm interested in whether the code is correct and whether litdata can support optimizing HF datasets like this directly.

One more thing: the text passed into tokenize_fn is an array of tuples, [("my text",)]. I don't know if that's OK.

TheLukaDragar avatar Feb 21 '25 14:02 TheLukaDragar

Got it. Thanks for bringing the issue to our notice; the "Checking file: datasets" line points to the culprit.

Thanks. I'm not sure about the failing part. Looking more into it.

deependujha avatar Feb 21 '25 14:02 deependujha

Btw, would you like to make the PR to fix it?

All you need to do is update this function:

def _get_folder_size(path: str, config: ChunksConfig) -> int:
    """Collect the size of each files within a folder.

    This method is robust to file deletion races

    """
    size = 0
    for filename in os.listdir(path):
        if filename in config.filename_to_size_map:
            with contextlib.suppress(FileNotFoundError):
                size += config.filename_to_size_map[filename]
        elif not filename.endswith((".cnt", ".lock", ".json", ".zstd.bin")):
            # ignore .cnt, .lock, .json and .zstd files for warning
            logger.warning(f"File {filename} is not a valid chunk file. It will be ignored.")
    return size
  • For HF datasets, index.json contains filenames in the format "datasets/open-thoughts/OpenThoughts-114k/data/train-00004-of-00006.parquet".

  • We are using os.listdir, which only lists the entries directly inside the current directory. We need to use os.walk instead; the filename for files in subdirectories will then be of the form subdirectory1/subdirectory2/filename.txt, and that relative path is what should be compared against the config map (see the sketch after this list).
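
For reference, here's a rough, untested sketch of what the walk-based version could look like (assuming the keys in config.filename_to_size_map are forward-slash paths relative to path, and that os, contextlib and logger are already available in data_processor.py):

def _get_folder_size(path: str, config: ChunksConfig) -> int:
    """Collect the size of each file within a folder, including subfolders.

    This method is robust to file deletion races.

    """
    size = 0
    for dirpath, _, filenames in os.walk(path):
        for filename in filenames:
            # Compare the path relative to `path`, e.g.
            # "datasets/open-thoughts/OpenThoughts-114k/data/train-00004-of-00006.parquet",
            # against the keys of the config map.
            rel_path = os.path.relpath(os.path.join(dirpath, filename), path)
            if rel_path in config.filename_to_size_map:
                with contextlib.suppress(FileNotFoundError):
                    size += config.filename_to_size_map[rel_path]
            elif not filename.endswith((".cnt", ".lock", ".json", ".zstd.bin")):
                # ignore .cnt, .lock, .json and .zstd files for warning
                logger.warning(f"File {rel_path} is not a valid chunk file. It will be ignored.")
    return size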

deependujha avatar Feb 21 '25 14:02 deependujha

Hi @TheLukaDragar,

I was able to reproduce the issue you reported—thanks for bringing it up!

Btw, you can safely ignore those "is not a valid chunk file" warnings.


Workaround:

  • You can refer to this Lightning Studio example to optimize the dataset from Hugging Face: 🔗 Optimize 2M Swedish Wikipedia Articles from @tchaton (a rough sketch of the general pattern is included right after this list).

  • And in case you encounter a segmentation fault error while streaming the optimized tokenized dataset, try commenting out these lines:
    https://github.com/Lightning-AI/litdata/blob/f6660be93a97346c4da9feb81011544b93d88c13/src/litdata/streaming/reader.py#L404-L405
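
To make the workaround concrete, here is a rough, untested sketch of the general pattern (not the Studio example verbatim): download the parquet shards locally, then let optimize() tokenize them with a TokensLoader. The repo id, tokenizer checkpoint and output directory are taken from this thread; the "text" column name and the "data/*.parquet" layout are assumptions.

import glob
import os
from functools import partial

import pyarrow.parquet as pq
from huggingface_hub import snapshot_download
from litdata import TokensLoader, optimize
from litgpt.tokenizer import Tokenizer


def tokenize_fn(filepath, tokenizer=None):
    # Stream one parquet shard and yield one 1D token tensor per document.
    parquet_file = pq.ParquetFile(filepath)
    for batch in parquet_file.iter_batches(batch_size=1024, columns=["text"]):
        for text in batch.to_pandas()["text"]:
            yield tokenizer.encode(text, bos=False, eos=True)


if __name__ == "__main__":
    # Download only the parquet shards of the dataset repo.
    local_dir = snapshot_download(
        repo_id="skadooah2/cultura_pretrain",
        repo_type="dataset",
        allow_patterns=["data/*.parquet"],
    )
    files = sorted(glob.glob(os.path.join(local_dir, "data", "*.parquet")))

    training_seq_len = 8192
    optimize(
        fn=partial(tokenize_fn, tokenizer=Tokenizer("./checkpoints/meta-llama/Llama-3.2-3B")),
        inputs=files,  # one item per parquet shard
        output_dir="/home/jakob/llara/pretrain2/",
        chunk_size=(training_seq_len + 1) * 2048,  # number of tokens per chunk
        item_loader=TokensLoader(),  # store tokens as one contiguous 1D array
        num_workers=32,
    )

The optimized output can then be streamed back with TokensLoader(block_size=training_seq_len + 1), as discussed further down in the thread.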

We're actively investigating these issues and will push a fix soon. Appreciate your patience!
Thanks 😊

cc: @tchaton

bhimrazy avatar Feb 24 '25 20:02 bhimrazy

Thanks, we were able to get the dataset going by referring to the example.

We planned to use the litdata dataset for further pretraining with litgpt (v0.5.7), but when trying to run the pretraining on the converted dataset I encountered the following error:

[rank0]: roi.append((0, chunk["dim"] // item_loader._block_size))
[rank0]: TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'

The problem seemed to be that the "dim" of all of the chunks in the index.json file was set to null, e.g.:

{"chunks": [{"chunk_bytes": 13137083940, "chunk_size": 2433844, "dim": null, "filename": "chunk-0-0.bin"},

I am not sure why, but maybe this is causing the issue. After reverting from litdata release 0.2.39 back to v0.2.17, the issue seemed to be resolved.

MatejRojec avatar Feb 26 '25 14:02 MatejRojec

Hi @MatejRojec, thanks for reporting the issue!
In the latest versions, you also need to pass a TokensLoader to optimize() so the tokens are stored correctly for streaming:

from litdata import TokensLoader

# This informs LitData that we're encoding a contiguous 1D token array,
# preventing unnecessary metadata storage.
item_loader = TokensLoader()

You can find more details in the LLM Pre-training section of the README.
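
For example, here's a minimal sketch of the streaming side, assuming the dataset was optimized with item_loader=TokensLoader() as above (the path, block size and batch size are illustrative):

from litdata import StreamingDataLoader, StreamingDataset, TokensLoader

training_seq_len = 8192
train_dataset = StreamingDataset(
    "/home/jakob/llara/pretrain2/",  # the optimized output_dir
    item_loader=TokensLoader(block_size=training_seq_len + 1),  # contiguous windows of tokens
    shuffle=True,
    drop_last=True,
)
train_dataloader = StreamingDataLoader(train_dataset, batch_size=4, num_workers=4)

for batch in train_dataloader:
    # Each batch has shape (batch_size, training_seq_len + 1), ready to be
    # split into inputs and shifted targets for next-token prediction.
    ...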

Let us know if this helps, and feel free to ask if you have any further questions! Thanks.

bhimrazy avatar Feb 26 '25 14:02 bhimrazy