litgpt
Error in "_merge_no_wait": The config isn't consistent between chunks. This shouldn't have happened.
Hello,
I am pretraining TinyLlama on a Lightning AI Studio on my custom dataset. I am using prepare_starcoder.py
to convert the parquet files, because my data is a single folder of parquet files. After it writes the .bin
files, it raises the error shown in the section I have commented out below.
Error:
raise Exception("The config isn't consistent between chunks. This shouldn't have happened.")
File location:
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/data/streaming/writer.py
I could not resolve the problem, so I commented the check out and trained the model. However, I want to make sure this does not have any negative effect. I would appreciate it if you could look into the issue.
def _merge_no_wait(self, node_rank: Optional[int] = None) -> None:
    """Once all the workers have written their own index, the merge function is responsible to read and merge them
    into a single index."""
    files = os.listdir(self._cache_dir)
    index_files = [f for f in files if f.endswith(_INDEX_FILENAME)]

    chunks_info = []
    config = None
    for index_filename in sorted(index_files):
        chunk_path = os.path.join(self._cache_dir, index_filename)
        with open(chunk_path) as f:
            data = json.load(f)

            if config is None:
                config = data["config"]

            # elif config != data["config"]:
            #     print(config)
            #     print("\n\n\n")
            #     print(data['config'])
            #     breakpoint()
            #     raise Exception("The config isn't consistent between chunks. This shouldn't have happened.")

            chunks_info.extend(data["chunks"])

        os.remove(chunk_path)

    if node_rank is None:
        with open(os.path.join(self._cache_dir, _INDEX_FILENAME), "w") as f:
            json.dump({"chunks": chunks_info, "config": config}, f, sort_keys=True)
    else:
        with open(os.path.join(self._cache_dir, f"{node_rank}-{_INDEX_FILENAME}"), "w") as f:
            json.dump({"chunks": chunks_info, "config": config}, f, sort_keys=True)
cc @tchaton or @awaelchli
@eljanmahammadli Are you using one of our Studio templates for this? Would you mind sharing your prepare_starcoder.py
implementation?
Hey @eljanmahammadli, did you use the LitData App to prepare your data? This happens if the type of your data isn't deterministic among workers.
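To illustrate the point about deterministic types: every sample yielded from prepare_item should have the same type and dtype on every worker, because the chunk config is inferred from the samples themselves. A minimal sketch of normalizing the dtype before yielding (assuming the tokenizer returns a torch.Tensor; this helper is not part of litgpt) might look like this:

import torch

def normalize_tokens(tokens: torch.Tensor) -> torch.Tensor:
    # Cast every sample to one fixed dtype so the inferred data format
    # cannot differ between workers or between individual samples.
    return tokens.to(torch.int64)

Inside prepare_item one would then yield normalize_tokens(self.tokenizer.encode(text, bos=False, eos=True)) instead of the raw tensor.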
@awaelchli I am using the "Pretrain LLMs - TinyLlama 1.1B" template from the Studio. Below is the code with minimal changes: I have only changed the column name, and my data is just one .parquet file.
import os
import sys
import time
import traceback
from pathlib import Path

import pyarrow.parquet as pq
from lightning.data.streaming import DataChunkRecipe, DataProcessor

# support running without installing as a package
wd = Path(__file__).parent.parent.resolve()
sys.path.append(str(wd))

from lit_gpt import Tokenizer


class StarcoderDataRecipe(DataChunkRecipe):
    def __init__(self, tokenizer: Tokenizer, chunk_size: int):
        super().__init__(chunk_size)
        self.tokenizer = tokenizer

    def prepare_structure(self, input_dir):
        files = Path(input_dir).rglob("*.parquet")
        print(files)
        return [str(file) for file in files]

    def prepare_item(self, item_metadata):
        filepath = item_metadata
        start = time.time()
        try:
            parquet_file = pq.ParquetFile(filepath)
            # reduce RAM usage
            for batch in parquet_file.iter_batches(batch_size=8192, columns=["text"]):
                for text in batch.to_pandas()["text"]:
                    yield self.tokenizer.encode(text, bos=False, eos=True)
        except Exception:
            print(traceback.format_exc())
            print(f"Error reading {filepath}")
            return

        parquet_file.close()
        end = time.time()
        print(f"Took {end - start:.2f} seconds total", filepath)


def prepare(
    input_dir: Path = Path("data/starcoderdata"),
    output_dir: Path = Path("data/starcoder"),
    tokenizer_path: Path = Path("checkpoints/Llama-2-7b-hf/"),
    chunk_size: int = (2049 * 8192),
    fast_dev_run: bool = False,
) -> None:
    tokenizer = Tokenizer(tokenizer_path)
    data_recipe = StarcoderDataRecipe(tokenizer=tokenizer, chunk_size=chunk_size)
    data_processor = DataProcessor(
        input_dir=str(input_dir),
        output_dir=str(output_dir),
        fast_dev_run=fast_dev_run,
        num_workers=os.cpu_count(),
        num_downloaders=1,
    )

    start_time = time.time()
    data_processor.run(data_recipe)
    elapsed_time = time.time() - start_time
    print(f"Time taken: {elapsed_time:.2f} seconds")


if __name__ == "__main__":
    from jsonargparse import CLI

    CLI(prepare)
Besides, I am using my own tokenizer, which I trained with the code below following the HuggingFace tutorial.
def get_training_corpus():
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["text"]

training_corpus = get_training_corpus()
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 32000)
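For completeness, the trained tokenizer would then typically be saved and optionally pushed to the Hub so it can be fetched later with download.py (as in the reproduction steps below); the local directory name here is arbitrary and the repository name mirrors the one used in those steps.

# Save the newly trained Hugging Face tokenizer locally.
tokenizer.save_pretrained("simhash_dedup_tokenizer_1_5M")

# Optionally push it to the Hugging Face Hub for reuse.
# tokenizer.push_to_hub("eljanmahammadli/simhash_dedup_tokenizer_1_5M")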
@eljanmahammadli Do you think you could provide a full reproducible script with synthetic data, so I can debug it?
I am working in the "Pretrain LLMs - TinyLlama 1.1B" Studio on Lightning AI Studios.
First, we get the custom tokenizer:
python lit-gpt/scripts/download.py \
--repo_id eljanmahammadli/simhash_dedup_tokenizer_1_5M \
--access_token HF_TOKEN_HERE \
--tokenizer_only true
Then clone the data:
git clone https://huggingface.co/datasets/eljanmahammadli/sample_data data/sample-raw
You have to change the column name to "text" on the line below in lit-gpt/scripts/prepare_starcoder.py:
for batch in parquet_file.iter_batches(batch_size=8192, columns=["text"]):
Finally, use prepare_starcoder.py to convert the parquet files to .bin files; this is the step where the error occurs:
python lit-gpt/scripts/prepare_starcoder.py \
--input_dir data/sample-raw/data \
--output_dir data/sample \
--tokenizer_path checkpoints/simhash_dedup_tokenizer_1_5M
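For a fully synthetic reproduction, as requested above, something along these lines could generate a small parquet file with a "text" column in the layout the recipe expects; the file name and row count are arbitrary.

import os
import pandas as pd

# Write a tiny synthetic parquet file with a single "text" column.
os.makedirs("data/sample-raw/data", exist_ok=True)
df = pd.DataFrame({"text": [f"synthetic document number {i}" for i in range(10_000)]})
df.to_parquet("data/sample-raw/data/part-000.parquet")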
I want to know what effect this error has on training the model, since @tchaton pointed out there is an inconsistency between data types.
Hey @eljanmahammadli,
This error indicates the StreamingDataset won't know which de-serializers to use during training and would fail at some point when it reaches the outlier samples.
The optimize script prints the inferred types when it starts processing. Did you see any anomalies?
Do you think you could invite me (thomasgridai) to your Teamspace, so I can duplicate your Studio and try to figure out the source of the bug?
Best, T.C
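As a quick sanity check of what the StreamingDataset will actually read, one can also open the merged index.json written to the output directory and print its config and chunk count. The path below assumes the --output_dir used earlier, and the exact contents of "config" depend on the lightning.data version; the "chunks" and "config" keys themselves match what _merge_no_wait writes above.

import json

# Inspect the merged index produced by _merge_no_wait in the output directory.
with open("data/sample/index.json") as f:
    index = json.load(f)

print(index["config"])                 # inferred config metadata
print(len(index["chunks"]), "chunks")  # number of chunks written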
I don't see any option to specify the username for sharing. Could you please elaborate?
Hey @eljanmahammadli
If you go to your Teamspace Settings and click on the Members tab, you can invite people to your Teamspace.
Hey @tchaton. Regarding the quote "would fail at some point when reaching the outlier samples": until you spot any bugs, am I good to go ahead and train the model as long as it does not fail?