litgpt
Error in "_merge_no_wait": The config isn't consistent between chunks. This shouldn't have happened.
Hello,
I am pretraining TinyLlama on a Lightning AI Studio on my custom dataset. I am using prepare_starcoder.py
to convert the parquet files, because my data is a single folder of parquet files. After it writes the .bin
files, it raises the error shown in the section I have commented out below.
Error:
raise Exception("The config isn't consistent between chunks. This shouldn't have happened.")
File location:
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/data/streaming/writer.py
I could not resolve the problem, so I commented the check out and trained the model. However, I want to make sure this does not have any negative effect. I would appreciate it if you could look into the issue.
def _merge_no_wait(self, node_rank: Optional[int] = None) -> None:
    """Once all the workers have written their own index, the merge function is responsible to read and merge them
    into a single index."""
    files = os.listdir(self._cache_dir)
    index_files = [f for f in files if f.endswith(_INDEX_FILENAME)]

    chunks_info = []
    config = None
    for index_filename in sorted(index_files):
        chunk_path = os.path.join(self._cache_dir, index_filename)
        with open(chunk_path) as f:
            data = json.load(f)

            if config is None:
                config = data["config"]

            # elif config != data["config"]:
            #     print(config)
            #     print("\n\n\n")
            #     print(data['config'])
            #     breakpoint()
            #     raise Exception("The config isn't consistent between chunks. This shouldn't have happened.")

            chunks_info.extend(data["chunks"])

        os.remove(chunk_path)

    if node_rank is None:
        with open(os.path.join(self._cache_dir, _INDEX_FILENAME), "w") as f:
            json.dump({"chunks": chunks_info, "config": config}, f, sort_keys=True)
    else:
        with open(os.path.join(self._cache_dir, f"{node_rank}-{_INDEX_FILENAME}"), "w") as f:
            json.dump({"chunks": chunks_info, "config": config}, f, sort_keys=True)
cc @tchaton or @awaelchli
@eljanmahammadli Are you using one of our Studio templates for this? Would you mind sharing your prepare_starcoder.py
implementation?
Hey @eljanmahammadli, did you use the LitData App to prepare your data? This happens if the type of your data isn't deterministic among workers.
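To illustrate the point about deterministic types: every sample yielded from prepare_item should have the same type and dtype on every worker, because the chunk config is inferred from the samples themselves. A minimal sketch of normalizing the dtype before yielding (assuming the tokenizer returns a torch.Tensor; this helper is not part of litgpt) might look like this:

import torch

def normalize_tokens(tokens: torch.Tensor) -> torch.Tensor:
    # Cast every sample to one fixed dtype so the inferred data format
    # cannot differ between workers or between individual samples.
    return tokens.to(torch.int64)

Inside prepare_item one would then yield normalize_tokens(self.tokenizer.encode(text, bos=False, eos=True)) instead of the raw tensor.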
@awaelchli I am using the "Pretrain LLMs - TinyLlama 1.1B" template from the Studio. Below is the code with minimal changes: I have only changed the column name, and my data is just one .parquet file.
import os
import sys
import time
import traceback
from pathlib import Path

import pyarrow.parquet as pq
from lightning.data.streaming import DataChunkRecipe, DataProcessor

# support running without installing as a package
wd = Path(__file__).parent.parent.resolve()
sys.path.append(str(wd))

from lit_gpt import Tokenizer


class StarcoderDataRecipe(DataChunkRecipe):
    def __init__(self, tokenizer: Tokenizer, chunk_size: int):
        super().__init__(chunk_size)
        self.tokenizer = tokenizer

    def prepare_structure(self, input_dir):
        files = Path(input_dir).rglob("*.parquet")
        print(files)
        return [str(file) for file in files]

    def prepare_item(self, item_metadata):
        filepath = item_metadata
        start = time.time()
        try:
            parquet_file = pq.ParquetFile(filepath)
            # reduce RAM usage
            for batch in parquet_file.iter_batches(batch_size=8192, columns=["text"]):
                for text in batch.to_pandas()["text"]:
                    yield self.tokenizer.encode(text, bos=False, eos=True)
        except Exception:
            print(traceback.format_exc())
            print(f"Error reading {filepath}")
            return

        parquet_file.close()
        end = time.time()
        print(f"Took {end - start:.2f} seconds total", filepath)


def prepare(
    input_dir: Path = Path("data/starcoderdata"),
    output_dir: Path = Path("data/starcoder"),
    tokenizer_path: Path = Path("checkpoints/Llama-2-7b-hf/"),
    chunk_size: int = (2049 * 8192),
    fast_dev_run: bool = False,
) -> None:
    tokenizer = Tokenizer(tokenizer_path)
    data_recipe = StarcoderDataRecipe(tokenizer=tokenizer, chunk_size=chunk_size)
    data_processor = DataProcessor(
        input_dir=str(input_dir),
        output_dir=str(output_dir),
        fast_dev_run=fast_dev_run,
        num_workers=os.cpu_count(),
        num_downloaders=1,
    )

    start_time = time.time()
    data_processor.run(data_recipe)
    elapsed_time = time.time() - start_time
    print(f"Time taken: {elapsed_time:.2f} seconds")


if __name__ == "__main__":
    from jsonargparse import CLI

    CLI(prepare)
Besides, I am using my own tokenizer, which I trained with the code below following the HuggingFace tutorial.
def get_training_corpus():
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["text"]

training_corpus = get_training_corpus()
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 32000)
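For completeness, the trained tokenizer would then typically be saved and optionally pushed to the Hub so it can be fetched later with download.py (as in the reproduction steps below); the local directory name here is arbitrary and the repository name mirrors the one used in those steps.

# Save the newly trained Hugging Face tokenizer locally.
tokenizer.save_pretrained("simhash_dedup_tokenizer_1_5M")

# Optionally push it to the Hugging Face Hub for reuse.
# tokenizer.push_to_hub("eljanmahammadli/simhash_dedup_tokenizer_1_5M")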
@eljanmahammadli Do you think you could provide a full reproducible script with synthetic data, so I can debug it?
I am working in the "Pretrain LLMs - TinyLlama 1.1B" Studio on Lightning AI Studios.
First, we get the custom tokenizer:
python lit-gpt/scripts/download.py \
--repo_id eljanmahammadli/simhash_dedup_tokenizer_1_5M \
--access_token HF_TOKEN_HERE \
--tokenizer_only true
Then clone the data:
git clone https://huggingface.co/datasets/eljanmahammadli/sample_data data/sample-raw
You have to change the column name to "text" on the line below in lit-gpt/scripts/prepare_starcoder.py:
for batch in parquet_file.iter_batches(batch_size=8192, columns=["text"]):
Finally, use prepare_starcoder.py to convert the parquet files to .bin files; this is the step where the error occurs:
python lit-gpt/scripts/prepare_starcoder.py \
--input_dir data/sample-raw/data \
--output_dir data/sample \
--tokenizer_path checkpoints/simhash_dedup_tokenizer_1_5M
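For a fully synthetic reproduction, as requested above, something along these lines could generate a small parquet file with a "text" column in the layout the recipe expects; the file name and row count are arbitrary.

import os
import pandas as pd

# Write a tiny synthetic parquet file with a single "text" column.
os.makedirs("data/sample-raw/data", exist_ok=True)
df = pd.DataFrame({"text": [f"synthetic document number {i}" for i in range(10_000)]})
df.to_parquet("data/sample-raw/data/part-000.parquet")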
I want to know what effect this error has on training the model, since @tchaton pointed out there is an inconsistency between data types.
Hey @eljanmahammadli,
This error indicates the StreamingDataset won't know which de-serializers to use during training and would fail at some point when it reaches the outlier samples.
The optimize script prints the inferred types when it starts processing. Did you see any anomalies?
Do you think you could invite me (thomasgridai) to your Teamspace, so I can duplicate your Studio and try to figure out the source of the bug?
Best, T.C
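As a quick sanity check of what the StreamingDataset will actually read, one can also open the merged index.json written to the output directory and print its config and chunk count. The path below assumes the --output_dir used earlier, and the exact contents of "config" depend on the lightning.data version; the "chunks" and "config" keys themselves match what _merge_no_wait writes above.

import json

# Inspect the merged index produced by _merge_no_wait in the output directory.
with open("data/sample/index.json") as f:
    index = json.load(f)

print(index["config"])                 # inferred config metadata
print(len(index["chunks"]), "chunks")  # number of chunks written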
I don't see any option to specify the username for sharing. Could you please elaborate?
Hey @eljanmahammadli
If you go to your Teamspace Settings and click on the Members tab, you can invite people to your Teamspace.
Hey @tchaton. Regarding the quote "would fail at some point when reaching the outlier samples": until you spot any bugs, am I good to go ahead and train the model as long as it does not fail?