litdata
DataChunkRecipe is not working when used in litgpt's TinyLlama pretraining example
Error: AttributeError: 'SlimPajamaDataRecipe' object has no attribute 'is_generator'
SlimPajamaDataRecipe is a subclass of DataChunkRecipe, and I found that DataChunkRecipe objects have no attribute 'is_generator' either.
Hi! Thanks for your contribution! Great first issue!
Hey @wen020, you can resolve this simply by making a contribution to LitGPT to add the missing property on the SlimPajamaDataRecipe.
Here: https://github.com/Lightning-AI/litgpt/blob/main/litgpt/data/prepare_slimpajama.py#L13
It should be:
class SlimPajamaDataRecipe(DataChunkRecipe):
    is_generator = True  # missing attribute

    def __init__(self, tokenizer: Tokenizer, chunk_size: int):
        super().__init__(chunk_size)
        self.tokenizer = tokenizer

    def prepare_structure(self, input_dir):
        files = Path(input_dir).rglob("*.zst")
        return [str(file) for file in files]

    def prepare_item(self, filepath):
        import zstandard as zstd

        with zstd.open(open(filepath, "rb"), "rt", encoding="utf-8") as f:
            for row in f:
                text = json.loads(row)["text"]
                if json.loads(row)["meta"]["redpajama_set_name"] == "RedPajamaGithub":
                    continue  # exclude the GitHub data since it overlaps with starcoder
                text_ids = self.tokenizer.encode(text, bos=False, eos=True)
                yield text_ids
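For anyone hitting the same traceback, here is a minimal, library-free sketch of why a class-level flag fixes it. The names `BaseRecipe`, `BrokenRecipe`, `FixedRecipe`, and `run` are hypothetical stand-ins, not litdata or LitGPT APIs. The only assumption, taken from the error message, is that the code consuming the recipe reads `recipe.is_generator`.

```python
class BaseRecipe:
    """Hypothetical base class that, like the one in the report, does not define is_generator."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size


class BrokenRecipe(BaseRecipe):
    # No is_generator attribute here, so reading it on an instance fails.
    def prepare_item(self, item):
        yield item


class FixedRecipe(BaseRecipe):
    is_generator = True  # class-level flag, visible on every instance

    def prepare_item(self, item):
        for token in item.split():
            yield token


def run(recipe, items):
    """Hypothetical consumer: branches on recipe.is_generator (assumed from the traceback)."""
    out = []
    for item in items:
        if recipe.is_generator:
            out.extend(recipe.prepare_item(item))  # drain the generator
        else:
            out.append(recipe.prepare_item(item))  # single return value
    return out
```

Calling `run(BrokenRecipe(4), ["a b"])` raises the same kind of AttributeError as in the report, while `run(FixedRecipe(4), ["a b", "c"])` goes through the generator branch. Declaring the flag on the subclass is enough because attribute lookup on an instance falls back to its class.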
ok
@wen020 Do you want to make a PR to LitGPT to fix this issue?
I will make a PR to LitGPT.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This has been resolved.