litdata DataChunkRecipe is not working when used in litgpt's TinyLlama pretraining example

Open wen020 opened this issue 1 year ago • 5 comments

DataChunkRecipe does not work when used in litgpt's TinyLlama pretraining example. The error is: AttributeError: 'SlimPajamaDataRecipe' object has no attribute 'is_generator'. SlimPajamaDataRecipe is a subclass of DataChunkRecipe, and I found that DataChunkRecipe itself has no attribute 'is_generator' either.

May 15 '24 04:05 wen020

Hi! Thanks for your contribution! Great first issue!

May 15 '24 04:05 github-actions[bot]

Hey @wen020, you can resolve this simply by making a contribution to LitGPT to add the missing property on the SlimPajamaDataRecipe.

Here: https://github.com/Lightning-AI/litgpt/blob/main/litgpt/data/prepare_slimpajama.py#L13

It should be:

import json
from pathlib import Path

# DataChunkRecipe and Tokenizer come from the module's existing imports.


class SlimPajamaDataRecipe(DataChunkRecipe):
    is_generator = True  # missing attribute

    def __init__(self, tokenizer: Tokenizer, chunk_size: int):
        super().__init__(chunk_size)
        self.tokenizer = tokenizer

    def prepare_structure(self, input_dir):
        # Collect every compressed shard under input_dir.
        files = Path(input_dir).rglob("*.zst")
        return [str(file) for file in files]

    def prepare_item(self, filepath):
        import zstandard as zstd

        with zstd.open(open(filepath, "rb"), "rt", encoding="utf-8") as f:
            for row in f:
                record = json.loads(row)  # parse each JSON line once
                if record["meta"]["redpajama_set_name"] == "RedPajamaGithub":
                    continue  # exclude the GitHub data since it overlaps with starcoder
                text_ids = self.tokenizer.encode(record["text"], bos=False, eos=True)
                yield text_ids
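
To see why a single class attribute is enough, here is a minimal, self-contained sketch that does not use litdata at all. `BaseRecipe`, `collect_items`, `BrokenRecipe`, `FixedRecipe`, and `try_collect` are hypothetical names made up for illustration, and the way `collect_items` branches on `is_generator` is only an assumption about how a consumer might read the flag, not litdata's actual code.

```python
class BaseRecipe:
    """Hypothetical stand-in for DataChunkRecipe; it defines no is_generator."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size


def collect_items(recipe, inputs):
    """Hypothetical consumer that reads recipe.is_generator (assumed behavior)."""
    items = []
    for item in inputs:
        if recipe.is_generator:  # raises AttributeError if nothing defines it
            items.extend(recipe.prepare_item(item))
        else:
            items.append(recipe.prepare_item(item))
    return items


class BrokenRecipe(BaseRecipe):
    """Mirrors the reported bug: prepare_item yields, but the flag is missing."""

    def prepare_item(self, text):
        for word in text.split():
            yield word


class FixedRecipe(BrokenRecipe):
    # A class-level attribute is found by normal attribute lookup on instances.
    is_generator = True


def try_collect(recipe, inputs):
    """Return the collected items, or the error message if the lookup fails."""
    try:
        return collect_items(recipe, inputs)
    except AttributeError as exc:
        return str(exc)


print(try_collect(BrokenRecipe(2), ["a b"]))
print(try_collect(FixedRecipe(2), ["a b", "c"]))
```

The first call reports the same kind of AttributeError as in the issue, and the second returns the flattened items, because attribute lookup on an instance falls back to the class.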

May 15 '24 07:05 tchaton

May 21 '24 09:05 wen020

@wen020 Do you want to make a PR to LitGPT to fix this issue?

May 21 '24 10:05 tchaton

I will make a PR to LitGPT.

May 22 '24 01:05 wen020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Apr 16 '25 06:04 stale[bot]

This has been resolved.

Apr 17 '25 05:04 bhimrazy
