DataChunkRecipe is not working when used in litgpt's TinyLlama pretraining example

Open wen020 opened this issue 1 year ago • 5 comments

DataChunkRecipe is not working when used in litgpt's TinyLlama pretraining example. The error is: AttributeError: 'SlimPajamaDataRecipe' object has no attribute 'is_generator'. SlimPajamaDataRecipe is a subclass of DataChunkRecipe, and I found that DataChunkRecipe itself has no 'is_generator' attribute.

wen020 • May 15 '24 04:05

Hi! Thanks for your contribution! Great first issue!

github-actions[bot] • May 15 '24 04:05

Hey @wen020, you can resolve this simply by making a contribution to LitGPT to add the missing property on the SlimPajamaDataRecipe.

Here: https://github.com/Lightning-AI/litgpt/blob/main/litgpt/data/prepare_slimpajama.py#L13

It should be:

import json
from pathlib import Path

# Tokenizer and DataChunkRecipe are assumed to be imported as in the linked file.


class SlimPajamaDataRecipe(DataChunkRecipe):
    is_generator = True  # missing attribute: prepare_item yields its results

    def __init__(self, tokenizer: Tokenizer, chunk_size: int):
        super().__init__(chunk_size)
        self.tokenizer = tokenizer

    def prepare_structure(self, input_dir):
        # Collect every compressed shard under input_dir.
        files = Path(input_dir).rglob("*.zst")
        return [str(file) for file in files]

    def prepare_item(self, filepath):
        import zstandard as zstd

        with zstd.open(open(filepath, "rb"), "rt", encoding="utf-8") as f:
            for row in f:
                record = json.loads(row)  # parse each JSON line only once
                if record["meta"]["redpajama_set_name"] == "RedPajamaGithub":
                    continue  # exclude the GitHub data since it overlaps with starcoder
                text_ids = self.tokenizer.encode(record["text"], bos=False, eos=True)
                yield text_ids
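
To illustrate why a single class attribute is enough to clear the AttributeError, here is a minimal, self-contained sketch. It does not use litdata or LitGPT: the DataChunkRecipe below is a hypothetical stand-in that, like the one described in this issue, defines no is_generator, and process is a hypothetical consumer that reads the flag. The real library code may differ.

```python
class DataChunkRecipe:
    """Hypothetical stand-in for the base class; it defines no is_generator."""

    def __init__(self, chunk_size: int):
        self.chunk_size = chunk_size


class BrokenRecipe(DataChunkRecipe):
    """Mirrors the failing recipe: a generator prepare_item, but no flag."""

    def prepare_item(self, text: str):
        for token in text.split():
            yield token


class FixedRecipe(BrokenRecipe):
    # Class attribute: attribute lookup on any instance now finds it.
    is_generator = True


def process(recipe, item):
    """Hypothetical consumer that checks the flag before calling prepare_item."""
    if recipe.is_generator:  # raises AttributeError when the attribute is absent
        return list(recipe.prepare_item(item))
    return [recipe.prepare_item(item)]


if __name__ == "__main__":
    print(process(FixedRecipe(chunk_size=8), "a b c"))
    try:
        process(BrokenRecipe(chunk_size=8), "a b c")
    except AttributeError as err:
        print(err)
```

Because the attribute is declared on the class rather than set in __init__, it is available even if a subclass overrides __init__ without calling super().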

tchaton • May 15 '24 07:05

ok

wen020 • May 21 '24 09:05

@wen020 Do you want to make a PR to LitGPT to fix this issue?

tchaton • May 21 '24 10:05

I will make a PR to LitGPT.

wen020 • May 22 '24 01:05

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] • Apr 16 '25 06:04

This has been resolved.

bhimrazy • Apr 17 '25 05:04