
litdata with Hugging Face instead of S3

Open ehartford opened this issue 1 year ago • 6 comments

🚀 Feature

I want to use litdata to stream the Hugging Face dataset cerebras/SlimPajama-627B (not from S3).

Motivation

How can I stream a Hugging Face dataset instead of one hosted on S3?

Pitch

I want to stream a Hugging Face dataset, not S3.

Alternatives

Just stream the Hugging Face dataset directly, instead of copying it to S3 first.

Additional context

I want to use a Hugging Face dataset, not S3.

ehartford avatar Mar 08 '24 07:03 ehartford

Hi! Thanks for your contribution, great first issue!

github-actions[bot] avatar Mar 08 '24 07:03 github-actions[bot]

Hey @ehartford. I have already prepared a version of SlimPajama. It is ready to use on the platform.

tchaton avatar Mar 08 '24 08:03 tchaton

Here is the code:

import os

from torch.utils.data import DataLoader
from tqdm import tqdm

from litdata import StreamingDataset, CombinedStreamingDataset
from litdata.streaming.item_loader import TokensLoader

train_datasets = [
    StreamingDataset(
        input_dir="s3://tinyllama-template/slimpajama/train/",
        item_loader=TokensLoader(block_size=2048 + 1),  # Token-optimized loader for LLMs; the +1 lets each block yield inputs plus the shifted next-token targets
        shuffle=True,
        drop_last=True,
    ),
    StreamingDataset(
        input_dir="s3://tinyllama-template/starcoder/",
        item_loader=TokensLoader(block_size=2048 + 1),  # Same block size so samples from both datasets can be batched together
        shuffle=True,
        drop_last=True,
    ),
]

# Mix SlimPajama data and StarCoder data with these proportions:
weights = (0.693584, 0.306416)
combined_dataset = CombinedStreamingDataset(datasets=train_datasets, seed=42, weights=weights)

train_dataloader = DataLoader(combined_dataset, batch_size=8, pin_memory=True, num_workers=os.cpu_count())

# Iterate over the combined datasets
for batch in tqdm(train_dataloader):
    pass

tchaton avatar Mar 08 '24 09:03 tchaton

OK, but wouldn't it be better to support Hugging Face directly instead of having to copy the dataset to S3? AWS charges for ingress and egress.

ehartford avatar Mar 08 '24 09:03 ehartford

OK, but wouldn't it be better to support Hugging Face directly instead of having to copy the dataset to S3?

We have had issues with the stability and reachability of HF-hosted models and datasets in the past, so I would say that S3 is the more reliable alternative...

Borda avatar Mar 08 '24 10:03 Borda

Hey @ehartford. In order to stream datasets, we need to optimize the dataset first. We could have an auto-optimize version for HF datasets, but it would still require downloading the dataset and converting it. Roughly, that convert-then-stream flow would look like the sketch below.
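A minimal sketch of that workflow, assuming litdata's optimize function; the output path and the "text" field name are placeholder assumptions, not part of any built-in HF integration:

from datasets import load_dataset
from litdata import optimize

# Download the HF dataset locally first (it cannot be streamed by litdata directly).
hf_dataset = load_dataset("cerebras/SlimPajama-627B", split="train")

def to_sample(index):
    # Return one record; optimize() serializes it into litdata's chunk format.
    return hf_dataset[index]["text"]  # "text" field assumed for illustration

optimize(
    fn=to_sample,
    inputs=list(range(len(hf_dataset))),
    output_dir="slimpajama-optimized/train",  # placeholder: local dir or s3:// bucket
    chunk_bytes="64MB",
)

Once converted, the chunks can be streamed with StreamingDataset exactly as in the snippet above.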

HF supports some streaming with the webdataset backend, but I gave up on it as it was too unreliable for anything serious: the pipe breaks, it doesn't support multi-node, etc.
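For context, HF-side streaming (here sketched via the datasets library's streaming mode, a relative of the webdataset backend mentioned above) fetches samples shard by shard over HTTP, which is where those reliability problems show up:

from datasets import load_dataset

# Hugging Face's native streaming mode: no local copy, samples arrive over HTTP.
streamed = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)
sample = next(iter(streamed))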

If you are interested in using any particular dataset, I recommend trying out the Lightning AI platform.

Here is an example where I prepared Swedish Wikipedia: https://lightning.ai/lightning-ai/studios/tokenize-2m-swedish-wikipedia-articles

And another one where I prepared SlimPajama & StarCoder: https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset.

Don't hesitate to ask any other questions :)

tchaton avatar Mar 08 '24 13:03 tchaton