`get_wikitext2` has a bug
System Info
optimum version 1.21.4 (latest)
```Dockerfile
# Use the official Python image from the Docker Hub
FROM public.ecr.aws/docker/library/python:3.10-slim
```
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction (minimal, reproducible, runnable)
```python
from optimum.gptq.data import get_wikitext2
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
get_wikitext2(tokenizer=tokenizer, nsamples=128, seqlen=32, split="train")
```
This produces the warning:

```
Token indices sequence length is longer than the specified maximum sequence length for this model (73218 > 2048). Running this sequence through the model will result in indexing errors
```
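For context, the warning presumably comes from the current implementation tokenizing the joined split in a single call. It can be reproduced in isolation with the same `data["text"][:1000]` join that appears commented out in the proposed fix below (a minimal standalone sketch, not the library code itself):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Joining ~1000 rows and tokenizing them in one call produces a sequence far
# longer than tokenizer.model_max_length (2048), which is what emits the warning.
text = "".join([" \n" if s == "" else s for s in data["text"][:1000]])
enc = tokenizer(text, return_tensors="pt")
print(enc.input_ids.shape[1], tokenizer.model_max_length)
```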
Expected behavior
This is the proposed fix:
```python
import random
from typing import Any

import torch
from datasets import load_dataset


def get_wikitext2(tokenizer: Any, seqlen: int, nsamples: int, split: str = "train"):
    if split == "train":
        data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    elif split == "validation":
        data = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")

    ## length of 288059 should be enough
    # text = "".join([" \n" if s == "" else s for s in data["text"][:1000]])

    dataset = []
    for _ in range(nsamples):
        # Keep drawing rows until we find one with at least seqlen tokens.
        while True:
            i = random.randint(0, len(data) - 1)
            text = data[i]["text"]
            if len(tokenizer.tokenize(text)) >= seqlen:
                enc = tokenizer(text, return_tensors="pt")
                break
        # Take a random seqlen-token window from the accepted row.
        i = random.randint(0, enc.input_ids.shape[1] - seqlen - 1)
        j = i + seqlen
        inp = enc.input_ids[:, i:j]
        attention_mask = torch.ones_like(inp)
        dataset.append({"input_ids": inp, "attention_mask": attention_mask})
    return dataset
```
Inspired by `get_c4` and `get_c4_new`.
No warning is produced.
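A quick sanity check of the proposed function (a sketch, assuming the definition above has been pasted into the session):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Same call as in the reproduction above, now against the proposed implementation.
samples = get_wikitext2(tokenizer=tokenizer, nsamples=128, seqlen=32, split="train")

assert len(samples) == 128
assert all(s["input_ids"].shape == (1, 32) for s in samples)
assert all(s["attention_mask"].shape == (1, 32) for s in samples)
```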
@SunMarc is there a reason why `get_wikitext2` is different from the other methods?
Not sure. This was something TheBloke coded back then. Maybe this is because `data[i]["text"]` is pretty long, so it takes a while to find a text < seqlen?
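One way to gauge that: count how many wikitext-2 rows tokenize to at least `seqlen` tokens, i.e. how often the `while True` loop in the proposed fix would accept a row on the first draw (a rough sketch):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

seqlen = 32
# Fraction of a sample of rows that the rejection loop would accept immediately.
long_enough = sum(len(tokenizer.tokenize(t)) >= seqlen for t in data["text"][:2000])
print(f"{long_enough}/2000 rows have at least {seqlen} tokens")
```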
> Token indices sequence length is longer than the specified maximum sequence length for this model (73218 > 2048). Running this sequence through the model will result in indexing errors
This does not happen, as we slice the tokenized data afterwards:
```python
i = random.randint(0, enc.input_ids.shape[1] - seqlen - 1)
j = i + seqlen
inp = enc.input_ids[:, i:j]
attention_mask = torch.ones_like(inp)
```
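So the warning only reflects the length of the full tokenized text; every slice that actually reaches the model is `seqlen` tokens long, well under the 2048 limit. A minimal illustration of that slicing (hypothetical input, same logic as above):

```python
import random

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Tokenizing a deliberately long string emits the same warning once...
enc = tokenizer("hello world " * 20000, return_tensors="pt")

# ...but the window handed to the model is only seqlen tokens, so it stays
# well within tokenizer.model_max_length and causes no indexing error.
seqlen = 32
i = random.randint(0, enc.input_ids.shape[1] - seqlen - 1)
inp = enc.input_ids[:, i : i + seqlen]
attention_mask = torch.ones_like(inp)
assert inp.shape[1] == seqlen <= tokenizer.model_max_length
```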