
How to load multiple text files in LineByLineTextDataset?

Open mv96 opened this issue 2 years ago • 3 comments


Hi everyone,

I am a bit new to the Hugging Face environment. I was trying to pretrain a model from scratch, taking some inspiration from this post:

tutorial link

Question: can I pass multiple text files to construct the dataset?

dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path='MyData.tsv', block_size=128)

Also, could someone explain to me what block_size means?

Does it mean it will load 128 lines at a time to construct a batch of the dataset?

mv96 avatar Sep 21 '22 09:09 mv96

LineByLineTextDataset does not seem to provide such functionality. I think you can either combine those TSVs yourself, or extend the class with something similar to this:

from typing import Dict, List, Union
import logging
import os
import warnings

import torch
from torch.utils.data import Dataset
from transformers import PreTrainedTokenizer

logger = logging.getLogger(__name__)


class LineByLineTextDataset(Dataset):
    """
    This will be superseded by a framework-agnostic approach soon.
    """

    def __init__(self, tokenizer: PreTrainedTokenizer, file_paths: Union[str, List[str]], block_size: int):
        warnings.warn(
            "This class is deprecated; preprocessing should be handled with the 🤗 Datasets library. "
            "See https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py",
            FutureWarning,
        )
        # Accept either a single path or a list of paths.
        if isinstance(file_paths, str):
            file_paths = [file_paths]
        for file in file_paths:
            if not os.path.isfile(file):
                raise ValueError(f"Input file path {file} not found")
        # Here, we do not cache the features, operating under the assumption
        # that we will soon use fast multithreaded tokenizers from the
        # `tokenizers` repo everywhere =)
        logger.info(f"Creating features from dataset files at {file_paths}")

        # Gather the non-empty lines from every file into a single list.
        all_lines = []
        for file in file_paths:
            with open(file, encoding="utf-8") as f:
                lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
            all_lines.extend(lines)

        # Each line becomes one example, truncated to at most block_size tokens.
        batch_encoding = tokenizer(all_lines, add_special_tokens=True, truncation=True, max_length=block_size)
        self.examples = batch_encoding["input_ids"]
        self.examples = [{"input_ids": torch.tensor(e, dtype=torch.long)} for e in self.examples]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        return self.examples[i]
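With the extended class above, usage would look something like this (just a sketch; the tokenizer checkpoint and the file names a.tsv / b.tsv are placeholders for your own):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_paths=["a.tsv", "b.tsv"],  # a single path string also works
    block_size=128,
)
print(len(dataset), dataset[0]["input_ids"].shape)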

Regarding batch_encoding = tokenizer(all_lines, add_special_tokens=True, truncation=True, max_length=block_size): it seems like block_size controls the maximum number of tokens per example after encoding, so if a line is too long it will just be truncated.
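You can check this quickly (sketch; assumes any pretrained tokenizer, bert-base-uncased here):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_line = "word " * 1000  # far more than 128 tokens
ids = tokenizer(long_line, add_special_tokens=True, truncation=True, max_length=128)["input_ids"]
print(len(ids))  # 128: everything past block_size tokens is dropped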

ZurabDz avatar Sep 21 '22 11:09 ZurabDz

Hi @ZurabDz,

Thanks for the help.

So I will combine them to get a single TSV file, something like the sketch below.
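A minimal sketch for the merge (the data/ directory and output file name are placeholders for my own paths):

import glob

# Concatenate every .tsv in the data directory into one file, skipping blank lines.
with open("MyData_combined.tsv", "w", encoding="utf-8") as out:
    for path in sorted(glob.glob("data/*.tsv")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    out.write(line if line.endswith("\n") else line + "\n")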

mv96 avatar Sep 21 '22 12:09 mv96

I think you should close the issue if it's resolved.

ZurabDz avatar Sep 21 '22 19:09 ZurabDz