transformers
How to load multiple text files in LineByLineTextDataset?
Hi everyone,
I am a bit new to the Hugging Face environment. I was trying to pretrain a model from scratch, taking some inspiration from this post.
Question: can I pass all the text files to construct the dataset?
dataset = LineByLineTextDataset( tokenizer=tokenizer, file_path='MyData.tsv', block_size=128 )
Also, could someone explain what the block size means?
Does it mean it will load 128 lines at a time to construct a batch of the dataset?
LineByLineTextDataset does not seem to provide such functionality. I think you can either combine those TSVs yourself, or extend the class with something similar to this:
from typing import Dict, List, Union
import logging
import os
import warnings

import torch
from torch.utils.data import Dataset
from transformers import PreTrainedTokenizer

logger = logging.getLogger(__name__)


class LineByLineTextDataset(Dataset):
    """
    This will be superseded by a framework-agnostic approach soon.
    """

    def __init__(self, tokenizer: PreTrainedTokenizer, file_paths: Union[str, List[str]], block_size: int):
        warnings.warn(
            "This class is deprecated and will be removed; see "
            "https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py",
            FutureWarning,
        )
        # Accept either a single path or a list of paths.
        if isinstance(file_paths, str):
            file_paths = [file_paths]
        for file in file_paths:
            if not os.path.isfile(file):
                raise ValueError(f"Input file path {file} not found")

        # Here, we do not cache the features, operating under the assumption
        # that we will soon use fast multithreaded tokenizers from the
        # `tokenizers` repo everywhere =)
        logger.info(f"Creating features from dataset files at {file_paths}")

        # Read the non-empty lines from every file into a single list.
        all_lines = []
        for file in file_paths:
            with open(file, encoding="utf-8") as f:
                lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
            all_lines.extend(lines)

        # Tokenize each line, truncating to at most block_size tokens.
        batch_encoding = tokenizer(all_lines, add_special_tokens=True, truncation=True, max_length=block_size)
        self.examples = batch_encoding["input_ids"]
        self.examples = [{"input_ids": torch.tensor(e, dtype=torch.long)} for e in self.examples]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        return self.examples[i]
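As a quick sanity check, here is a minimal sketch of how the extended class could be used. The tokenizer checkpoint ("roberta-base") and the file names are only placeholders, not part of the original question:

from transformers import AutoTokenizer

# Placeholder checkpoint and file names; substitute your own.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_paths=["train_part1.tsv", "train_part2.tsv"],  # or a single path string
    block_size=128,
)
print(len(dataset), dataset[0]["input_ids"].shape)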
batch_encoding = tokenizer(all_lines, add_special_tokens=True, truncation=True, max_length=block_size)
Here it seems like block_size controls the maximum number of tokens a line can have after encoding, so if a line is too long it will simply be truncated.
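A quick way to see this yourself, assuming a roberta-base tokenizer (the checkpoint name is just an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
long_line = "word " * 500  # deliberately much longer than 128 tokens
encoding = tokenizer(long_line, add_special_tokens=True, truncation=True, max_length=128)
# At most 128: block_size caps tokens per line; it does not mean 128 lines per batch.
print(len(encoding["input_ids"]))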
Hi @ZurabDz,
Thanks for the help.
So I will combine them to get a single TSV file.
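If it helps, here is a minimal sketch of combining several TSVs into one; the file names are placeholders:

# Hypothetical file names; writes every line of each part into one output file.
parts = ["part1.tsv", "part2.tsv", "part3.tsv"]
with open("MyData.tsv", "w", encoding="utf-8") as out:
    for part in parts:
        with open(part, encoding="utf-8") as f:
            for line in f:
                out.write(line.rstrip("\n") + "\n")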
I think you should close the issue if it's resolved.