How to load a huge file of data?
Hi guys, are there any plans or tutorials that I can refer to for loading a huge file of data? I have two scenarios in mind that I would probably use:
- Assuming I have a huge dataset that has already been split into several files, how do I load all of those files on demand, one after another?
- Assuming I have a single huge file, how do I load its content on demand?
Thanks guys!
Same question. I have a big file, a large 2-D matrix of numbers, and it runs out of memory when I load it. I don't know how to use it for training.
One solution that I have tried is to use linecache from the Python standard library. I don't know, however, whether this is the correct way; maybe someone can give further comments, and say whether it is possible to integrate it with pytorch/text. Below is an example of a Dataset that I have created using linecache:
import csv
import linecache

from torch.utils.data import Dataset

class LazyTextDataset(Dataset):
    def __init__(self, filename):
        self._filename = filename
        self._total_data = 0
        with open(filename, "r") as f:
            self._total_data = len(f.readlines()) - 1

    def __getitem__(self, idx):
        line = linecache.getline(self._filename, idx + 1)
        csv_line = csv.reader([line])
        .....

    def __len__(self):
        return self._total_data
So far I have experienced no performance drawbacks while training my model with this code, and the GPU was effectively utilized as well.
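For anyone who wants to try this, here is a minimal usage sketch. The file name train.csv is a placeholder, and it assumes __getitem__ is completed to return something the default collate function can batch (for example, the raw line or the parsed fields):

from torch.utils.data import DataLoader

# "train.csv" is a hypothetical CSV file with a header row.
dataset = LazyTextDataset("train.csv")
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in loader:
    # each batch holds whatever __getitem__ returns for 32 lines
    ...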
How can I run this code with 8 GB of RAM?
from nltk.tokenize import word_tokenize
from torchtext import data, datasets
from torchtext.vocab import GloVe

TEXT = data.Field(batch_first=True, tokenize=word_tokenize, lower=True)
LABEL = data.Field(sequential=False, unk_token=None)
train, dev, test = datasets.SNLI.splits(TEXT, LABEL)
TEXT.build_vocab(train, vectors=GloVe(name='840B', dim=300))
LABEL.build_vocab(train)
Loading the GloVe word embeddings results in a memory error. Thanks!
I have an idea for loading data split across multiple files: basically, just create a wrapper around a list of datasets, as follows:
from torchtext.data import TabularDataset

class ListTabularDataset(object):
    def __init__(self, metas):
        self.tabular_datasets = []
        for meta in metas:
            self.tabular_datasets.append(TabularDataset(**meta))

    def __iter__(self):
        for dataset in self.tabular_datasets:
            for x in dataset.examples:
                yield x
This wrapper is really just a simple initial design on my part. I'm thinking of a more sophisticated wrapper that exposes the attributes of the underlying dataset based on the file it currently has open. Any thoughts?
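A usage sketch, assuming hypothetical train-00.csv and train-01.csv splits and that each meta dict simply holds the keyword arguments TabularDataset expects (path, format, fields):

from torchtext.data import Field

TEXT = Field(lower=True)
LABEL = Field(sequential=False)
fields = [("text", TEXT), ("label", LABEL)]

# Hypothetical file names for the pre-split data.
metas = [
    {"path": "train-00.csv", "format": "csv", "fields": fields},
    {"path": "train-01.csv", "format": "csv", "fields": fields},
]

dataset = ListTabularDataset(metas)
for example in dataset:
    print(example.text, example.label)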
For loading the data partially, I have been experimenting with loading partial data directly from Google Cloud Storage. However, after looking more deeply at the implementation of our Dataset, it seems a bit hard to integrate, as most of our concrete classes expect all examples to be loaded when the object is created. Shall I create a separate issue for this one?
I have the same initial request - is there any tutorial on how to load content on demand from a single large file?
You can also collect byte offsets for each line in a large file and store it in a dictionary.
offset_dict = {}
with open(large_file_path, 'rb') as f:
    f.readline()  # move over header
    for line in range(number_of_lines):
        offset = f.tell()
        offset_dict[line] = offset
        f.readline()  # advance to the next line so tell() moves forward
In your Dataset, you will need to seek to the offset and read the line. A Dataset can look like the following:
from torch.utils.data import Dataset

class ExampleDataset(Dataset):
    def __init__(self, large_file_path, offset_dict):
        self.large_file_path = large_file_path
        self.offset_dict = offset_dict

    def __len__(self):
        return len(self.offset_dict)

    def __getitem__(self, line):
        offset = self.offset_dict[line]
        with open(self.large_file_path, 'r', encoding='utf-8') as f:
            f.seek(offset)
            line = f.readline()
        return line
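Putting the two pieces together, a minimal sketch might look like this (large.tsv and the header assumption are hypothetical; the offset loop is a slight variation of the one above that does not need number_of_lines in advance):

from torch.utils.data import DataLoader

large_file_path = "large.tsv"   # hypothetical file with a header row

offset_dict = {}
with open(large_file_path, 'rb') as f:
    f.readline()                # skip the header
    line_no = 0
    while True:
        offset = f.tell()
        if not f.readline():
            break
        offset_dict[line_no] = offset
        line_no += 1

dataset = ExampleDataset(large_file_path, offset_dict)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

for batch in loader:            # each batch is a list of raw lines
    ...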
I'm facing this problem too. Although some interesting ideas have been suggested here, I am struggling to find a good solution that also works with creating fields and the vocabulary. Is there a solution to this problem that lets you build up fields and vocabs as well?
I have this problem when constructing the enwik9 dataset. Based on the discussion in this issue, I have some ideas:
- a list of sub-datasets
- a list of byte offsets, constructing sub-datasets based on the byte offsets (see the sketch after this list)
Any ideas? @fmassa @cpuhrsch @vincentqb @mttk
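To make the second idea a bit more concrete, here is a rough sketch of my own (not an existing torchtext API; the chunk size and the precomputed offset list are assumptions) that slices a list of line offsets into ranges and exposes each range as a sub-dataset:

from torch.utils.data import ConcatDataset, Dataset

class OffsetRangeDataset(Dataset):
    """Exposes the lines whose byte offsets fall in offsets[start:end]."""

    def __init__(self, path, offsets, start, end):
        self.path = path
        self.offsets = offsets[start:end]

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            return f.readline().decode("utf-8")

def build_subdatasets(path, offsets, chunk_size=100_000):
    # One sub-dataset per chunk of offsets, concatenated into a single view.
    return ConcatDataset([
        OffsetRangeDataset(path, offsets, i, min(i + chunk_size, len(offsets)))
        for i in range(0, len(offsets), chunk_size)
    ])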
Right now, Dataset classes usually function by loading and storing data and constructing vocabularies at creation time. To work with datasets too large to fit into memory, the storing part should be made optional. There are two separate issues in this thread: one is working with massive files, and the other is working with multiple input files. I'll comment on the former with respect to the new dataset patterns.
Massive datasets
The new class (subclass?) for large datasets should do the same thing as the regular dataset classes do on the first pass: go through the entire dataset, construct vocabularies, and collect metadata (byte offsets, number of instances), but not store anything in the data or instances attribute of the Dataset class.
The dataset class should have an open file stream in place of the data attribute, and the __iter__ and __getitem__ methods can be implemented to work with a stream in place of a list.
Something like this could illustrate the behavior. This "magic" line_to_instance method would need pre-built vocabularies, which should be constructed in the first pass over the data (similar to what happens right now: https://github.com/pytorch/text/blob/master/torchtext/datasets/text_classification.py#L126).
from torch.utils.data import Dataset

class MassiveDataset(Dataset):
    def __init__(self, data_path, line_to_instance, dataset_metadata):
        """Initiate a text-classification dataset.

        Arguments:
            data_path: path to the file with data.
            line_to_instance: a method converting a line of a file
                to a dataset instance
            dataset_metadata: information required to imitate an in-memory
                dataset (length, offsets, ...)
        """
        self.data_path = data_path
        # should be reset in __iter__
        self.data_stream = open(data_path, 'r')
        self.current_offset = 0
        self.meta = dataset_metadata
        self.line_to_instance = line_to_instance

    def __len__(self):
        # 'length' is the number of instances collected in the first pass
        return self.meta['length']

    def __getitem__(self, line):
        offset = self.meta['offset_dict'][line]
        self.data_stream.seek(offset)
        line = self.data_stream.readline()
        instance = self.line_to_instance(line)
        # reset to previous location for iteration
        self.data_stream.seek(self.current_offset)
        return instance

    def __next__(self):
        line = self.data_stream.readline()
        self.current_offset = self.data_stream.tell()
        return self.line_to_instance(line)
This lacks the logic of resetting the file stream, but it should illustrate the idea.
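For the first pass mentioned above, the metadata construction could look roughly like the sketch below. The tokenizer and the plain Counter are placeholders of mine, not the actual torchtext vocabulary machinery, but the resulting dict matches what the MassiveDataset sketch expects in dataset_metadata:

from collections import Counter

def build_metadata(data_path, tokenize=str.split):
    """One pass over the file: collect byte offsets and token counts."""
    counter = Counter()
    offsets = []
    with open(data_path, "rb") as f:
        offset = f.tell()
        line = f.readline()
        while line:
            offsets.append(offset)
            counter.update(tokenize(line.decode("utf-8").strip()))
            offset = f.tell()
            line = f.readline()
    return {
        "length": len(offsets),
        "offset_dict": dict(enumerate(offsets)),
        "counter": counter,   # can be turned into a vocabulary afterwards
    }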
cc @zhangguanheng66
How could shuffling be done with the ListTabularDataset wrapper approach above? To my understanding, successfully training a neural network requires shuffling the data.
You might use the offsets to materialize the part of your data that fits into memory, like here. Do you use distributed data parallel for training the model? If so, you can send a different part of your data to each GPU.
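One way to realize both points (this is my own reading, not necessarily what the reply had in mind) is to rely on the DataLoader's sampler: an offset-based, random-access dataset such as ExampleDataset above can be shuffled by the sampler, and DistributedSampler hands each rank a different shard:

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# dataset: any offset-based, random-access Dataset (e.g. ExampleDataset above).
dataset = ExampleDataset(large_file_path, offset_dict)

# Single process: the sampler shuffles indices, the seeks do the rest.
loader = DataLoader(dataset, batch_size=64, shuffle=True)

# Distributed data parallel: assumes torch.distributed.init_process_group()
# has already been called; each rank then sees a different shard of indices.
sampler = DistributedSampler(dataset, shuffle=True)
dist_loader = DataLoader(dataset, batch_size=64, sampler=sampler)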
@akurniawan Hi, the f.readlines() call in your LazyTextDataset can also be memory-consuming.
Following @mttk's idea, I implemented a usable snippet:
from torch.utils.data.dataset import Dataset
from typing import Optional, Callable
import os
import multiprocessing

def apply_transform(transform: Callable, data):
    try:
        if isinstance(data, (list, tuple)):
            return [transform(item) for item in data]
        return transform(data)
    except Exception as e:
        raise RuntimeError(f'applying transform {transform}: {e}')

class MassiveDataset(Dataset):
    def __init__(self, filename, transform: Optional[Callable] = None):
        self.offset = []
        self.n_data = 0

        if not os.path.exists(filename):
            raise ValueError(f'filename does not exist: {filename}')

        with open(filename, 'rb') as fp:
            self.offset = [0]
            while fp.readline():
                self.offset.append(fp.tell())
            self.offset = self.offset[:-1]

        self.n_data = len(self.offset)

        self.filename = filename
        self.fd = open(filename, 'rb', buffering=0)
        self.lock = multiprocessing.Lock()
        self.transform = transform

    def __len__(self):
        return self.n_data

    def __getitem__(self, index: int):
        if index < 0:
            index = self.n_data + index

        with self.lock:
            self.fd.seek(self.offset[index])
            line = self.fd.readline()

        data = line.decode('utf-8').strip('\n')

        return apply_transform(self.transform, data) if self.transform is not None else data
NB: opening the file with buffering=0 and using multiprocessing.Lock() avoids reading bad data (usually a bit from one part of the file and a bit from another part of the file).
Additionally, if using multiprocessing in the DataLoader, one can get the exception TypeError: cannot serialize '_io.BufferedReader' object. This is caused by the pickle module used by multiprocessing: it cannot serialize _io.BufferedReader, but dill can. After replacing multiprocessing with multiprocess, things work fine (the major change compared with multiprocessing is enhanced serialization, done with dill).
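For reference, a usage sketch of the snippet above (the file name is a placeholder). With num_workers=0 the open file descriptor never has to be pickled; with workers, the multiprocess/dill swap just described is needed:

from torch.utils.data import DataLoader

# Hypothetical text file with one sample per line.
dataset = MassiveDataset("corpus.txt")

# num_workers=0 sidesteps pickling the dataset's open file descriptor.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)

for batch in loader:
    # each batch is a list of raw lines (no transform was given)
    ...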
I know this is an older thread, but for those still looking: you can improve the speed of collecting the offsets by reading the file in chunks. It's considerably faster.
from typing import List

def get_line_offsets(path: str, chunk_size: int = 2 ** 20) -> List[int]:
    offsets = [0]
    with open(path, "rb") as file:
        chunk = file.readlines(chunk_size)
        while chunk:
            for line in chunk:
                offsets.append(offsets[-1] + len(line))
            print(f"Lines found: {len(offsets) - 1}", end='\r')
            chunk = file.readlines(chunk_size)
    # Note: the final entry is the end-of-file offset, one past the last line.
    return offsets
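As a quick check of how this might plug into the offset-based datasets earlier in the thread (the file name is a placeholder and the file is assumed to have no header row):

offsets = get_line_offsets("large_corpus.txt")[:-1]  # drop the trailing EOF offset

offset_dict = dict(enumerate(offsets))
dataset = ExampleDataset("large_corpus.txt", offset_dict)
print(len(dataset), "lines indexed")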