How to load a huge file of data?
Hi guys, are there any plans or tutorials that I can refer to for loading a huge file of data? I have two scenarios in mind that I would probably use:
- Assuming I have a huge dataset that has already been split into several files, how do I load all of those files on demand, one after another?
- Assuming I have a single huge file, how do I load its content on demand?
Thanks guys!
Same question. I have a big file, a large 2-D matrix of numbers, and it runs out of memory when I load it. I don't know how to use it for training.
One solution that I have tried is to use linecache from the Python standard library. I don't know, however, whether this is the correct way; maybe someone can give further comments, and say whether it is possible to integrate it with pytorch/text. Below is an example of a Dataset that I have created using linecache:
import csv
import linecache

from torch.utils.data import Dataset

class LazyTextDataset(Dataset):
    def __init__(self, filename):
        self._filename = filename
        self._total_data = 0
        with open(filename, "r") as f:
            self._total_data = len(f.readlines()) - 1

    def __getitem__(self, idx):
        line = linecache.getline(self._filename, idx + 1)
        csv_line = csv.reader([line])
        .....

    def __len__(self):
        return self._total_data
So far I have experienced no performance drawbacks while training my model with this code, and the GPU was effectively utilized as well.
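For anyone who wants to try this, here is a minimal usage sketch. The file name train.csv is a placeholder, and it assumes __getitem__ is completed to return something the default collate function can batch (for example, the raw line or the parsed fields):

from torch.utils.data import DataLoader

# "train.csv" is a hypothetical CSV file with a header row.
dataset = LazyTextDataset("train.csv")
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in loader:
    # each batch holds whatever __getitem__ returns for 32 lines
    ...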
How can I run this code with 8 GB of RAM?
from nltk.tokenize import word_tokenize
from torchtext import data, datasets
from torchtext.vocab import GloVe

TEXT = data.Field(batch_first=True, tokenize=word_tokenize, lower=True)
LABEL = data.Field(sequential=False, unk_token=None)
train, dev, test = datasets.SNLI.splits(TEXT, LABEL)
TEXT.build_vocab(train, vectors=GloVe(name='840B', dim=300))
LABEL.build_vocab(train)
Loading the GloVe word embeddings results in a memory error. Thanks!
I have an idea for loading data split across multiple files: basically, just create a wrapper around a list of datasets, as follows:
from torchtext.data import TabularDataset

class ListTabularDataset(object):
    def __init__(self, metas):
        self.tabular_datasets = []
        for meta in metas:
            self.tabular_datasets.append(TabularDataset(**meta))

    def __iter__(self):
        for dataset in self.tabular_datasets:
            for x in dataset.examples:
                yield x
This wrapper is really just a simple initial design on my part. I'm thinking of a more sophisticated wrapper that exposes the attributes of the underlying dataset based on the file it currently has open. Any thoughts?
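A usage sketch, assuming hypothetical train-00.csv and train-01.csv splits and that each meta dict simply holds the keyword arguments TabularDataset expects (path, format, fields):

from torchtext.data import Field

TEXT = Field(lower=True)
LABEL = Field(sequential=False)
fields = [("text", TEXT), ("label", LABEL)]

# Hypothetical file names for the pre-split data.
metas = [
    {"path": "train-00.csv", "format": "csv", "fields": fields},
    {"path": "train-01.csv", "format": "csv", "fields": fields},
]

dataset = ListTabularDataset(metas)
for example in dataset:
    print(example.text, example.label)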
For loading the data partially, I have been experimenting with loading partial data directly from Google Cloud Storage. However, after looking more deeply at the implementation of our Dataset, it seems a bit hard to integrate, as most of our concrete classes expect all examples to be loaded when the object is created. Shall I create a separate issue for this one?
I have the same initial request - is there any tutorial on how to load content on demand from a single large file?
You can also collect byte offsets for each line in a large file and store it in a dictionary.
offset_dict = {}
with open(large_file_path, 'rb') as f:
    f.readline()  # move over header
    for line in range(number_of_lines):
        offset = f.tell()
        offset_dict[line] = offset
        f.readline()  # advance to the next line so tell() moves forward
In your Dataset, you will need to seek to the offset and read the line. A Dataset can look like the following:
from torch.utils.data import Dataset

class ExampleDataset(Dataset):
    def __init__(self, large_file_path, offset_dict):
        self.large_file_path = large_file_path
        self.offset_dict = offset_dict

    def __len__(self):
        return len(self.offset_dict)

    def __getitem__(self, line):
        offset = self.offset_dict[line]
        with open(self.large_file_path, 'r', encoding='utf-8') as f:
            f.seek(offset)
            line = f.readline()
        return line
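Putting the two pieces together, a minimal sketch might look like this (large.tsv and the header assumption are hypothetical; the offset loop is a slight variation of the one above that does not need number_of_lines in advance):

from torch.utils.data import DataLoader

large_file_path = "large.tsv"   # hypothetical file with a header row

offset_dict = {}
with open(large_file_path, 'rb') as f:
    f.readline()                # skip the header
    line_no = 0
    while True:
        offset = f.tell()
        if not f.readline():
            break
        offset_dict[line_no] = offset
        line_no += 1

dataset = ExampleDataset(large_file_path, offset_dict)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

for batch in loader:            # each batch is a list of raw lines
    ...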
I'm facing this problem too. Although some interesting ideas have been suggested here, I am struggling to find a good solution that also works with creating fields and the vocabulary. Is there a solution to this problem that lets you build up fields and vocabs as well?
I have this problem when constructing the enwik9 dataset. Based on the discussion in this issue, I have some ideas:
- a list of sub-datasets
- a list of byte offsets, constructing sub-datasets based on the byte offsets (see the sketch after this list)
Any ideas? @fmassa @cpuhrsch @vincentqb @mttk
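To make the second idea a bit more concrete, here is a rough sketch of my own (not an existing torchtext API; the chunk size and the precomputed offset list are assumptions) that slices a list of line offsets into ranges and exposes each range as a sub-dataset:

from torch.utils.data import ConcatDataset, Dataset

class OffsetRangeDataset(Dataset):
    """Exposes the lines whose byte offsets fall in offsets[start:end]."""

    def __init__(self, path, offsets, start, end):
        self.path = path
        self.offsets = offsets[start:end]

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            return f.readline().decode("utf-8")

def build_subdatasets(path, offsets, chunk_size=100_000):
    # One sub-dataset per chunk of offsets, concatenated into a single view.
    return ConcatDataset([
        OffsetRangeDataset(path, offsets, i, min(i + chunk_size, len(offsets)))
        for i in range(0, len(offsets), chunk_size)
    ])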
Right now, Dataset classes usually function by loading and storing data and constructing vocabularies at creation time. To work with datasets too large to fit into memory, the storing part should be made optional. There are two separate issues in this thread: one is working with massive files, and the other is working with multiple input files. I'll comment on the former with respect to the new dataset patterns.
Massive datasets
The new class (subclass?) for large datasets should do the same thing as the regular dataset classes do on the first pass: go through the entire dataset, construct vocabularies, and collect metadata (byte offsets, number of instances), but not store anything in the data or instances attribute of the Dataset class.
The dataset class should have an open file stream in place of the data attribute, and the __iter__ and __getitem__ methods can be implemented to work with a stream in place of a list.
Something like this could illustrate the behavior. This "magic" line_to_instance method would need pre-built vocabularies, which should be constructed in the first pass over the data (similar to what happens right now: https://github.com/pytorch/text/blob/master/torchtext/datasets/text_classification.py#L126).
from torch.utils.data import Dataset

class MassiveDataset(Dataset):
    def __init__(self, data_path, line_to_instance, dataset_metadata):
        """Initiate a text-classification dataset.

        Arguments:
            data_path: path to the file with data.
            line_to_instance: a method converting a line of a file
                to a dataset instance
            dataset_metadata: information required to imitate an in-memory
                dataset (length, offsets, ...)
        """
        self.data_path = data_path
        # should be reset in __iter__
        self.data_stream = open(data_path, 'r')
        self.current_offset = 0
        self.meta = dataset_metadata
        self.line_to_instance = line_to_instance

    def __len__(self):
        # 'length' is the number of instances collected in the first pass
        return self.meta['length']

    def __getitem__(self, line):
        offset = self.meta['offset_dict'][line]
        self.data_stream.seek(offset)
        line = self.data_stream.readline()
        instance = self.line_to_instance(line)
        # reset to previous location for iteration
        self.data_stream.seek(self.current_offset)
        return instance

    def __next__(self):
        line = self.data_stream.readline()
        self.current_offset = self.data_stream.tell()
        return self.line_to_instance(line)
This lacks the logic of resetting the file stream, but it should illustrate the idea.
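For the first pass mentioned above, the metadata construction could look roughly like the sketch below. The tokenizer and the plain Counter are placeholders of mine, not the actual torchtext vocabulary machinery, but the resulting dict matches what the MassiveDataset sketch expects in dataset_metadata:

from collections import Counter

def build_metadata(data_path, tokenize=str.split):
    """One pass over the file: collect byte offsets and token counts."""
    counter = Counter()
    offsets = []
    with open(data_path, "rb") as f:
        offset = f.tell()
        line = f.readline()
        while line:
            offsets.append(offset)
            counter.update(tokenize(line.decode("utf-8").strip()))
            offset = f.tell()
            line = f.readline()
    return {
        "length": len(offsets),
        "offset_dict": dict(enumerate(offsets)),
        "counter": counter,   # can be turned into a vocabulary afterwards
    }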
cc @zhangguanheng66
How could shuffling be done with the ListTabularDataset wrapper approach above? To my understanding, successfully training a neural network requires shuffling the data.
You might use the offsets to materialize the part of your data that fits into memory, like here. Do you use distributed data parallel for training the model? If so, you can send a different part of your data to each GPU.
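One way to realize both points (this is my own reading, not necessarily what the reply had in mind) is to rely on the DataLoader's sampler: an offset-based, random-access dataset such as ExampleDataset above can be shuffled by the sampler, and DistributedSampler hands each rank a different shard:

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# dataset: any offset-based, random-access Dataset (e.g. ExampleDataset above).
dataset = ExampleDataset(large_file_path, offset_dict)

# Single process: the sampler shuffles indices, the seeks do the rest.
loader = DataLoader(dataset, batch_size=64, shuffle=True)

# Distributed data parallel: assumes torch.distributed.init_process_group()
# has already been called; each rank then sees a different shard of indices.
sampler = DistributedSampler(dataset, shuffle=True)
dist_loader = DataLoader(dataset, batch_size=64, sampler=sampler)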
@akurniawan Hi, the f.readlines() call in your LazyTextDataset can also be memory-consuming.
Following @mttk's idea, I implemented a usable snippet:
from torch.utils.data.dataset import Dataset
from typing import Optional, Callable
import os
import multiprocessing

def apply_transform(transform: Callable, data):
    try:
        if isinstance(data, (list, tuple)):
            return [transform(item) for item in data]
        return transform(data)
    except Exception as e:
        raise RuntimeError(f'applying transform {transform}: {e}')

class MassiveDataset(Dataset):
    def __init__(self, filename, transform: Optional[Callable] = None):
        self.offset = []
        self.n_data = 0

        if not os.path.exists(filename):
            raise ValueError(f'filename does not exist: {filename}')

        with open(filename, 'rb') as fp:
            self.offset = [0]
            while fp.readline():
                self.offset.append(fp.tell())
            self.offset = self.offset[:-1]

        self.n_data = len(self.offset)

        self.filename = filename
        self.fd = open(filename, 'rb', buffering=0)
        self.lock = multiprocessing.Lock()
        self.transform = transform

    def __len__(self):
        return self.n_data

    def __getitem__(self, index: int):
        if index < 0:
            index = self.n_data + index

        with self.lock:
            self.fd.seek(self.offset[index])
            line = self.fd.readline()

        data = line.decode('utf-8').strip('\n')

        return apply_transform(self.transform, data) if self.transform is not None else data
NB: opening the file with buffering=0 and using multiprocessing.Lock() avoids reading bad data (usually a bit from one part of the file and a bit from another part of the file).
Additionally, if using multiprocessing in the DataLoader, one can get the exception TypeError: cannot serialize '_io.BufferedReader' object. This is caused by the pickle module used by multiprocessing: it cannot serialize _io.BufferedReader, but dill can. After replacing multiprocessing with multiprocess, things work fine (the major change compared with multiprocessing is enhanced serialization, done with dill).
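For reference, a usage sketch of the snippet above (the file name is a placeholder). With num_workers=0 the open file descriptor never has to be pickled; with workers, the multiprocess/dill swap just described is needed:

from torch.utils.data import DataLoader

# Hypothetical text file with one sample per line.
dataset = MassiveDataset("corpus.txt")

# num_workers=0 sidesteps pickling the dataset's open file descriptor.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)

for batch in loader:
    # each batch is a list of raw lines (no transform was given)
    ...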
I know this is an older thread, but for those still looking: you can improve the speed of collecting the offsets by reading the file in chunks. It's considerably faster.
from typing import List

def get_line_offsets(path: str, chunk_size: int = 2 ** 20) -> List[int]:
    offsets = [0]
    with open(path, "rb") as file:
        chunk = file.readlines(chunk_size)
        while chunk:
            for line in chunk:
                offsets.append(offsets[-1] + len(line))
            print(f"Lines found: {len(offsets) - 1}", end='\r')
            chunk = file.readlines(chunk_size)
    # Note: the final entry is the end-of-file offset, one past the last line.
    return offsets
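As a quick check of how this might plug into the offset-based datasets earlier in the thread (the file name is a placeholder and the file is assumed to have no header row):

offsets = get_line_offsets("large_corpus.txt")[:-1]  # drop the trailing EOF offset

offset_dict = dict(enumerate(offsets))
dataset = ExampleDataset("large_corpus.txt", offset_dict)
print(len(dataset), "lines indexed")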