
Slow dataloading with big datasets issue persists

Open hwijeen opened this issue 3 years ago • 42 comments

Hi,

I reported slow data fetching when the data is large (#2210) a couple of weeks ago, and @lhoestq referred me to the fix (#2122). However, the problem seems to persist. Here are the profiling results:

  1. Running with 60GB
Action                             	|  Mean duration (s)	|Num calls      	|  Total time (s) 	|  Percentage %   	|
------------------------------------------------------------------------------------------------------------------------------------
Total                              	|  -              	|_              	|  517.96         	|  100 %          	|
------------------------------------------------------------------------------------------------------------------------------------
model_backward                     	|  0.26144        	|100            	|  26.144         	|  5.0475         	|
model_forward                      	|  0.11123        	|100            	|  11.123         	|  2.1474         	|
get_train_batch                    	|  0.097121       	|100            	|  9.7121         	|  1.8751         	|
  2. Running with 600GB, datasets==1.6.0
Action                             	|  Mean duration (s)	|Num calls      	|  Total time (s) 	|  Percentage %   	|
------------------------------------------------------------------------------------------------------------------------------------
Total                              	|  -              	|_              	|  4563.2         	|  100 %          	|
------------------------------------------------------------------------------------------------------------------------------------
get_train_batch                    	|  5.1279         	|100            	|  512.79         	|  11.237         	|
model_backward                     	|  4.8394         	|100            	|  483.94         	|  10.605         	|
model_forward                      	|  0.12162        	|100            	|  12.162         	|  0.26652        	|

I see that get_train_batch lags when the data is large. Could this be caused by a different issue? I would be happy to provide any information needed to investigate.
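(For reference, these tables come from PyTorch Lightning's simple profiler; roughly speaking, they can be reproduced with something like the sketch below, where MyModel is the LightningModule I share later in this thread and max_steps matches the 100 profiled calls.)

import pytorch_lightning as pl

# The "simple" profiler prints mean/total duration per action
# (get_train_batch, model_forward, model_backward, ...) at the end of fit().
trainer = pl.Trainer(profiler="simple", max_steps=100)
trainer.fit(MyModel())  # MyModel: the LightningModule defined later in this thread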

hwijeen avatar Apr 23 '21 08:04 hwijeen

Hi ! Sorry to hear that. This may come from another issue then.

First, can we check whether this latency comes from the dataset itself? Could you try loading your dataset and benchmarking the speed of querying random examples from it?

import time
import numpy as np

from datasets import load_from_disk

dataset = load_from_disk(...) # or from load_dataset...

_start = time.time()
n = 100
for i in np.random.default_rng(42).integers(0, len(dataset), size=n):
    _ = dataset[i]
print(time.time() - _start)

If we see a significant speed difference between your two datasets, then it would mean that there's an issue somewhere.

lhoestq avatar Apr 23 '21 10:04 lhoestq

Hi @lhoestq, here are the results. I additionally measured the time taken by load_from_disk:

  • 60GB
loading took:  22.618776321411133
random indexing 100 times took: 0.10214924812316895
  • 600GB
loading took:  1176.1764674186707
random indexing 100 times took: 2.853600025177002

Hmm.. I double-checked that it's version 1.6.0. The difference seems quite big; could it be related to the running environment?

hwijeen avatar Apr 23 '21 11:04 hwijeen

I'm surprised by the speed difference. Can you give more details about your dataset? The speed depends on the number of batches in the Arrow tables and the distribution of the batch lengths. You can access the batches by doing dataset.data.to_batches() (use this only for debugging; it doesn't bring the data into memory).
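For example, something like the sketch below should show the number of batches and the distribution of their lengths (the path is just a placeholder):

import numpy as np
from datasets import load_from_disk

dataset = load_from_disk("path/to/dataset")  # placeholder path

# Inspect the Arrow record batches backing the dataset (debugging only;
# to_batches() does not load the data into memory).
batches = dataset.data.to_batches()
num_rows = np.array([b.num_rows for b in batches])
print(f"number of batches: {len(batches)}")
print(f"rows per batch: min={num_rows.min()}, mean={num_rows.mean():.1f}, max={num_rows.max()}")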

Also, can you explain what parameters you used if you used map calls? And if you have some code that reproduces the issue, I'd be happy to investigate it.

lhoestq avatar Apr 23 '21 12:04 lhoestq

Also, could you give us more info about your environment, like your OS, your version of pyarrow, and whether you're using an HDD or an SSD?

lhoestq avatar Apr 23 '21 12:04 lhoestq

Here are some details of my 600GB dataset. This is the dataset AFTER the map function, and once I load it, I do not use map anymore during training. Regarding the distribution of the lengths, it is almost uniform (90% of examples are 512 tokens, and 10% are randomly shorter than that -- a typical setting for language modeling).

len(batches):
492763

batches[0]: 
pyarrow.RecordBatch
attention_mask: list<item: uint8>
  child 0, item: uint8
input_ids: list<item: int16>
  child 0, item: int16
special_tokens_mask: list<item: uint8>
  child 0, item: uint8
token_type_ids: list<item: uint8>
  child 0, item: uint8

Here are some parameters passed to the map function, just in case they are relevant:

num_proc=1    # as multi processing is slower in my case
load_from_cache_file=False
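For reference, a map call with these parameters would look roughly like the sketch below; the tokenize_function body, the "text" column name, and the placeholder paths are hypothetical and not my exact code.

import datasets
import transformers

# Hypothetical reconstruction; only num_proc and load_from_cache_file come from
# the parameters listed above, the rest is a guess.
tok_path = "path/to/electra/tokenizer"   # placeholder (same tok_path as in my training code)
path = "path/to/raw/dataset"             # placeholder

tokenizer = transformers.ElectraTokenizerFast.from_pretrained(tok_path)

def tokenize_function(examples):
    # Placeholder: the real tokenization/grouping logic is not shown in this thread.
    return tokenizer(examples["text"], truncation=True, max_length=512)

dataset = datasets.load_from_disk(path)
dataset = dataset.map(
    tokenize_function,
    batched=True,                 # assumption, not stated above
    num_proc=1,                   # multiprocessing was slower in my case
    load_from_cache_file=False,
)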

hwijeen avatar Apr 26 '21 02:04 hwijeen

Regarding the environment, I am running the code on a cloud server. Here is some info:

Ubuntu 18.04.5 LTS   # cat /etc/issue
pyarrow                 3.0.0  # pip list | grep pyarrow

The data is stored on an SSD which is mounted to the machine via NFS (Network File System).

If you could point me to some of the commands to check the details of the environment, I would be happy to provide relevant information @lhoestq !

hwijeen avatar Apr 26 '21 02:04 hwijeen

I am not sure how I could provide you with reproducible code, since the problem only arises when the data is big. For the moment, I will share the part that I think is relevant. Feel free to ask me for more info.

import datasets
import pytorch_lightning
import torch
import transformers


class MyModel(pytorch_lightning.LightningModule):
    def setup(self, stage):
        # Load the preprocessed dataset and return PyTorch tensors when indexing
        self.dataset = datasets.load_from_disk(path)
        self.dataset.set_format("torch")

    def train_dataloader(self):
        collate_fn = transformers.DataCollatorForLanguageModeling(
            tokenizer=transformers.ElectraTokenizerFast.from_pretrained(tok_path)
        )
        dataloader = torch.utils.data.DataLoader(
            self.dataset,
            batch_size=32,
            collate_fn=collate_fn,
            num_workers=8,
            pin_memory=True,
        )
        return dataloader

hwijeen avatar Apr 26 '21 02:04 hwijeen

Hi! Sorry for the delay, I haven't had a chance to take a look at this yet. Are you still experiencing this issue? I'm asking because the latest patch release, 1.6.2, fixed a few memory issues that could have led to slowdowns.

lhoestq avatar May 10 '21 16:05 lhoestq

Hi! I just ran the same code with two different datasets (one is 60 GB and the other 600 GB), and the latter runs much slower; the ETA differs by 10x.

hwijeen avatar May 19 '21 02:05 hwijeen

@lhoestq and @hwijeen

Despite upgrading to datasets 1.6.2, I am still experiencing extremely slow loading (about 2h00) of a 300GB local dataset (shard size 1.1GB) on a local HDD (40MB/s read speed). This corresponds almost exactly to the total data size divided by the read speed, implying that the entire dataset is read at each load.

Stack details:

GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration: GPU 0: GeForce GTX 1050
Nvidia driver version: 457.63
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin\cudnn64_7.dll
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] datasets==1.6.2
[pip3] transformers==4.5.1
[pip3] numpy==1.19.1
[pip3] numpydoc==1.1.0
[pip3] pytorch-metric-learning==0.9.98
[pip3] torch==1.8.1
[pip3] torchaudio==0.8.1
[pip3] torchvision==0.2.2
[conda] blas 2.16 mkl conda-forge
[conda] cudatoolkit 10.2.89 hb195166_8 conda-forge
[conda] libblas 3.8.0 16_mkl conda-forge
[conda] libcblas 3.8.0 16_mkl conda-forge
[conda] liblapack 3.8.0 16_mkl conda-forge
[conda] liblapacke 3.8.0 16_mkl conda-forge
[conda] mkl 2020.1 216
[conda] numpy 1.19.1 py37hae9e721_0 conda-forge
[conda] numpydoc 1.1.0 py_1 conda-forge
[conda] pytorch 1.8.1 py3.7_cuda10.2_cudnn7_0 pytorch
[conda] pytorch-metric-learning 0.9.98 pyh39e3cac_0 metric-learning
[conda] torchaudio 0.8.1 py37 pytorch
[conda] torchvision 0.2.2 py_3 pytorch

BenoitDalFerro avatar May 19 '21 10:05 BenoitDalFerro

Hi @BenoitDalFerro how do your load your dataset ?

lhoestq avatar May 19 '21 10:05 lhoestq

Hi @lhoestq, thanks for the quick turnaround. Actually, the plain vanilla way, without any particular knack or fashion. I tried to look into the documentation for an alternative but couldn't find any:

dataset = load_from_disk(dataset_path=os.path.join(datasets_dir,dataset_dir))

BenoitDalFerro avatar May 19 '21 11:05 BenoitDalFerro

I’m facing the same issue when loading a 900GB dataset (stored via save_to_disk): load_from_disk(path_to_dir) takes 1.5 hours and htop consistently shows high IO rates > 120 M/s.

tsproisl avatar May 20 '21 19:05 tsproisl

@tsproisl same here, smells like ~~teen spirit~~ intended generator inadvertently ending up iterator

@lhoestq perhaps a way to locate the bug in the code is to track its signature via HD read-usage monitoring; one option is to add a tracking decorator on top of each function and sequentially close all hatches from top to bottom. I suggest pySMART https://pypi.org/project/pySMART/, a wrapper around smartmontools.
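For illustration, a minimal version of such a tracking decorator could be built on psutil rather than pySMART; the sketch below is just that, a sketch, and the names are made up:

import functools
import psutil

def track_disk_reads(fn):
    """Print how many system-wide disk read operations happen during a call (sketch)."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        before = psutil.disk_io_counters().read_count
        result = fn(*args, **kwargs)
        after = psutil.disk_io_counters().read_count
        print(f"{fn.__name__}: {after - before} disk read ops")
        return result
    return wrapper

# Usage sketch: decorate the suspect functions from top to bottom and compare counts.
# @track_disk_reads
# def load_my_dataset(path):
#     return datasets.load_from_disk(path)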

BenoitDalFerro avatar May 20 '21 20:05 BenoitDalFerro

I wasn't able to reproduce this on a toy dataset of around 300GB:

import datasets as ds

s = ds.load_dataset("squad", split="train")
s4000 = ds.concatenate_datasets([s] * 4000)
print(ds.utils.size_str(s4000.data.nbytes))  # '295.48 GiB'

s4000.save_to_disk("tmp/squad_4000")
import psutil
import time
from datasets import load_from_disk

disk = "disk0"  # You may have to change your disk here
iocnt1 = psutil.disk_io_counters(perdisk=True)[disk]
time1 = time.time()

s4000_reloaded = load_from_disk("tmp/squad_4000")

time2 = time.time()
iocnt2 = psutil.disk_io_counters(perdisk=True)[disk]

print(f"Blocks read {iocnt2.read_count - iocnt1.read_count}")  # Blocks read 18
print(f"Elapsed time: {time2 - time1:.02f}s")  # Elapsed time: 14.60s

Could you run this on your side and tell me how much time it takes? Please run it when your machine is idle so that other processes don't interfere.

I got these results on my macbook pro on datasets 1.6.2

lhoestq avatar May 21 '21 14:05 lhoestq

@lhoestq thanks, test running as we speak, bear with me

BenoitDalFerro avatar May 21 '21 15:05 BenoitDalFerro

I just tried on Google Colab and got ~1min for a 15GB dataset (only 200 times SQuAD), while it should be instantaneous. The time is spent reading the Apache Arrow table from the memory-mapped file. This might come from a virtual disk management issue. I'm trying to see if I can still speed it up on Colab.

lhoestq avatar May 21 '21 15:05 lhoestq

@lhoestq what is Google Colab's disk read speed? Is it possible to introspect it, including the drive type (SSD or HDD)?

BenoitDalFerro avatar May 21 '21 16:05 BenoitDalFerro

@lhoestq Thank you! The issue is getting more interesting. The second script is still running, but it's definitely taking much longer than 15 seconds.

tsproisl avatar May 21 '21 16:05 tsproisl

Okay, here's the output:
Blocks read 158396
Elapsed time: 529.10s

Also using datasets 1.6.2. Do you have any ideas on how to pinpoint the problem?

tsproisl avatar May 21 '21 16:05 tsproisl

@lhoestq, @tsproisl mmmh, still writing on my side, about 1h to go. Thinking about it: are your large datasets all monolithic and unsharded? Mine is 335 shards of 1.18GB each.

BenoitDalFerro avatar May 21 '21 16:05 BenoitDalFerro

The 529.10s was a bit too optimistic. I had cancelled the reading process once before it ran to completion, so the hard drive cache probably did its work.

Here are three consecutive runs:
First run (freshly written to disk): Blocks read 309702, Elapsed time: 1267.74s
Second run (immediately after): Blocks read 113944, Elapsed time: 417.55s
Third run (immediately after): Blocks read 42518, Elapsed time: 199.19s

tsproisl avatar May 21 '21 18:05 tsproisl

@lhoestq First test

elapsed time: 11219.05s

Second test running, bear with me. For Windows users, here is a slight trick to modify the original "disk0" string:

First, find the relevant physical drive key in the dictionary:

import psutil
psutil.disk_io_counters(perdisk=True)

{'PhysicalDrive0': sdiskio(read_count=18453286, write_count=4075333, read_bytes=479546467840, write_bytes=161590275072, read_time=20659, write_time=2464), 'PhysicalDrive1': sdiskio(read_count=1495778, write_count=388781, read_bytes=548628622336, write_bytes=318234849280, read_time=426066, write_time=19085)}

In my case it's PhysicalDrive1

Then insert the relevant key's string as the disk variable:

import time

import psutil
from datasets import load_from_disk

disk = 'PhysicalDrive1'  # You may have to change your disk here
iocnt1 = psutil.disk_io_counters(perdisk=True)[disk]
time1 = time.time()

s4000_reloaded = load_from_disk("your path here")

time2 = time.time()
iocnt2 = psutil.disk_io_counters(perdisk=True)[disk]

print(f"Blocks read {iocnt2.read_count - iocnt1.read_count}")
print(f"Elapsed time: {time2 - time1:.02f}s")

BenoitDalFerro avatar May 21 '21 21:05 BenoitDalFerro

@lhoestq Second test

Blocks read 1265609
Elapsed time: 11216.55s

BenoitDalFerro avatar May 22 '21 08:05 BenoitDalFerro

@lhoestq any luck ?

BenoitDalFerro avatar May 26 '21 14:05 BenoitDalFerro

Unfortunately no. Thanks for running the benchmark though; it shows that your machine does a lot of read operations. This is not expected: on other machines it does almost no read operations, which enables very fast loading.

I did some tests on Google Colab and have the same issue. The first time the dataset Arrow file is memory-mapped, it always takes a lot of time (the time seems linear with respect to the dataset size). Reloading the dataset is then instantaneous, since the Arrow file has already been memory-mapped.

I also tried using the Arrow IPC file format (see #1933) instead of the current streaming format that we use, but it didn't help.

Memory mapping is handled by the OS and depends on the disk you're using, so I'm not sure we can do much about it. I'll continue to investigate anyway, because I still don't know why in some cases it goes through the entire file (high "Blocks read", as in your tests) and in other cases it does almost no reading.
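To separate the datasets-level overhead from the OS-level memory mapping, one rough check could be to time the memory-mapped read of a single Arrow file directly with pyarrow. The sketch below assumes the file uses the Arrow streaming format that save_to_disk currently writes, and the path is only a placeholder:

import time

import pyarrow as pa

arrow_path = "tmp/squad_4000/dataset.arrow"  # placeholder: one Arrow file written by save_to_disk

start = time.time()
source = pa.memory_map(arrow_path)              # memory-map the file (no copy into RAM)
table = pa.ipc.open_stream(source).read_all()   # materialize the Table from the stream
print(f"memory-mapped read took {time.time() - start:.2f}s for {table.num_rows} rows")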

lhoestq avatar May 26 '21 15:05 lhoestq

@lhoestq thanks for the effort, let's stay in touch

BenoitDalFerro avatar May 27 '21 08:05 BenoitDalFerro

Just want to say that I am seeing the same issue. The dataset size is 268GB and it takes 3 hours to load with load_from_disk, using datasets version 1.9.0. The filesystem underneath is Lustre.

gurvindersingh avatar Jul 20 '21 15:07 gurvindersingh

Hi @lhoestq, confirmed this is a Windows issue: the exact same code running on Linux has a total loading time of about 3 minutes.

BenoitDalFerro avatar Jul 23 '21 08:07 BenoitDalFerro

Hmm that's different from what I got. I was on Ubuntu when reporting the initial issue.

hwijeen avatar Jul 23 '21 08:07 hwijeen