
CacheHolder buffer size limited to 1000

Open rehno-lindeque opened this issue 1 year ago • 7 comments

🐛 Describe the bug

The .on_disk_cache() data pipe uses .demux() under the hood with a default buffer_size of 1000.

Unfortunately this appears to break when the source datapipe has more than 1000 elements. See https://github.com/pytorch/data/blob/v0.4.1/torchdata/datapipes/iter/util/cacheholder.py#L251-L260
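
To see the underlying demux behavior in isolation, here is a minimal sketch with a deliberately tiny buffer_size (this is not the exact code path inside OnDiskCacheHolder, just the same failure mode):

from torchdata.datapipes.iter import IterableWrapper

source = IterableWrapper(range(1001))
evens, odds = source.demux(
    num_instances=2,
    classifier_fn=lambda x: x % 2,
    buffer_size=5,  # deliberately tiny so the overflow is easy to trigger
)
# Draining one child first forces demux to buffer every element destined for
# the other child; once the buffer limit is exceeded, BufferError is raised.
list(evens)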

For example, something like this should reproduce the problem:

import tempfile

from torchdata import datapipes

temp_dir = tempfile.gettempdir()
caching_datapipe = datapipes.iter.IterableWrapper(
    range(1001)  # more than 1000 elements, i.e. larger than demux's default buffer
)
caching_datapipe = caching_datapipe.on_disk_cache(
    filepath_fn=lambda i: f"{temp_dir}/{i}.tmp"
)

# caching_datapipe = caching_datapipe.map(fn=_get_from_http)  # (not needed for reproducing)

caching_datapipe = caching_datapipe.end_caching(
    filepath_fn=lambda i: f"{temp_dir}/{i}.tmp"
)

for element in caching_datapipe:
    print(element)

The stack trace ends with:

line 357, in _find_next
    raise BufferError(
BufferError: DemultiplexerIterDataPipe buffer overflow, buffer size 1000 is insufficient.
This exception is thrown by __iter__ of MapperIterDataPipe(datapipe=_ChildDataPipe, fn=<lambda>, input_col=None, output_col=None)

(I haven't been able to find a suitable workaround for this, since the demux is hidden inside the implementation of OnDiskCacheHolder. Any ideas or advice for temporarily working around the issue would also be highly appreciated.)

Versions

Collecting environment information...
PyTorch version: 1.12.1+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: NixOS 22.05 (Quokka) (x86_64)
GCC version: (GCC) 11.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.6 (main, Aug  1 2022, 20:38:21) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.18.19-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060
Nvidia driver version: 515.48.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy==0.971
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.1
[pip3] torch==1.12.1+cu116
[pip3] torchdata==0.4.1
[pip3] torchvision==0.13.1+cu116
[conda] Could not collect

rehno-lindeque avatar Sep 19 '22 19:09 rehno-lindeque

Thank you for opening the issue. The reason demux's buffer blows up is that we yield the cached data first and the todo data afterwards. See: https://github.com/pytorch/data/blob/983e87ada583b7a58d13a1a5f047dd9d256155dd/torchdata/datapipes/iter/util/cacheholder.py#L424

However, it seems odd to me to change the order so that the todo data is yielded before the cached data. A proper solution might be to add a buffer_size argument to end_caching.
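
Roughly what I have in mind on the user side (hypothetical; buffer_size is not an argument of end_caching in torchdata 0.4.1):

caching_datapipe = caching_datapipe.end_caching(
    filepath_fn=lambda i: f"{temp_dir}/{i}.tmp",
    buffer_size=10_000,  # hypothetical new argument, forwarded to the internal demux
)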

ejguan avatar Sep 19 '22 19:09 ejguan

Personally I think I would expect caching to be FIFO with respect to the source datapipe, in order to be as deterministic as possible.

In other words, if the on-disk cache can reorder data in the pipe as a side effect, then surely this harms reproducibility (i.e. a previous run can affect the outcome of the current run).

(Instead of demux, maybe a sum type is needed to properly retain the original ordering of the source datapipe? EDIT: I suppose this would require some kind of filtered "view" of the pipe to maintain the current API...)
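
Rough illustration of what I mean by a sum type, in plain Python (a conceptual sketch only, not torchdata code): each element stays in a single ordered stream, tagged as either cached or todo, instead of being split into two child pipes.

from dataclasses import dataclass

@dataclass
class Cached:   # element already materialized on disk
    filepath: str

@dataclass
class Todo:     # element that still has to be produced and written
    value: object

def tag_elements(source, filepath_fn, is_cached):
    # Single pass over a single stream: source order is preserved, and no
    # side buffer has to hold one class of elements while the other drains.
    for x in source:
        yield Cached(filepath_fn(x)) if is_cached(x) else Todo(x)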

rehno-lindeque avatar Sep 20 '22 02:09 rehno-lindeque

Personally I think I would expect caching to be FIFO with respect to the source datapipe, in order to be as deterministic as possible.

Agree. Will take a look later today

ejguan avatar Sep 20 '22 14:09 ejguan

Thank you, I appreciate the support!

rehno-lindeque avatar Sep 20 '22 16:09 rehno-lindeque

Did some quick research and it looks doable: adding a DataPipe that joins two DataPipes with a key_fn should do the job. It does require some changes in CacheHolder, and I need to discuss the usage of this kind of DataPipe with other users to consolidate the API.
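
Conceptually something like this (a plain-Python sketch of the join idea, not the actual changes to CacheHolder): results from the cached and todo branches arrive keyed by key_fn and are re-emitted in the order of the original source.

def join_in_source_order(source, keyed_results, key_fn):
    # keyed_results yields (key, result) pairs produced by the cached/todo
    # branches in whatever order they finish; we re-emit them in source order.
    pending = {}
    results = iter(keyed_results)
    for item in source:
        key = key_fn(item)
        while key not in pending:
            k, r = next(results)
            pending[k] = r
        yield pending.pop(key)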

ejguan avatar Sep 20 '22 20:09 ejguan