
S3FileLoader clears contents of local file when s3 object name == local file relative path name

Open ringohoffman opened this issue 2 years ago • 2 comments

🐛 Describe the bug

When an S3 object name matches the relative path of a local file, the file's contents get cleared after loading the object data.

import torchdata
from torchdata.datapipes.iter import IterableWrapper, S3FileLoader


my_bucket = ...
local_file = ...
local_file_contents = "a,b,c,d\n1,2,3,4"

with open(local_file, "w") as outfile:
    outfile.write(local_file_contents)

with open(local_file, "r") as infile:
    file_text = infile.read()

print(file_text == local_file_contents)  # True

datapipe = IterableWrapper([f"s3://{my_bucket}/{local_file}"]).load_files_by_s3()

next(iter(datapipe))

with open(local_file, "r") as infile:
    file_text = infile.read()

print(file_text == local_file_contents)  # False
print(file_text == "")  # True

I am pretty sure it is because of this. I think it could probably be fixed by opening the stream on a temporary file instead of just using object_name_.c_str() as the file path.
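To make the suspected mechanism concrete, here is a minimal Python sketch that reproduces the symptom without touching S3. It is not torchdata's C++ code: `unsafe_fetch`, `safe_fetch`, and `demo` are hypothetical names, and the sketch assumes the handler opens a truncating write stream at a path equal to the object name. The second helper buffers through `tempfile.NamedTemporaryFile` instead, roughly as proposed above.

```python
import os
import tempfile


def unsafe_fetch(object_name: str, payload: bytes) -> bytes:
    # Assumption: a scratch stream is opened at a path equal to the object name.
    # Mode "wb" truncates any existing file that lives at that path.
    with open(object_name, "wb"):
        pass
    return payload


def safe_fetch(object_name: str, payload: bytes) -> bytes:
    # The scratch file gets a unique name in the system temp directory and is
    # deleted on close, so it does not reuse object_name as a path.
    with tempfile.NamedTemporaryFile(mode="w+b") as scratch:
        scratch.write(payload)
        scratch.seek(0)
        return scratch.read()


def demo() -> tuple:
    payload = b"a,b,c,d\n1,2,3,4"
    with tempfile.TemporaryDirectory() as workdir:
        local_file = os.path.join(workdir, "data.csv")
        with open(local_file, "wb") as outfile:
            outfile.write(payload)

        safe_fetch(local_file, payload)
        size_after_safe = os.path.getsize(local_file)

        unsafe_fetch(local_file, payload)
        size_after_unsafe = os.path.getsize(local_file)
    # Local file size after the safe path, then after the unsafe path.
    return size_after_safe, size_after_unsafe
```

With this sketch, the local file keeps its contents after `safe_fetch` and is empty after `unsafe_fetch`, which matches the behavior reported above.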

Versions

$ python collect_env.py 
Collecting environment information...
PyTorch version: 1.12.1+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.9.12 (main, Apr  5 2022, 06:56:58)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.0-122-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.5.119
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 495.29.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] facenet-pytorch==2.5.2
[pip3] mypy==0.971
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.0
[pip3] torch==1.12.1+cu113
[pip3] torchaudio==0.11.0+cu113
[pip3] torchdata==0.4.1
[pip3] torchvision==0.13.1+cu113
[conda] facenet-pytorch           2.5.2                    pypi_0    pypi
[conda] numpy                     1.22.0                   pypi_0    pypi
[conda] torch                     1.12.1+cu113             pypi_0    pypi
[conda] torchaudio                0.11.0+cu113             pypi_0    pypi
[conda] torchdata                 0.4.1                    pypi_0    pypi
[conda] torchvision               0.13.1+cu113             pypi_0    pypi

ringohoffman, Aug 16 '22 00:08

cc: @ydaiming

ejguan, Aug 16 '22 13:08

@ringohoffman Thanks for identifying the source of this behavior. Could you please create a PR based on this? Though much less likely, temp files may still replace local files. I'd suggest prepending a folder name to the downloaded data files, e.g. tmp or tmp/data. Thanks.

ydaiming, Aug 16 '22 17:08
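The folder-prefix suggestion in the last comment could look roughly like the following Python sketch. `scratch_path` is a hypothetical helper, not part of torchdata, and the default `tmp/data` root is just the example folder name from the comment. It does not guard against `..` components in the object name, and a collision is still possible if a local file already lives under the scratch root.

```python
from pathlib import Path


def scratch_path(object_name: str, root: str = "tmp/data") -> Path:
    """Hypothetical helper: place the scratch file for an S3 object under a dedicated folder."""
    # Drop leading slashes so the object name is always treated as relative
    # to the scratch root rather than as an absolute path.
    relative = object_name.lstrip("/")
    return Path(root) / relative


if __name__ == "__main__":
    # A local file at "dir/file.csv" no longer shares a path with the scratch file.
    print(scratch_path("dir/file.csv"))
```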