LitData doesn't support s3 bucket connection outside server

Open sanyalsunny111 opened this issue 1 year ago • 12 comments

🚀 Feature

LitData should support S3 bucket connections for streaming data from outside the same server.

Motivation

Currently, LitData supports S3 bucket connections from within the public prod server, but not from outside it, for instance from a GCP server.

Additional context

Sebastian and Adrian motivated me to raise this issue.

sanyalsunny111 avatar Jun 25 '24 15:06 sanyalsunny111

Hi! Thanks for your contribution! Great first issue!

github-actions[bot] avatar Jun 25 '24 15:06 github-actions[bot]

Hey @sanyalsunny111,

I am not sure I fully understand the issue.

tchaton avatar Jun 25 '24 17:06 tchaton

@sanyalsunny111, could you provide concrete code snippets and file paths (and Studio names) so that @tchaton has a concrete example to follow?

rasbt avatar Jun 25 '24 17:06 rasbt

Acknowledged, I will do it shortly.

sanyalsunny111 avatar Jun 25 '24 18:06 sanyalsunny111

@tchaton So, a dataset is uploaded to a publicly accessible S3 bucket and is also in the data prep of a teamspace. I have tried accessing this data using the Studio's public prod profile. However, when I try to use the same data through S3 (yes, I have configured it through the AWS CLI) or the teamspace, I can't access it. Below is a screenshot where it is asking for an access key.

[Screenshot: prompt asking for an access key]

sanyalsunny111 avatar Jun 27 '24 17:06 sanyalsunny111

Hey @sanyalsunny111. Can you share a reproducible script?

tchaton avatar Jun 27 '24 18:06 tchaton

Sure, @tchaton. I am using litgpt with no changes. Here is a Loom video I recorded: https://www.loom.com/share/5b55bc4c23e3403ea3257cdf34ceab2e?sid=761c670b-d52d-465e-bafe-d86be5d239cb

sanyalsunny111 avatar Jun 27 '24 19:06 sanyalsunny111

Hey @sanyalsunny111. Is there a Studio I can duplicate?

tchaton avatar Jun 27 '24 22:06 tchaton

Here: /thunder/Experiments-Sunny2024

sanyalsunny111 avatar Jun 28 '24 15:06 sanyalsunny111

@tchaton Luca made some modifications, and it is working fine for me now, so I thought I'd update you. He changed the lines below in /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/streaming/client.py:

if has_shared_credentials_file or not _IS_IN_STUDIO or True:  # "or True" forces this branch
    self._client = boto3.client(
        "s3",
        config=botocore.config.Config(
            retries={"max_attempts": 1000, "mode": "adaptive"},
            signature_version=botocore.UNSIGNED,  # send unsigned (anonymous) requests
        ),
    )
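
The `or True` in the patch forces unsigned requests unconditionally, so requests to private buckets would no longer be signed either. A rough sketch of a less blunt variant, which is not LitData's actual implementation: go unsigned only when no static credentials are found. `has_aws_credentials` and `make_s3_client` are hypothetical helper names, and the check assumes credentials come only from environment variables or the shared credentials file.

```python
import os
from pathlib import Path


def has_aws_credentials(env=None, home=None):
    """Return True if static AWS credentials appear to be configured.

    Only environment variables and the shared credentials file are checked.
    Other sources (SSO, instance roles) are ignored in this sketch.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return True
    creds_file = env.get("AWS_SHARED_CREDENTIALS_FILE") or home / ".aws" / "credentials"
    return Path(creds_file).is_file()


def make_s3_client(unsigned):
    """Create a boto3 S3 client; requests are anonymous when `unsigned` is True."""
    import boto3  # imported lazily so the helper above has no dependencies
    from botocore import UNSIGNED
    from botocore.config import Config

    retries = {"max_attempts": 1000, "mode": "adaptive"}
    if unsigned:
        return boto3.client("s3", config=Config(retries=retries, signature_version=UNSIGNED))
    return boto3.client("s3", config=Config(retries=retries))
```

Usage would be along the lines of `client = make_s3_client(unsigned=not has_aws_credentials())`.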

sanyalsunny111 avatar Jun 28 '24 16:06 sanyalsunny111

Hey @sanyalsunny111. Can you make a PR with the fix?

tchaton avatar Jun 30 '24 10:06 tchaton

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 16 '25 05:04 stale[bot]

With storage_options support in LitData, unsigned S3 requests can be handled easily by passing a custom config via storage_options.

Example:

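from botocore import UNSIGNED
from botocore.config import Config
from litdata import StreamingDataset

# Unsigned (anonymous) requests: no AWS credentials are needed for a public bucket.
storage_options = {
    "config": Config(
        retries={"max_attempts": 1000, "mode": "adaptive"},
        signature_version=UNSIGNED,
    )
}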

dataset = StreamingDataset(
    input_dir="s3://pl-flash-data/optimized_tiny_imagenet",
    storage_options=storage_options,
)

print("Number of samples in the dataset:", len(dataset))
print("First sample:", dataset[0])

This works smoothly for public S3 buckets.
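
If streaming still prompts for an access key, it may help to first confirm that the bucket really is readable anonymously. Here is a minimal sketch using plain boto3, independent of LitData. `split_s3_url` and `list_public_keys` are hypothetical helper names, and the check assumes the bucket allows anonymous listing, which not every public bucket does.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/prefix' into (bucket, prefix)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_public_keys(url, max_keys=5):
    """List a few object keys under `url` using unsigned requests (needs network)."""
    import boto3  # lazy import: only needed when the check is actually run
    from botocore import UNSIGNED
    from botocore.config import Config

    bucket, prefix = split_s3_url(url)
    client = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    response = client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=max_keys)
    # "Contents" is absent from the response when nothing matches the prefix.
    return [obj["Key"] for obj in response.get("Contents", [])]
```

Calling `list_public_keys("s3://pl-flash-data/optimized_tiny_imagenet")` should return a handful of keys if anonymous listing is permitted. An access-denied error here points at the bucket policy rather than at LitData.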

bhimrazy avatar Jun 15 '25 13:06 bhimrazy