
Use public direct links for objects on S3 and Azure

Open shcheklein opened this issue 1 year ago • 2 comments

Follow-up to https://github.com/iterative/datachain/pull/755

This is less critical to implement, since it affects only public (no-credentials) buckets and Studio teams. It already works for Google Storage, since @dreadatour fixed it a while ago.

This concerns the client.url() code for public S3 and Azure buckets. Similar to GS, which already has a check for anon in it, we need to generate and return a direct URL to the cloud storage.

Make sure along the way:

  • Endpoint URLs are supported, especially for AWS
  • On the Studio side, pass the ms header to the signed URL so that the resulting public URL actually works (see some SO discussions)
  • Add tests
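
For illustration, here is a minimal sketch of what "generate and return a direct URL" could look like for anonymous access. The helper names (public_s3_url, public_azure_url) are hypothetical and are not part of datachain. The path-style layout for custom endpoints is an assumption, since S3-compatible services differ in how they address buckets.

```python
from typing import Optional
from urllib.parse import quote


def public_s3_url(bucket: str, key: str, endpoint_url: Optional[str] = None) -> str:
    """Hypothetical helper: direct URL for an object in a public S3 bucket."""
    path = quote(key, safe="/")  # percent-encode the key, keep "/" separators
    if endpoint_url:
        # Assumption: custom endpoints are addressed path-style,
        # i.e. <endpoint>/<bucket>/<key>.
        return f"{endpoint_url.rstrip('/')}/{bucket}/{path}"
    # Default AWS virtual-hosted-style URL.
    return f"https://{bucket}.s3.amazonaws.com/{path}"


def public_azure_url(account: str, container: str, blob: str) -> str:
    """Hypothetical helper: direct URL for a blob in a public Azure container."""
    path = quote(blob, safe="/")
    return f"https://{account}.blob.core.windows.net/{container}/{path}"
```

A test for the endpoint case could then be as small as checking that public_s3_url("bucket", "a/b.txt", "http://localhost:9000/") yields a URL rooted at the custom endpoint rather than at amazonaws.com.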

shcheklein avatar Dec 29 '24 16:12 shcheklein

Quick note: I have checked AWS S3, and it returns a public URL out of the box if no credentials are found:

In [1]: from datachain.catalog import get_catalog

In [2]: catalog = get_catalog()

In [3]: catalog.signed_url('s3://fast-ai-nlp', 'ag_news_csv.tgz')
Out[3]: 'https://fast-ai-nlp.s3.amazonaws.com/ag_news_csv.tgz'

This URL actually works: https://fast-ai-nlp.s3.amazonaws.com/ag_news_csv.tgz

We still need to check all possible options for S3 and Azure.

dreadatour avatar Dec 30 '24 14:12 dreadatour

One thing to check for S3: whether it works for versioned files (when you pass version_id).
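
As a rough sketch of what to verify: S3 selects a specific object version through the versionId query parameter, so a direct URL for a versioned file would need that parameter appended. The helper below (with_version_id, a hypothetical name) only builds the URL. Whether an anonymous request for a specific version succeeds depends on the bucket policy, which this sketch does not check.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_version_id(url: str, version_id: Optional[str]) -> str:
    """Hypothetical helper: append S3's versionId query parameter to a direct URL."""
    if not version_id:
        return url  # unversioned access: leave the URL untouched
    parts = urlsplit(url)
    # Preserve any existing query parameters and add versionId at the end.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("versionId", version_id))
    return urlunsplit(parts._replace(query=urlencode(query)))
```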

shcheklein avatar Dec 31 '24 00:12 shcheklein