[WIP] Support remote storage (e.g. Amazon S3) in DALI

Open xutianming opened this issue 5 years ago • 19 comments

Hi all,

I have been using DALI with PyTorch recently and am impressed by its excellent performance. Currently, all of my training data has to be placed on local SSD storage to use DALI. But in production, we have to read from remote storage, such as Amazon S3. I want to extend DALI to support remote filesystems. Do you think it's possible? I hope to get some early advice from the developers of DALI.

xutianming avatar Sep 20 '19 03:09 xutianming

Hi, thanks for the question. If you are willing to contribute we would be happy to help.

For now, we don't see any technical reason for this to be impossible. Although some of our readers use random access to images (e.g. FileReader), it's not necessary. It's mostly about shuffling patterns after all.

I'm not very familiar with the Amazon platform. Let me reach out to some colleagues and get back to you.
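
To illustrate the shuffling point: a source that can only be read sequentially can still be shuffled approximately with a bounded in-memory buffer. The sketch below is only an illustration under that assumption, not DALI's implementation; `shuffled_stream` and `buffer_size` are hypothetical names.

```python
import random


def shuffled_stream(samples, buffer_size, seed=0):
    """Yield items from a sequential iterable in approximately random order.

    Only `buffer_size` items are held in memory, so the source never needs
    random access; a larger buffer gives a better shuffle.
    """
    rng = random.Random(seed)
    buffer = []
    for sample in samples:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(sample)
            continue
        # Emit a random buffered item and put the new one in its slot.
        i = rng.randrange(buffer_size)
        yield buffer[i]
        buffer[i] = sample
    # Drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer
```

For example, `list(shuffled_stream(range(1000), buffer_size=64))` returns every index exactly once, in an order that depends on the seed.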

awolant avatar Sep 20 '19 17:09 awolant

@xutianming We discussed this and we do not see any other technical issues, so it's definitely a doable feature. We are very interested in adding it, if you are willing to contribute. It might be useful for many other users.

As you mentioned, the way to approach this would be to write something like RemoteFilesystemReader and have Amazon S3 as one of the possible use cases (GCS being another). Could you share some more details about your use case and how you plan to implement this?

@jantonguirao also raised a question about mounting Amazon S3 as storage and using our existing readers. Is that not enough for you for some reason?

awolant avatar Sep 25 '19 13:09 awolant

S3-mount is a good choice with almost no extra development. However, we maintain multi-tenant clusters. PyTorch workers run in Docker containers and share machines with each other. We found that FUSE-based solutions might get stuck in the Linux kernel in this scenario.

Currently, I am planning to extend FileStream to support something like RemoteFileStream. I am working on a proof-of-concept demo.

xutianming avatar Sep 26 '19 07:09 xutianming

@xutianming were you able to make it work?

oleksandrlazariev avatar Nov 27 '19 13:11 oleksandrlazariev

Yes, it works and performs well if you have enough bandwidth between your training cluster and the remote storage.

xutianming avatar Nov 29 '19 05:11 xutianming

@xutianming how did you do it? It seems DALI still doesn't support it.

oleksandrlazariev avatar Nov 29 '19 09:11 oleksandrlazariev

I extended DALI myself. I plan to make a pull request later.

xutianming avatar Nov 29 '19 09:11 xutianming

@xutianming I'm sure it would be great for the whole DALI community!

oleksandrlazariev avatar Nov 29 '19 09:11 oleksandrlazariev

@xutianming Did you open the PR?

goswamig avatar Jun 08 '20 19:06 goswamig

Sorry for the delay. It's still a work in progress.

xutianming avatar Jun 09 '20 02:06 xutianming

so looking forward to this feature!

zw0610 avatar Aug 05 '21 08:08 zw0610

Hi. Is there any news on this feature?

dianaTanasa avatar Mar 07 '22 15:03 dianaTanasa

Hi @dianaTanasa,

I'm sorry, but there is none from the DALI team's side.

JanuszL avatar Mar 07 '22 16:03 JanuszL

Is there any ETA on this? Desperately need this.

vivekpayasi avatar Sep 10 '22 09:09 vivekpayasi

Hi All,

You can try the external_source operator and check whether data loading implemented that way performs well enough (in particular, see the parallel parameter for the best performance).
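
A rough sketch of that approach, assuming boto3 for the S3 access and a precomputed list of object keys (neither comes from this thread). `S3Fetcher`, `RemoteSampleSource` and `build_pipeline` are hypothetical names, the worker counts are placeholders, and the code has not been benchmarked. Treat it as a starting point rather than a reference implementation.

```python
import numpy as np


class S3Fetcher:
    """Downloads one object per call. The boto3 client is created lazily so
    the instance stays picklable when it is sent to worker processes."""

    def __init__(self, bucket):
        self.bucket = bucket
        self._client = None

    def __call__(self, key):
        if self._client is None:
            import boto3  # third-party; imported inside the worker process
            self._client = boto3.client("s3")
        response = self._client.get_object(Bucket=self.bucket, Key=key)
        return response["Body"].read()


class RemoteSampleSource:
    """Per-sample callable for external_source(batch=False, parallel=True)."""

    def __init__(self, keys, fetch_bytes, shard_id=0, num_shards=1):
        # Each shard (e.g. one per GPU) reads a disjoint slice of the keys.
        self.keys = list(keys)[shard_id::num_shards]
        self.fetch_bytes = fetch_bytes

    def __call__(self, sample_info):
        idx = sample_info.idx_in_epoch
        if idx >= len(self.keys):
            raise StopIteration  # signals the end of the epoch
        # Return the still-encoded file as a 1-D uint8 array.
        return np.frombuffer(self.fetch_bytes(self.keys[idx]), dtype=np.uint8)


def build_pipeline(source, batch_size=32, py_num_workers=4):
    # DALI is imported here so the classes above can be used without it.
    from nvidia.dali import fn, pipeline_def, types

    @pipeline_def(batch_size=batch_size, num_threads=4, device_id=0,
                  py_num_workers=py_num_workers, py_start_method="spawn")
    def pipe():
        encoded = fn.external_source(source=source, batch=False,
                                     parallel=True, dtype=types.UINT8)
        return fn.decoders.image(encoded, device="mixed")

    return pipe()
```

Usage would look like `build_pipeline(RemoteSampleSource(keys, S3Fetcher("my-bucket")))`, where `my-bucket` is a placeholder. With the spawn start method the source has to be picklable, so the classes should live in an importable module. Whether this is fast enough depends on the bandwidth to the storage and on py_num_workers.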

JanuszL avatar Sep 11 '22 20:09 JanuszL