
Ability to delete local cloudpath cache after upload

Open lazear opened this issue 2 years ago • 2 comments

I found myself downloading a large amount of data from PRIDE (PXD004452) on a small EC2 instance (64 GB disk space) with the goal of transferring the data directly to an S3 bucket (I have done this several times, I ❤️ ppx). I have always just started a small instance with minimal disk space, because I figured that since I was transferring directly to S3 it wouldn't matter... This is not the case though! I ran out of disk space due to cloudpathlib's local caching.

If I delete the files in the /tmp directory, I can free up space and try to resume the download, but when I restart it, the completed raw files are re-synced back to the /tmp directory. I think there should be a way (based on the issues linked below) to manually delete the locally cached file after upload, though I'm not sure how that would work for a restarted download. I can take a stab at this if it's something you feel could be supported in ppx. This is probably too specialized to be upstreamed to cloudpathlib. I would say raw files downloaded from PRIDE etc. are immutable, so we don't need to worry about syncing changes from local to cloud, only whether the file is synced between cloud storage and the repository.

https://cloudpathlib.drivendata.org/stable/caching/

https://github.com/drivendataorg/cloudpathlib/issues/233 https://github.com/drivendataorg/cloudpathlib/issues/153

lazear, Nov 15 '22 03:11

Hmm, I can also resolve the problem by just performing downloads in smaller chunks - so perhaps low priority, since this is probably an unusual use case.
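For anyone hitting the same limit, the chunked workaround could look roughly like the sketch below. It is not ppx or cloudpathlib API: `fetch` and `upload` are hypothetical callables you would supply yourself (for example, one that downloads a single raw file into a local directory and one that pushes a local file to S3), and `batched` and `transfer_in_batches` are illustrative names. The idea is only that each batch lives in its own scratch directory, which is deleted before the next batch starts, so local disk usage stays bounded by the batch size.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def transfer_in_batches(
    filenames: Iterable[str],
    fetch: Callable[[str, Path], Path],  # hypothetical: download `name` into a dir, return its path
    upload: Callable[[Path], None],      # hypothetical: push one local file to cloud storage
    batch_size: int = 5,
) -> int:
    """Fetch, upload, and discard files one batch at a time.

    Returns the number of files transferred.
    """
    transferred = 0
    for batch in batched(filenames, batch_size):
        # Fresh scratch directory per batch; it is removed when the
        # `with` block exits, even if fetch or upload raises.
        with tempfile.TemporaryDirectory() as scratch:
            for name in batch:
                local_path = fetch(name, Path(scratch))
                upload(local_path)
                transferred += 1
    return transferred
```

With `batch_size=1` this reduces to "download one file, upload it, delete it", which keeps peak disk usage at roughly the size of the largest raw file.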

lazear, Nov 15 '22 04:11

Interesting - I'll have to look into this. Thanks for bringing this to my attention!

wfondrie, Nov 16 '22 00:11