ppx
Ability to delete local cloudpath cache after upload
I found myself downloading a large amount of data from PRIDE (PXD004452) on a small EC2 instance (64 GB disk space) with the goal of directly transferring the data to an S3 bucket (I have done this several times, I ❤️ ppx). I have always just started a small instance with minimal disk space, because I figured that since I was transferring directly to S3 it wouldn't matter... This is not the case though! I ran out of disk space due to cloudpathlib's local caching.
If I delete the files in the /tmp directory, I can free up space and try to resume the download, but when I restart it, the completed raw files are re-synced back to the /tmp directory. I think there should be a way (based on the issues linked below) to manually delete the locally cached file after upload, though I'm not sure how that would work for a restarted download. I can take a stab at this if it's something you feel could be supported in ppx. This is probably too specialized to be upstreamed to cloudpathlib: I would say raw files downloaded from PRIDE etc. are immutable, so we don't need to worry about syncing changes from local to cloud, just whether the file is synced between cloud storage and the repository.
https://cloudpathlib.drivendata.org/stable/caching/
https://github.com/drivendataorg/cloudpathlib/issues/233 https://github.com/drivendataorg/cloudpathlib/issues/153
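To make the idea concrete, here is a rough sketch of the behavior I have in mind, in plain Python. It does not use any ppx or cloudpathlib calls: `fetch`, `upload`, and `already_uploaded` are hypothetical stand-ins for "repository to local disk", "local disk to bucket", and "is this file already in the bucket", and the function name is made up for illustration.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def transfer_one_at_a_time(
    names: Iterable[str],
    fetch: Callable[[str, Path], None],
    upload: Callable[[Path, str], None],
    scratch_dir: Path,
    already_uploaded: Callable[[str], bool] = lambda name: False,
) -> List[str]:
    """Fetch each file to local scratch, upload it, then delete the local copy.

    All callables are hypothetical placeholders, not ppx or cloudpathlib APIs.
    Returns the names that were transferred during this call.
    """
    transferred = []
    for name in names:
        # On a restarted run, skip files that are already in the bucket
        # instead of pulling them back to local disk.
        if already_uploaded(name):
            continue
        local = scratch_dir / name
        try:
            fetch(name, local)   # repository -> local scratch
            upload(local, name)  # local scratch -> bucket
            transferred.append(name)
        finally:
            # Always drop the local copy, so local disk use stays bounded
            # by roughly one file at a time.
            local.unlink(missing_ok=True)
    return transferred
```

Since the raw files are treated as immutable, the sketch only checks whether a file exists on the bucket side and never tries to sync local changes back.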
Hmm, I can also resolve the problem by just performing downloads in smaller chunks - so perhaps low priority, since this is probably an unusual use case.
Interesting - I'll have to look into this. Thanks for bringing this to my attention!