datahub
New policies for storage archival process
Let's iterate on this proposal through this PR which is a follow-up to #3377 !
@yuvipanda Incorporated your feedback and pushed an update to the policy doc. Do review and merge the changes if it makes sense!
@balajialg Thank you for making some changes! I'm still struggling to understand the 'what is the problem we are trying to solve here?' question I framed in https://github.com/berkeley-dsep-infra/datahub/pull/3384#discussion_r876753925. And I'm not entirely sure what part of the changes addresses that. Our policies should look radically different based on what it is that we are trying to solve, so I'd love to frame our conversation around that.
@yuvipanda - Let me know if I am coming across clearly with the purpose of this policy proposal. The purpose is to build transparency about our storage policy and process with all users who use Datahub. "All users" is the key phrase here. I assume our goal at the start of the process (and probably the problem we as the infrastructure team want to solve for ourselves) is to revisit the storage policy from first principles, with the objective of making it more user-centric and reducing the effort and cloud costs involved (if possible). I know that this is a broad statement with multiple objectives. I see this proposal as us documenting our exploration of the multiple policy options and finalizing the policy pathway forward, for users and for our future reference.
To build transparency with our users, it is important to articulate what our policy is, store it in a place that is accessible to them, and communicate this policy change at different stages of their engagement with Datahub: a) when they first log in to Datahub, b) if and when their storage needs exceed the threshold limit set, and c) when their data is about to get archived. From a user perspective, this proposal seeks to be the single source of truth with regard to our finalized storage policy. This policy should go hand in hand with the communication proposal you outlined as part of PR #3388.
Given this context, let me know if you have input on how I can reframe the question below based on the rationale outlined above (supposing that the rationale makes sense from your lens).
- The policy proposal seeks to address the question "What is our policy proposal that needs to get transparently communicated when users store more than the archival threshold (~100 GB of data) in their home directories?"
- The policy proposal seeks to address the question "What is our policy proposal that needs to get transparently communicated to our users so that they can understand how we handle their data from the time they log in to the time when their data gets archived?"
@yuvipanda questions whether we are solving any problem by having a policy proposal for the 100 GB storage threshold, considering that John highlighted that cloud costs are not a big concern at this juncture and that initiatives like #3389 would bring down cloud costs over a longer duration. His suggestion is to focus on the communication of storage policies instead of adding more policy guardrails regarding storage!
@yuvipanda note that it's not just cloud storage we're concerned with here.
About half of the "compute engine" costs are for the persistent disks which are a concern for this policy.
Folks, it would be great if we could iterate on this proposal and finalize our policy by the end of next week.
@balajialg @felder how about we automate running the archiver so it runs every week, and then for people with >100GB, we archive on 3 months of inactivity? That should help take it off the more expensive POSIX storage.
As for deletion, I'd say we can do something like 'your files will be deleted 18 months after they are archived' or something of that sort, and enforce that consistently, along with the automated messages mentioned in #3388 so users are aware. I don't want us to delete user directories automatically because they exceeded some threshold, and not archiving them because they're big actually costs us more money.
So if the goal is to save on the more expensive on-disk storage, I propose that we run the archiver continuously (I'll have to redesign it slightly, but it's doable), and if your homedir is >100GB your cutoff is 3 months rather than 6. How is that?
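To make the rule proposed above concrete, here is a minimal Python sketch of the decision logic only. It is not the actual archiver in this repo: the `HomeDir` record, the `next_action` function, and the action names are hypothetical, and months are approximated as 30 days (so 3, 6, and 18 months become 90, 180, and 540 days).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

GB = 10**9

# Thresholds taken from the discussion; 30-day months are an assumption.
LARGE_HOMEDIR_BYTES = 100 * GB
LARGE_HOMEDIR_CUTOFF = timedelta(days=90)    # "3 months" for homedirs > 100 GB
DEFAULT_CUTOFF = timedelta(days=180)         # "6 months" for everyone else
DELETE_AFTER_ARCHIVE = timedelta(days=540)   # "18 months after they are archived"


@dataclass
class HomeDir:
    """Hypothetical record describing one user's home directory."""
    user: str
    size_bytes: int
    last_active: datetime
    archived_at: Optional[datetime] = None


def inactivity_cutoff(home: HomeDir) -> timedelta:
    """Return the inactivity period after which this homedir gets archived."""
    if home.size_bytes > LARGE_HOMEDIR_BYTES:
        return LARGE_HOMEDIR_CUTOFF
    return DEFAULT_CUTOFF


def next_action(home: HomeDir, now: datetime) -> str:
    """Decide what a periodic archiver run would do with this homedir."""
    if home.archived_at is not None:
        # Already archived: only the age of the archive matters.
        if now - home.archived_at >= DELETE_AFTER_ARCHIVE:
            return "delete_archive"
        return "keep_archive"
    if now - home.last_active >= inactivity_cutoff(home):
        return "archive"
    return "keep"
```

Note that size never triggers deletion here; it only shortens the inactivity cutoff, which matches the intent of not deleting user directories just because they exceeded a threshold.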
@yuvipanda I'm definitely open to the idea of running the archiver continuously. However yeah we'd need to consider that carefully. For instance I'd like to at some point have archived directories (on disk) get removed. As things stand now, removing a directory from disk also removes the ability for the owner of that data to know where it went. I figure we could either remove the data from archival storage at the same time (which implies after at least 12 months) or provide another method of retrieval if we want to keep the data in perpetuity.
Cool, I included the policy suggestion to archive home directories larger than 100 GB after a 90-day cut-off as part of the proposal. I assume the operational details @felder talked about are not within the scope of this policy proposal and should instead be discussed in GitHub issues. We can revisit this proposal if there is any updated information. Can one of you merge this PR if this seems like a reasonable policy proposal?
@felder's 3 key reasons why defining this policy is extremely important NOW:
- [ ] Service Management headaches around growing and shrinking storage (~Jon's time)
- [ ] Cloud Costs (Half of the costs are storage related)
- [ ] Handling boundary cases where students store large files.
Thanks for the response, @balajialg. I agree these are all important problems to solve. I personally don't think we should be writing actual policy that treats users differently based on their home directory storage before they're even aware of current policies. My suggestion is that we try to get #3388 implemented this coming semester, and see how that goes - and table this particular policy until the next semester.
@yuvipanda Seems reasonable to me. I will let @felder take the final call on this as he will be most affected by this decision. @felder - What do you think about holding off this policy and reviewing this at the end of the semester (Let's assume that #2 is not a big headache during Fall 22)? I can postpone the scheduled meeting to sometime in December.
@balajialg seems reasonable.