New policies for storage archival process

Open balajialg opened this issue 3 years ago • 15 comments

Let's iterate on this proposal through this PR, which is a follow-up to #3377!

balajialg avatar May 13 '22 02:05 balajialg

@yuvipanda I incorporated your feedback and pushed an update to the policy doc. Please review and merge the changes if they make sense!

balajialg avatar May 20 '22 20:05 balajialg

@balajialg Thank you for making some changes! I'm still struggling to understand the 'what is the problem we are trying to solve here?' question I framed in https://github.com/berkeley-dsep-infra/datahub/pull/3384#discussion_r876753925. And I'm not entirely sure what part of the changes addresses that. Our policies should look radically different based on what it is that we are trying to solve, so I'd love to frame our conversation around that.

yuvipanda avatar May 23 '22 14:05 yuvipanda

@yuvipanda - Let me know if I am coming across clearly with the purpose of this policy proposal. The purpose is to build transparency about our storage policy and process with all users who use Datahub. "All users" is the key phrase here. I assume our goal (and probably the problem we as the infrastructure team want to solve for ourselves) at the start of the process is to revisit the storage policy from first principles, with the objective of making it more user-centric and reducing the effort and cloud costs involved (if possible). I know that this is a broad statement with multiple objectives. I see this proposal as us documenting our exploration of the multiple policy options and finalizing the policy pathway forward, for users and for our future reference.

Articulating what our policy is, storing it in a place that is accessible to our users, and communicating this policy change to them at different stages of their engagement with Datahub is important for building transparency with our users. Those stages are: a) when they first log in to Datahub, b) if and when their storage needs exceed the threshold limit set, and c) when their data is about to get archived. From a user perspective, this proposal seeks to be the single source of truth with regard to our finalized storage policy. This policy should go hand in hand with the communication proposal you outlined as part of PR #3388.

Given this context, let me know if you have input on how I can reframe the question below based on the rationale outlined above (supposing that the rationale makes sense from your lens).

The policy proposal seeks to address the question "What is our policy proposal that needs to get transparently communicated when users store more than the archival threshold (~100 GB of data) in their home directories?"

The policy proposal seeks to address the question "What is our policy proposal that needs to get transparently communicated to our users so that they can understand how we handle their data from the time they log in to the time when their data gets archived?"

balajialg avatar May 23 '22 15:05 balajialg

@yuvipanda questions whether we are solving any problem by having a policy proposal for the 100 GB storage threshold. Considering that John highlighted that cloud costs are not a big concern at this juncture, and that initiatives like #3389 would bring down cloud costs over a longer duration, his suggestion is to focus on the communication of storage policies instead of adding more policy guardrails regarding storage!

balajialg avatar May 26 '22 21:05 balajialg

@yuvipanda note that it's not just cloud storage we're concerned with here.

About half of the "compute engine" costs are for the persistent disks which are a concern for this policy.

felder avatar May 26 '22 22:05 felder

Folks, it would be great if we could iterate on this proposal and finalize our policy by the end of next week.

balajialg avatar May 28 '22 00:05 balajialg

@balajialg @felder how about we automate running the archiver so it runs every week, and then for people with >100GB, we archive on 3 months of inactivity? That should help take it off the more expensive POSIX storage.

yuvipanda avatar May 28 '22 05:05 yuvipanda

As for deletion, I'd say we can do something like 'your files will be deleted 18 months after they are archived' or something of that sort, and enforce that consistently, along with the automated messages mentioned in #3388 so users are aware. I don't want us to delete user directories automatically because they exceeded some threshold, and not archiving them because they're big actually costs us more money.
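
To make the 'deleted 18 months after they are archived' idea concrete, here is a minimal sketch of how such a date could be computed. The function names and the `RETENTION_MONTHS` constant are hypothetical, and the 18-month figure is only the example number floated above, not a decided policy.

```python
import calendar
from datetime import date

# Hypothetical constant: the "18 months" example from this thread, not a final policy.
RETENTION_MONTHS = 18


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    If the target month is shorter than the start day (e.g. Aug 31 -> Feb),
    the day is clamped to the last day of the target month.
    """
    total = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def deletion_date(archived_on: date) -> date:
    """Earliest date an archive could be deleted under the assumed retention."""
    return add_months(archived_on, RETENTION_MONTHS)
```

Computing the date once at archive time would also give the automated messages from #3388 a fixed value to show users.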

yuvipanda avatar May 28 '22 05:05 yuvipanda

So if the goal is to save on the more expensive on-disk storage, I propose that we run the archiver continuously (I'll have to redesign it slightly, but it's doable), and if your homedir is >100GB your cutoff is 3 months rather than 6. How is that?
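
A rough sketch of that rule, purely for illustration. The function name, the constants, and the approximation of 3 and 6 months as 90 and 180 days are assumptions, not the archiver's actual code or configuration.

```python
from datetime import datetime, timedelta

# Assumed values based on the numbers in this thread; not the archiver's real config.
SIZE_THRESHOLD_BYTES = 100 * 10**9          # "100GB", read here as decimal gigabytes
DEFAULT_CUTOFF = timedelta(days=180)        # roughly 6 months of inactivity
LARGE_HOMEDIR_CUTOFF = timedelta(days=90)   # roughly 3 months of inactivity


def should_archive(homedir_bytes: int, last_active: datetime, now: datetime) -> bool:
    """Return True if a home directory has been inactive past its cutoff.

    Home directories above the size threshold get the shorter cutoff;
    everyone else keeps the default one.
    """
    if homedir_bytes > SIZE_THRESHOLD_BYTES:
        cutoff = LARGE_HOMEDIR_CUTOFF
    else:
        cutoff = DEFAULT_CUTOFF
    return now - last_active >= cutoff
```

Note that under this sketch nothing is archived purely for being large: size only shortens the inactivity window.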

yuvipanda avatar May 30 '22 10:05 yuvipanda

@yuvipanda I'm definitely open to the idea of running the archiver continuously. However, we'd need to consider that carefully. For instance, at some point I'd like to have archived directories (on disk) get removed. As things stand now, removing a directory from disk also removes the ability for the owner of that data to know where it went. I figure we could either remove the data from archival storage at the same time (which implies after at least 12 months) or provide another method of retrieval if we want to keep the data in perpetuity.

felder avatar May 31 '22 19:05 felder

Cool, I included the policy suggestion to archive home directories larger than 100 GB after a 90-day cut-off as part of the proposal. I assume the operational details @felder talked about are not within the scope of this policy proposal and should be discussed in separate GitHub issues. We can revisit this proposal if there is any updated information. Can one of you merge this PR if this seems like a reasonable policy proposal?

balajialg avatar Jun 06 '22 18:06 balajialg

@felder's 3 key reasons why defining this policy is extremely important NOW:

  • [ ] Service Management headaches around growing and shrinking storage (~Jon's time)
  • [ ] Cloud Costs (Half of the costs are storage related)
  • [ ] Handling boundary cases where students store large files.

balajialg avatar Jul 07 '22 21:07 balajialg

Thanks for the response, @balajialg. I agree these are all important problems to solve. I personally don't think we should be writing actual policy that treats users differently based on their home directory storage before they're even aware of current policies. My suggestion is that we try to get #3388 implemented this coming semester, and see how that goes - and table this particular policy until the next semester.

yuvipanda avatar Jul 08 '22 02:07 yuvipanda

@yuvipanda Seems reasonable to me. I will let @felder make the final call on this as he will be most affected by this decision. @felder - What do you think about holding off on this policy and reviewing it at the end of the semester (let's assume that #2 is not a big headache during Fall 22)? I can postpone the scheduled meeting to sometime in December.

balajialg avatar Jul 08 '22 19:07 balajialg

@balajialg seems reasonable.

felder avatar Jul 13 '22 19:07 felder