
End-semester cleanup tasks [Clean up directories, upgrade image/packages and documentation]

Open yuvipanda opened this issue 4 years ago • 12 comments

At the end of each semester, we have had to do some cleanup (like with #2002) - but this hasn't happened in a structured way. For this semester, let's try to make it so!

Here's a bunch of tasks we should do at the end of each semester:

To Do

  • [x] Clean up unused users in the hub database, as this can cause performance issues (such as https://github.com/berkeley-dsep-infra/datahub/issues/2677). Script for this at https://github.com/berkeley-dsep-infra/datahub/blob/staging/scripts/delete-unused-users.py (Open an issue for automating this task)
  • [ ] @yuvipanda to run the script to remove unused directories of users
  • [ ] Move base ubuntu image to the latest version 22.04 (@yuvipanda/ @felder)
  • [ ] Upgrade R packages to their latest versions
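The first task above (cleaning up unused users) comes down to picking users by how long they have been idle. Below is a minimal sketch of that selection step only. It is not the linked `delete-unused-users.py` script. It assumes each user record is a dict with a `name` and an ISO 8601 `last_activity` value (or `None`), as in the JupyterHub user model; the function name `find_inactive_users`, the 180-day threshold, and the sample records are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_inactive_users(users, now, max_idle=timedelta(days=180)):
    """Return names of users idle for longer than max_idle.

    Assumption: a user with no recorded activity counts as inactive.
    """
    inactive = []
    for user in users:
        last = user.get("last_activity")
        if last is None:
            inactive.append(user["name"])
            continue
        # Normalize a trailing "Z" so datetime.fromisoformat accepts it.
        last_seen = datetime.fromisoformat(last.replace("Z", "+00:00"))
        if now - last_seen > max_idle:
            inactive.append(user["name"])
    return inactive


# Hypothetical sample records, for illustration only.
sample_users = [
    {"name": "alice", "last_activity": "2021-10-20T08:00:00Z"},
    {"name": "bob", "last_activity": "2021-01-15T08:00:00Z"},
    {"name": "carol", "last_activity": None},
]
now = datetime(2021, 10, 29, tzinfo=timezone.utc)
inactive_names = find_inactive_users(sample_users, now)
print(inactive_names)
```

Keeping the selection separate from the deletion makes it easy to review the candidate list before anything is removed.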

Definitely done

  • [x] Archive unused user home directories to claim back some space (#1633)
  • [x] Upgrade R and RStudio
  • [x] @balajialg to support drafting the updated archival policy which is public-facing
  • [x] @felder to upgrade the Otter package in the Public health hub to the latest version. @balajialg to communicate to the instructional team to rewrite the test cases.
  • [x] Upgrade to the latest package version of Python (@yuvipanda/ @felder)
  • [x] @felder to look at storage for hubs
  • [x] @balajialg Look at the python popularity dashboard + course enrollment data and bring insights to the group to make a decision related to package maintenance

yuvipanda avatar Oct 29 '21 07:10 yuvipanda

Also, how is the "end of each semester" defined? Guessing it is some number of days after finals, with sufficient time to prepare the hubs for the next semester. We should declare this to instructors so that they know how much time they have to deal with special cases, e.g. students who have been approved to submit late projects.

For context: Fall 2021's last day of finals is on 12/17 and Spring 2022 begins on 1/11. (25 days in between, modulo curtailment days) Spring 2022's last day of finals is on 5/13 and Summer Session A begins on 5/23. (10 days in between) Summer 2022's last day (no "finals week") is 8/12 and Fall 2022 begins on 8/17 (5 days in between).
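As a quick check of the gaps quoted above, here is a small sketch that recomputes them from the listed dates. The years are inferred from the semester names, and the labels and variable names are only for illustration.

```python
from datetime import date

# (label, last day of the ending term, first day of the next term)
BREAKS = [
    ("Fall 2021 -> Spring 2022", date(2021, 12, 17), date(2022, 1, 11)),
    ("Spring 2022 -> Summer Session A", date(2022, 5, 13), date(2022, 5, 23)),
    ("Summer 2022 -> Fall 2022", date(2022, 8, 12), date(2022, 8, 17)),
]

# Days between the end of one term and the start of the next.
gaps = {label: (start - end).days for label, end, start in BREAKS}

for label, days in gaps.items():
    print(f"{label}: {days} days")
```

This counts calendar days only; it does not account for curtailment days or weekends.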

I'm citing semester start date and not first day of classes, because we can't expect instructors to start testing their assignments on the first day of classes. We should expect images to be developed and libraries to be added in between the first day of the semester and the first day of classes.

ryanlovett avatar Oct 29 '21 17:10 ryanlovett

I also think packages for the previous semester's classes should be removed. Carrying forward dependencies that are one or two semesters old would be safer, but also cruftier. Unused packages can create dependency problems for newer ones, and increase build time. Maybe this should be resolved by communicating with whoever requested the additions, e.g. "we will remove your packages unless you still need them," or just broadly, "datahub is removing specialized libraries, please submit your requests for next semester."

ryanlovett avatar Oct 29 '21 18:10 ryanlovett

I agree with @ryanlovett

Also, we should consider a similar policy with regard to admin rights as well as other special requests such as additional resources. Ideally, each semester we should get confirmation that any deviations from what is provided by the base image are still required.

felder avatar Oct 29 '21 19:10 felder

With regard to bumping versions, I wonder if it wouldn't be wise to have another hub where we do this a semester in advance. This could also apply to bumping ubuntu as well (LTS versions change every 2 years...for that matter if we do this we may not want LTS).

If we could get instructors to test next semester's classes in some sort of staging hub during the current semester, maybe we could start the next semester off with fewer compatibility issues/package requests.

felder avatar Oct 29 '21 19:10 felder

Great points! For package requests, should we compile the specialized list of packages we plan to remove across each hub (both python/R) and send an update email to instructors/teaching team using datahub-infra email #2197 by the end of the semester? Then, possibly, give them a 2-3 week window to make requests to retain the required packages using the usual package request process?

balajialg avatar Oct 29 '21 20:10 balajialg

@balajialg I'm really not sure what the best approach is from a customer standpoint.

I feel 2-3 weeks may not be enough time. This approach would really come down to when we feel comfortable that the current semester's issues are largely resolved while also providing some time to stage major changes such as upgrading rstudio.

Then we have a feature freeze on a specific date and open it up for next semester's customers to test. Essentially having our customers do QA as well as get their requests in. I'd think a month would be better for that. For example Fall feature freeze in this context could be scheduled for early December with a scheduled rollout of the updated hub image in early January after Winter break.

felder avatar Oct 29 '21 20:10 felder

I agree that removing libraries would be killer. I just discovered for example that tensorflow doesn't even import on latest datahub, and could've been removed as nobody has so far complained about it! I think the python popularity contest setup will definitely help with removing libraries with more confidence. Would be great to do a purge at the end of this semester.

We will definitely need a way to communicate all this to various instructors tho. I'm rooting for https://github.com/berkeley-dsep-infra/datahub/issues/2855 - I don't think we currently have any way to 'broadcast' messages to our users.

The thing with package versions is that they move pretty fast - a 6 month old version of an actively maintained package can be pretty stale. So given we try to not bump versions during a semester, mass bumping them as close as possible (while still preventing surprises) is the way I'd like us to go. In the scientific python ecosystem at least, I don't think stability is something we can easily attain by moving slower - moving faster is a better bet. One entire semester's lead time is definitely too long I think.

While this is true in the general case, we do have a lot of special cases - RStudio being an important one. In practice, we (@ryanlovett really) are the primary developers of the JupyterHub RStudio integration, and we're one of the biggest users. So the only way for it to really progress is for us to deploy and find bugs. I do agree we should prioritize that and get new versions out as early as possible so they get better tested. mybinder.org is probably enough during the early times, and we can definitely deploy them to our hubs in a staggered fashion.

yuvipanda avatar Oct 29 '21 21:10 yuvipanda

@yuvipanda the idea would be to do the version locking close to the feature freeze date. So not a semester in advance, but maybe a month..ish... in advance.

That's not to say we couldn't do version bumps during the semester as well. I anticipate we will get and continue to service those types of requests. It'd be nice, however, for the bulk of the changes to be done prior to asking instructors to hop on and test.

felder avatar Oct 29 '21 21:10 felder

@felder Month-long lead time for faculty to make decisions sounds good to me!

@yuvipanda Answering the communication part without derailing this conversation: we can consider the newsletter as one of our outreach mechanisms.

However, having done outreach to faculty using newsletters in my previous role, I am slightly skeptical about this idea. I found engagement to be shallow, even in terms of opening the newsletter, as faculty get bombarded with so many newsletters in a week. Asking instructors to respond based on a newsletter can be a stretch. I will also seek Eric's input to decide the way forward.

balajialg avatar Oct 30 '21 00:10 balajialg

@ryanlovett @ericvd-ucb FYI, these are the end-of-semester tasks scoped for this sprint!

balajialg avatar May 05 '22 20:05 balajialg

Thanks @balajialg !

ryanlovett avatar May 05 '22 21:05 ryanlovett

Scoped for the month of August (between the end of Summer and the start of the Fall 2022 semester)

balajialg avatar Jul 09 '22 02:07 balajialg