open-vsx.org icon indicating copy to clipboard operation
open-vsx.org copied to clipboard

Procedure for the clean up of the database

Open chills-eclipse opened this issue 1 year ago • 18 comments

Hi John. We need a procedure on how to clean up the database from time to time to prevent the size to become uncontrollable. Can you please provide us the steps on what we can clean etc. Thanks

chills-eclipse avatar Aug 14 '24 20:08 chills-eclipse

See also: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/issues/4157#note_1703626

denisroy avatar Aug 14 '24 20:08 denisroy

@chills-eclipse Yep, something we need to get back to. @amvanbaren Thoughts?

kineticsquid avatar Aug 15 '24 13:08 kineticsquid

@kineticsquid You want to move forward on https://github.com/eclipse/openvsx/pull/888?

amvanbaren avatar Aug 21 '24 09:08 amvanbaren

@amvanbaren MS doesn't limit extension versions. I'd like to see what other options we have.

The biggest offender is the file table. And there appear to be two API calls that could reference the table,

  • /api/.../logo/...
  • /api/.../file/...

Are there other API calls that reference the table?

I'm wondering if then consider where these API calls are coming from, we might be able to limit the size of the able. Where are calls to this API originating?

  • open-vsx.org UI
  • IDEs like Theia
  • npx ovsx commands like get

In this Gitlab issue, https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/issues/4797, @denisroy provided a dump of the access logs and I took a look at the use of the API

kineticsquid avatar Aug 21 '24 18:08 kineticsquid

@amvanbaren Now that we've completed https://github.com/eclipse/openvsx/pull/888, I'm thinking we need a script to scrub and reduce the size of the file resources table. And, for safety's sake, should probably take a backup first. And given that, I'm thinking we need a read only mode, or at least a mode that disables publishing while we work on the table. Thoughts?

kineticsquid avatar Jan 31 '25 14:01 kineticsquid

Yes, now that https://github.com/eclipse/openvsx/pull/1045 has been completed we can remove resource files from the file_resource table. This can be done through a migration similar to https://github.com/eclipse/openvsx/blob/master/server/src/main/java/org/eclipse/openvsx/migration/SetPreReleaseJobRequestHandler.java

amvanbaren avatar Feb 03 '25 11:02 amvanbaren

@amvanbaren I see the code, but I'm not sure I understand how the pieces fit together. Are you saying that we run this for each extension? If so, does that remove the need for R/O mode or to pause publication?

kineticsquid avatar Feb 03 '25 16:02 kineticsquid

@tfroment FYI

kineticsquid avatar Feb 03 '25 16:02 kineticsquid

@kineticsquid It's an example of a migration job that runs in the background. It is for when extra processing is needed that can't be done in a SQL migration script. We can run a similar migration job for each FileResource that is of type resource where it first deletes the file in storage and then deletes the FileResource entity in the database.

Yes, there's no need for a read-only mode or to pause publication.

amvanbaren avatar Feb 03 '25 21:02 amvanbaren

@amvanbaren Understood. Do we need to first change how we process published extensions to not create entries for all of these files, or have we already made that improvement?

kineticsquid avatar Feb 04 '25 15:02 kineticsquid

@kineticsquid In https://github.com/eclipse/openvsx/pull/1100 resources are no longer processed and existing resources are deleted.

amvanbaren avatar Feb 06 '25 10:02 amvanbaren

@amvanbaren Got it. Do we have any more work to do on this one then after https://github.com/eclipse/openvsx/pull/1100 is deployed?

kineticsquid avatar Feb 06 '25 15:02 kineticsquid

Hi @amvanbaren any update on this? should we close?

tfroment avatar Mar 26 '25 10:03 tfroment

The file_resource table should now be way smaller (migration has completed). The migration_item table can be cleaned up, but it's better to do that through a db migration.

It looks like failed JobRunr jobs are not automatically deleted from the jobrunr_jobs table. There are configuration properties for org.jobrunr.background-job-server.delete-succeeded-jobs-after and org.jobrunr.background-job-server.permanently-delete-deleted-jobs-after, but nothing for failed jobs. @chills-eclipse you can check by running SELECT id, state, createdat, scheduledat FROM jobrunr_jobs ORDER by createdat LIMIT 100;

amvanbaren avatar Mar 26 '25 12:03 amvanbaren

@amvanbaren do you want the output here in the comments?

chills-eclipse avatar Mar 27 '25 11:03 chills-eclipse

@chills-eclipse No, that query was for your convenience.

amvanbaren avatar Mar 31 '25 07:03 amvanbaren