
Delta updates plan

AMDmi3 opened this issue 5 years ago • 5 comments

Delta updates have been planned for years, but are still not implemented, as they have proven to be too huge a task to do in one go. We need a new plan to implement them gradually, in smaller steps. Here it is:

  • [x] Move database update logic from database_update in repology_update.py into a dedicated module
  • [x] Split the update_finish query into a number of smaller subtasks. This would also make them easier to profile
  • [x] Implement package checksums (a minimal hashing sketch follows this list)
  • [ ] Revisit package deduplication to never allow multiple packages with equal checksums
    • [x] Add a check against packages with duplicate checksums (when uploading a single packageset)
  • [x] Add tracking of packageset checksums in the database
  • [x] Compare packageset checksums against the database and cease touching unchanged packages. This should yield the first performance and disk space gains
  • Start converting derived object generation to deltas (2 top-priority items first, then sorted by decreasing update time impact)
    • [x] Problems. ~This too should yield performance gains, as it would eliminate the need to call PackageFormatter from the problems view~ (not doing it here)
      • ~Update the view in webapp~ (not doing it here)
    • [x] Events/history
      • [x] Allows eliminating the maintainer_repo_metapackages table and the SQL code for its update, and dropping version information from the projects/metapackages table
    • [x] Binding tables for search
    • [x] URL relations
    • [x] Projects (metapackages)
    • [x] Maintainers
    • [x] Links
    • [x] Repositories
    • [x] Global statistics (partial; the important part is avoiding a scan of the packages table to get the number of packages)
    • [x] Redirects
  • Start storing huge Package fields externally
    • [ ] Maintainers. Store an array of maintainer IDs instead of an array of maintainer names
    • [x] Homepages/Downloads. The same (IDs need to be added to the links table first)
    • This also allows adding more items to Package without storing them in the database, possibly using them in other ways
      • [ ] Store warnings from PackageTransformer in the Package and convert them into problems (but do not store them in the DB)
      • ~Store name changes (#815) the same way~
  • ~Add foreign key constraints to the database to prevent consistency problems~ no need to do this here, or probably at all
  • [ ] Add a way to force hash invalidation (e.g. a way to affect the hashing seed) so all derived objects can be recalculated if requested
  • [x] Reuse remaining big queries for validation (and repair) of aggregate data produced by delta updates
    • [x] Repositories
    • [x] Maintainers
    • [x] Global statistics
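
The checksum items above are central to the whole plan, so here is a minimal sketch of the idea, assuming JSON-based hashing as mentioned in a later comment below (function names and field handling are hypothetical, not the actual repology-updater code):

```python
import hashlib
import json


def package_hash(package_fields, seed=''):
    # Deterministic serialization (sorted keys, no whitespace) makes the
    # hash stable across runs; the seed allows forced invalidation so
    # derived objects can be recalculated on request.
    payload = json.dumps(package_fields, sort_keys=True, separators=(',', ':'))
    return hashlib.sha256((seed + payload).encode('utf-8')).hexdigest()


def packageset_hash(hashes):
    # Sorting makes the packageset hash independent of package order,
    # and duplicate hashes within one packageset can be rejected here.
    if len(set(hashes)) != len(hashes):
        raise ValueError('duplicate packages in packageset')
    return hashlib.sha256(''.join(sorted(hashes)).encode('utf-8')).hexdigest()
```

Comparing the stored packageset hash against a freshly computed one is what lets the updater skip unchanged projects entirely, and the seed parameter is one possible way to force hash invalidation on request.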

AMDmi3 · Dec 16 '19

Further development that this change would unlock, even with partial implementation:

  • More interesting freshness metrics #62
  • Multiple categories support #76
  • Better packagelinks/homepages handling #119, #195, #476, #722, #794, #948, #950
  • Problems improvements #196
  • Storage optimization #360, #549
  • Track packages moving between projects #527
  • Better history support - quantity and consistency #546, #612
  • Simplify update by removing post-update phase #610
  • Complex package transformations/merging #637
  • Complete support for name redirects #815, #980
  • Dependency tracking
  • Patch tracking

AMDmi3 · Dec 17 '19

TODO: decide on handling circular data dependencies and asynchronous updates. For instance, links:

  • since we no longer have reliable last_seen times, there is no way to know how long a link has not been seen. The solution is either to mark used links with a bulk query once a week or so, or to keep a reference count
  • problems for links 'dead for a month' are created and removed asynchronously on link checks (when a link's status is updated) in addition to project updates (when links change). The former case can probably be handled with a bulk query once every few days too (a sketch of the bulk-query option follows this list)
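
For illustration, the bulk-query option could look roughly like this; it assumes packages carry an array of link IDs and the links table gains a last_referenced column, both of which are hypothetical at this point:

```python
import psycopg2


def mark_referenced_links(dsn):
    # Weekly maintenance pass: touch every link still referenced by at
    # least one package, so 'dead for a month' logic can compare against
    # last_referenced instead of relying on per-update last_seen times.
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("""
                UPDATE links
                SET last_referenced = now()
                WHERE id IN (SELECT DISTINCT unnest(link_ids) FROM packages)
            """)
```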

Other cases include:

  • related project flags
  • number of maintainers and problems for a repository
  • repositories for maintainer

AMDmi3 · Dec 17 '19

For the record, the first phase of delta updates is currently being deployed. The development resulted in major optimizations in other places, namely in the repoproc deserializer, the repositories table update (a bad SQL execution plan led to 40x overhead, 25% of overall database update time), and the repo_metapackages table update (thrashing due to inserting unordered items led to excess I/O and several extra minutes per update).

Unfortunately, some inevitable pessimizations were introduced too, which affect update time when everything is updated (such as the first update on an empty database). Extra overhead is about 20%: +2 minutes for hashing (the current package hash implementation, which involves JSON, may be improved) and +2 minutes for extra database queries. Of course, these are outweighed by the partial update performance improvement (~2.5x currently).

AMDmi3 · Jan 10 '20

Status update: the first phase of delta updates brought the update cycle to under 50 minutes (most of the time is spent in fetch and parse).

The next step is performance testing of the update process, which has to fetch the previous state of packages from the database; there seems to be no way to avoid this if we want to generate history properly. If it turns out to be too slow, stored procedures may be investigated.
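
Roughly, the previous-state fetch and diff amounts to something like this (a sketch only; effname follows the repology schema, everything else is illustrative):

```python
def project_version_events(cur, effname, new_versions):
    # Fetch the previous state of the project's packages and diff it
    # against the incoming packageset to produce history events.
    cur.execute('SELECT version FROM packages WHERE effname = %s', (effname,))
    old_versions = {row[0] for row in cur.fetchall()}
    return {
        'versions_added': sorted(new_versions - old_versions),
        'versions_removed': sorted(old_versions - new_versions),
    }
```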

After that, we'll need to convert derived table updates without introducing a huge pessimization for the case when most packages change. There are two distinct kinds of such derived tables:

  • tables indexed by project name (metapackages, *_metapackages)
  • tables indexed by orthogonal objects (maintainers, links)

The former do not impose any limitations and may be updated in parallel with packages, at the same granularity. The latter lead to write amplification and n² patterns when done naively, e.g. when many project updates modify a single maintainer/repository/link.

The most important case is when maintainer IDs need to be known before updating binding tables or the packages themselves (e.g. when we switch to storing maintainer/link IDs instead of verbatim text).

The solution here is to cache data in memory and avoid updating referenced objects multiple times. That is, store the maintainer-to-id mapping in memory and use it to look up IDs and decide whether maintainers need to be created. If it grows too big for memory, we can flush it periodically (completely, or better, only the least recently used entries).
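
A sketch of such a cache, assuming a maintainers table with a unique name column (the class and its LRU flushing are illustrative, not actual repology-updater code):

```python
from collections import OrderedDict


class MaintainerIdCache:
    # Maps maintainer names to database IDs so each maintainer is
    # looked up or created at most once, avoiding repeated writes.

    def __init__(self, cur, max_size=1000000):
        self._cur = cur
        self._cache = OrderedDict()
        self._max_size = max_size

    def get_id(self, name):
        if name in self._cache:
            self._cache.move_to_end(name)  # mark as recently used
            return self._cache[name]
        # Get-or-create: the no-op UPDATE makes RETURNING work for
        # existing rows as well as newly inserted ones.
        self._cur.execute(
            """
            INSERT INTO maintainers (name) VALUES (%s)
            ON CONFLICT (name) DO UPDATE SET name = excluded.name
            RETURNING id
            """,
            (name,),
        )
        maintainer_id = self._cur.fetchone()[0]
        self._cache[name] = maintainer_id
        if len(self._cache) > self._max_size:
            self._cache.popitem(last=False)  # drop least recently used
        return maintainer_id
```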

Regarding tables indexed by project name, the important thing is not to duplicate code for bulk and discrete updates. This may be achieved with a temporary table which lives for the duration of the update transaction and holds the updated project names. With it, we can use the same bulk queries, but limit them to a subset of projects. If the subset is big enough, it may be ignored and a full bulk update used.
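
A sketch of the temporary-table approach (the derived-table query is illustrative):

```python
def update_derived_for_projects(cur, changed_projects):
    # Transaction-scoped table of affected project names; the same
    # aggregate query serves both the delta and the full bulk path.
    cur.execute("""
        CREATE TEMPORARY TABLE changed_projects (effname text PRIMARY KEY)
        ON COMMIT DROP
    """)
    cur.executemany(
        'INSERT INTO changed_projects VALUES (%s)',
        [(name,) for name in changed_projects],
    )
    cur.execute("""
        UPDATE metapackages
        SET num_repos = agg.num_repos
        FROM (
            SELECT effname, count(DISTINCT repo) AS num_repos
            FROM packages
            WHERE effname IN (SELECT effname FROM changed_projects)
            GROUP BY effname
        ) agg
        WHERE metapackages.effname = agg.effname
    """)
```

For a full update, the same query can be run without the filtering subquery, which is exactly the "ignore the subset and use bulk update" case.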

AMDmi3 · Jan 15 '20

Binding tables are done: -30% database update time. The next big time consumer is url_relations. There are ways to optimize both url_relations table construction and the update of the has_related flag in metapackages.

By the way, the metapackages table is around 4x its minimal size (as can be seen after a vacuum), which is caused by multiple consecutive complete updates leaving N dead copies of each tuple. Apart from only updating affected projects, another optimization is to add conditions to UPDATE queries so they do not needlessly rewrite each row on each pass.
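
For example (illustrative names), an IS DISTINCT FROM guard skips rows whose value would not actually change, so no dead tuple is produced for them:

```python
def update_has_related(cur):
    # Only rewrite rows whose has_related value actually changes;
    # IS DISTINCT FROM also handles NULLs, unlike <> or NOT.
    cur.execute("""
        UPDATE metapackages
        SET has_related = true
        WHERE effname IN (SELECT effname FROM url_relations)
          AND has_related IS DISTINCT FROM true
    """)
```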

Another shortcoming of the current database schema is the use of last_seen fields (in metapackages, maintainers, links). For projects and maintainers, there should instead be an orphaned_at field, set when all related packages disappear. For links, there could be a similar flag updated weekly by discovering all unreferenced links. After some time, these objects could be removed.
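
A sketch of the orphaned_at scheme for maintainers, assuming maintainers are still stored as a text array on packages (all names illustrative; links would get the same treatment via the weekly unreferenced-links pass):

```python
def refresh_orphaned_maintainers(cur):
    # Mark maintainers that no longer have any packages.
    cur.execute("""
        UPDATE maintainers m SET orphaned_at = now()
        WHERE orphaned_at IS NULL
          AND NOT EXISTS (SELECT 1 FROM packages p WHERE m.name = ANY(p.maintainers))
    """)
    # Clear the mark for maintainers that reappeared.
    cur.execute("""
        UPDATE maintainers m SET orphaned_at = NULL
        WHERE orphaned_at IS NOT NULL
          AND EXISTS (SELECT 1 FROM packages p WHERE m.name = ANY(p.maintainers))
    """)
    # Purge maintainers that have been orphaned long enough.
    cur.execute("DELETE FROM maintainers WHERE orphaned_at < now() - interval '1 month'")
```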

AMDmi3 · Jan 29 '20