repology-updater
Track project releases properly
Currently, when generating project history, we use the time of the latest project release to calculate release periods and package update lags. It is stored in the metapackage table and calculated with a trigger. Since the trigger is going away with #956, and the way the time was calculated was not really accurate, an improved mechanism is needed.
Current logic:
- If (any of) updated devel/newest version is greater than (any of) older devel/newest AND the package with this version was seen in its repository before, consider the new version the actual devel/newest
- If (any of) updated devel/newest version is lower than (any of) older devel/newest, reset the actual devel/newest version (that is, set it to "unknown", and history will not calculate any time intervals based on it)
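The two rules above can be sketched roughly as follows. This is only an illustration, not the trigger code: `TrackedRelease`, `update_tracked` and the `seen_before` flag are hypothetical names, versions are simplified to integer tuples, and keeping the previous state when neither rule applies is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumption: versions are plain integer tuples; real comparison is more involved.
Version = Tuple[int, ...]


@dataclass
class TrackedRelease:
    version: Optional[Version]  # last observed newest (or devel) version
    actual: Optional[Version]   # version trusted for history; None means "unknown"


def update_tracked(state: TrackedRelease, new_version: Version, seen_before: bool) -> TrackedRelease:
    """Apply the two rules to one observed update (hypothetical helper)."""
    if state.version is not None and new_version > state.version and seen_before:
        # Rule 1: a genuinely newer version from an already known package.
        actual = new_version
    elif state.version is not None and new_version < state.version:
        # Rule 2: version went backwards, so the release time is no longer trusted.
        actual = None
    else:
        # Assumption: in all other cases the trusted version is left untouched.
        actual = state.actual
    return TrackedRelease(version=new_version, actual=actual)
```

The second branch is what makes history lose its reference point after a rollback: once `actual` is `None`, no interval can be computed until a later update satisfies the first rule again.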
Things to consider:
- There are currently 4 parallel version branches supported: newest/devel × normal/altver; altver is not currently handled
- A version which was once newest may later be ignored or removed. This is currently handled by the second condition in the logic explained above, however history stays affected by the incorrect release
- In fact, we need to keep a complete release history in order to be able to calculate mean outdatedness and update time (#62)
So, the idea is to keep a complete history of project releases as seen by Repology. Open questions:
- Should we only take newest, or all versions?
  - New versions may appear among the outdated ones, e.g. when a legacy branch is updated
  - These are however likely to contain garbage like old snapshots, which we don't care about since they're outdated anyway
  - We could look into flags to check if there was an attempt to ignore a version
  - However, it would be best to at least have repology/repology-rules#20
- How much should we care about incorrect data?
  - Are we going to show release dates to users, are we going to use them in history, or are we going to only use them for statistics?
- Do we need to ignore versions from newly introduced repositories at all?
  - The way it's done now is not reliable anyway, as a repository may be replaced completely (e.g. Debian) or extended by adding another source/subrepository.
  - We may just use the most accurate time we have, but have it always.
Do we need to ignore versions from newly introduced repositories at all?
Turns out it is completely pointless. For instance, even if we mark all start dates as untrusted on the first package import, any following update would be recorded as legit, even though it would in fact be even more inaccurate.
So, the idea is to implement it along with #527. In fact, it seems to be the only reliable way to ignore and unignore versions post-factum. #527 would provide granular facts on project versions in each repo, with dates and whether these were ever ignored. These facts may be (re-)merged into aggregate release history with each update, incorporating all positive and negative changes at once.
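A minimal sketch of such a re-merge, assuming per-repository facts shaped like the hypothetical `VersionFact` below (the actual schema from #527 is not defined here). It rebuilds the aggregate from scratch on every call, skips facts that were ever ignored, and groups by exact version string only.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable


@dataclass(frozen=True)
class VersionFact:
    # Hypothetical shape of one per-repository fact; field names are assumptions.
    repo: str
    version: str
    first_seen: date
    ever_ignored: bool


def rebuild_history(facts: Iterable[VersionFact]) -> Dict[str, date]:
    """Merge facts into {version: earliest trusted first_seen}."""
    history: Dict[str, date] = {}
    for fact in facts:
        if fact.ever_ignored:
            # Assumption: an ignored fact contributes nothing to the release history.
            continue
        known = history.get(fact.version)
        if known is None or fact.first_seen < known:
            history[fact.version] = fact.first_seen
    return history
```

Because nothing is updated incrementally, flipping `ever_ignored` on a fact and calling `rebuild_history` again is enough to add or drop a release after the fact.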
This aggregated data may be used to calculate lags on the fly. In fact, it may partially replace the history as well.
It looks like devel/newest and altver handling may be avoided as well, because to determine lag one only needs to find the closest version below given one. However, postgresql-libversion needs to be taught to handle version flags in order to do that.
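To illustrate the "closest version below" lookup: with releases kept sorted by version, it is a single binary search. The sketch below assumes versions are already reduced to comparable tuples (flag-aware comparison being exactly the missing part), and `closest_below` is a hypothetical helper name.

```python
from bisect import bisect_left
from datetime import date
from typing import List, Optional, Tuple

Version = Tuple[int, ...]          # assumption: simplified comparable version
Release = Tuple[Version, date]     # (version, date first seen)


def closest_below(releases: List[Release], version: Version) -> Optional[Release]:
    """Return the highest release strictly below `version`, or None.

    `releases` must be sorted by version in ascending order.
    """
    versions = [v for v, _ in releases]
    idx = bisect_left(versions, version)  # index of first release >= version
    return releases[idx - 1] if idx > 0 else None
```

The date attached to the returned release, together with the date of the given version, yields the interval that a lag calculation would start from.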
Aggregation by version done here needs some extra attention:
- It needs to be aware of version comparison. E.g. it needs to treat `1.0` and `1.0.0` as a single version and group them into a single release.
- It does not need to be aware of flags when aggregating by version on repository level, as a version has a fixed meaning within a single repository, and we don't want to treat version flag changes as new releases
- However it needs to be aware of flags when aggregating into releases, as it's needed for proper sorting and differentiating similarly looking versions with different meaning (`1p1` < `1` < p_is_patch `1p1`)
postgresql-libversion changes are needed to implement these.
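As a rough illustration of comparison-aware grouping, the sketch below uses a deliberately simplified stand-in for version comparison: dotted numeric versions only, with trailing zero components stripped. This is not libversion's algorithm and it does not handle flags such as p_is_patch; `version_key` and `aggregate_releases` are hypothetical names.

```python
from datetime import date
from typing import Dict, Iterable, List, Set, Tuple

Key = Tuple[int, ...]


def version_key(version: str) -> Key:
    """Simplified comparison key: '1.0' and '1.0.0' both map to (1,)."""
    parts = [int(p) for p in version.split(".")]
    while parts and parts[-1] == 0:
        parts.pop()  # trailing zeros do not change the version's meaning here
    return tuple(parts)


def aggregate_releases(
    facts: Iterable[Tuple[str, str, date]],
) -> List[Tuple[Key, List[str], date]]:
    """Group (repo, version, first_seen) facts into releases sorted by version."""
    spellings: Dict[Key, Set[str]] = {}
    first_seen: Dict[Key, date] = {}
    for _repo, version, seen in facts:
        key = version_key(version)
        spellings.setdefault(key, set()).add(version)
        if key not in first_seen or seen < first_seen[key]:
            first_seen[key] = seen  # a release dates from its earliest sighting
    return [(key, sorted(spellings[key]), first_seen[key]) for key in sorted(spellings)]
```

Each resulting release keeps every spelling seen across repositories, so `1.0` and `1.0.0` end up as one entry dated by whichever repository had it first.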