libdnf icon indicating copy to clipboard operation
libdnf copied to clipboard

Countme should report system age, not repository age

Open dmnks opened this issue 1 year ago • 14 comments

Currently, when we compute the system's age bucket (1 through 4) to report in the weekly countme flag, we do that relative to the first-ever metadata refresh (called the epoch) of the respective repository. However, the original proposal intended that it would be the absolute age bucket, that is, since the installation.

This is because we store the cookie files (containing the timestamps) in per-repository directories (persistdir) whose names contain hashes derived from various repository properties including the releasever value. That means, the system's age bucket is effectively reset on each Fedora system upgrade which is not what we want.

To fix this, we should simply keep one single cookie file for the entire system and use that to determine the system's age bucket.

There's a second countme implementation in rpm-ostree (here's why) which reportedly does the right thing. Looking at the code, they do appear to store only one cookie file per system (at /var/lib/rpm-ostree-countme/cookie), as it should be. I think we should just do the same.

To avoid skewing the metrics, the fix should probably include a check for an old, repo-specific cookie file and if it exists, it should load the values from it and then remove the file. When it comes to storing the new values at the end of the addCountmeFlag() function, that should already go into the system-wide cookie file. That way, systems that upgrade to the fixed DNF version would simply continue where they left off, instead of being reset to age 1. Note that this may need special care in case repositories are fetched in parallel.

dmnks avatar Jul 28 '23 18:07 dmnks

To avoid skewing the metrics, the fix should probably include a check for an old, repo-specific cookie file and if it exists, it should load the values from it and then remove the file.

Probably the "best" thing to do is find the oldest countme file (including disabled repos).

Hacky but maybe more accurate — does dnf create any other files in /var or /etc at install time that would likely have a corresponding file date which could be used.

Both of these will probably cause "jumps" in my data — but I'm okay with that, really.

mattdm avatar Jul 29 '23 09:07 mattdm

I think we could use the transaction ID 1 in the DNF history database which, I believe, represents the fresh install through Anaconda. The transaction record contains the timestamp. On the CLI, you can check that with:

dnf history info 1 | grep 'Begin time'

That way, we wouldn't need to store the "epoch" in the cookie file, and would just always use the above timestamp for that.

dmnks avatar Jul 29 '23 10:07 dmnks

Thinking about it more, the first-ever transaction may not be a reliable indicator of the system age for ephemeral systems that are not installed through Anaconda but from an image (e.g. Podman containers). So we may need a different strategy (for those).

dmnks avatar Jul 29 '23 11:07 dmnks

There's more systems that are not installed through Anaconda (the ARM version often gets installed from an image, virtual machines at cloud providers, etc) so I wouldn't special case it :)

supakeen avatar Jul 31 '23 06:07 supakeen

Thanks, that's a useful data point to have :smile:

dmnks avatar Jul 31 '23 09:07 dmnks

Just FTR, @james-antill suggested in a chat that one solution would also be keeping per-repo countme files but doing that in directories named after the repo ID only (not a hash).

dmnks avatar Jul 31 '23 19:07 dmnks

Just as FYI, here is the implementation in rpm-ostree that does not have this issue: https://github.com/coreos/rpm-ostree/blob/main/rust/src/countme/cookie.rs

travier avatar Aug 07 '23 13:08 travier

Is there any movement on this? What is the implementation like in DNF 5?

mattdm avatar Apr 01 '24 16:04 mattdm

What is the implementation like in DNF 5?

I've just checked it, it's basically a clone of the dnf4 implementation.

Is there any movement on this?

We'll discuss it with leadership and the team in the following days and provide feedback soon.

jan-kolarik avatar Apr 05 '24 08:04 jan-kolarik

I've just checked it, it's basically a clone of the dnf4 implementation.

Ah, bug-for-bug compatibility. :)

Am I possibly currently getting double-counts from people using both, or using e.g. GNOME Software + dnf5 in f39?

mattdm avatar Apr 05 '24 19:04 mattdm

If dnf4 and dnf5 both use a different repo "persistdir", then yep, we're likely double-counting already.

This is really silly and needs to be fixed ASAP. Since I wrote that code (and still remember how it works, kinda), it just makes sense for me to have a closer look, then... So I'll do just that, assigning to myself now.

dmnks avatar Apr 08 '24 11:04 dmnks

If dnf4 and dnf5 both use a different repo "persistdir", then yep, we're likely double-counting already.

Good news, I guess. I've just checked and dnf5 uses the same persistent directories (/var/lib/dnf/repos/<repoid>-<hash>/) as dnf4, meaning that countme flags are not sent twice for each.

dmnks avatar Apr 09 '24 08:04 dmnks

TL;DR: A simple fix is underway. I'll be on PTO next week, so expect silence here until I'm back.

Having thought about this more, we do need to continue tracking the countme timestamps ("cookie" files in /var/lib/dnf/repos/) on a per-repo basis, as opposed to having one system-wide timestamp. This is simply because the countme flag is reported per-repo (via the metalink URL) and using a system-wide cookie would cause only one repo (whichever happens to be fetched first by dnf) to issue the flag each week, which is not what we want.

However, what we do want to change is so that the timestamps aren't dependent on the $releasever value as that value is part of the metalink URL (which is used to compute the hash). Therefore, the easiest fix is to just change the per-repo directory names in /var/lib/dnf/repos/ from <repoid>-<hash> to <repoid>. This was also mentioned above as one of the possible solutions.

I have a working (one-line) patch for that locally, as well as an updated countme.feature test to cover this. So that part is easy.

The tricky part is to ensure that the cookie is not reset when the existing systems upgrade to the fixed libdnf version (once released). Since the directory name changes, libdnf would think that the system doesn't yet have a cookie file and thus 1) would start over, with age set to 1 (countme=1), as if the system was just freshly installed, and 2) would possibly send the flag again in the same week, thus double-counting the system in that week. This would skew the metrics we gather on the server quite a bit.

To prevent that, the cookie file needs to stay the same when you upgrade libdnf to the fixed version, as well as if you decide to downgrade to the old version for some reason. The easiest solution to that seems to be the following:

  1. In a (%post?) scriptlet in libdnf, check whether we have an existing cookie for the main repos ("fedora" and "updates"?).
  2. If we do, create a non-hash symlink for each of those repos. For example, if /var/lib/dnf/repos/fedora-845d89688cb28f31 exists, a symlink named /var/lib/dnf/repos/fedora pointing to the former would be created by the scriptlet.

This way, the same cookie file would be reused after upgrading to the new libdnf version as well as after downgrading it.

What the scriptlet needs to decide, though, is which directory to choose for the symlink target if there are multiple - that can happen easily, such as if dnf --releasever is ever used on the system.

I think it should choose the one that corresponds to the running Fedora version, e.g. by looking at /etc/os-release (VERSION_ID). This is quite easy to do, the hash is a SHA256 of the metalink URL so we can compute that easily in the scriptlet using core-utils programs.

In fact, I also have a draft scriptlet locally which works as described above, we just need to decide on which repositories to "migrate". I'd think "fedora" and "updates" should suffice, but please let me know otherwise.

So, that's for a status update. I've decided to dump my thoughts here because I'll be on vacation next week and might otherwise forget the details :smile: Any feedback is of course welcome in the meantime. Just know that I'll only be able to respond when I'm back.

dmnks avatar Apr 12 '24 15:04 dmnks

Hacky but maybe more accurate — does dnf create any other files in /var or /etc at install time that would likely have a corresponding file date which could be used.

Not dnf, but there's the /etc/machine-id file which, actually, seems to fit the bill quite perfectly. Its modification timestamp typically reflects the installation time as the ID is generated during system installation or first boot (by systemd) and then stays untouched.

So, scratch my above ponderings about changing the persistdir naming scheme. Instead, I've submitted https://github.com/rpm-software-management/libdnf/pull/1662 which switches age counting to the machine-id file's timestamp.

Here's an updated BDD feature file which demonstrates the new logic (see the Examples table at the bottom of the Scenario Outline): https://github.com/rpm-software-management/ci-dnf-stack/blob/ab365d2bad19f69e188fb449fb6bcdd8834f5815/dnf-behave-tests/dnf/countme.feature#L44

dmnks avatar May 09 '24 11:05 dmnks