There's still a need to bump the memcache size

Open jidanni opened this issue 1 year ago • 26 comments

Hello. In https://github.com/openstreetmap/openstreetmap-website/issues/2457 I was told to open an issue here. But as it is getting a little over my head, I will just leave this here.

jidanni avatar Jun 27 '24 08:06 jidanni

There is no evidence at all in the graphs that this is in fact an issue. I definitely see the issue that you are referring to, but I am unable to explain it, as all the evidence says it shouldn't be down to memcache.

tomhughes avatar Jul 01 '24 18:07 tomhughes

Could these session disconnections be caused by server restarts, or does the server never restart?

tbertels avatar Jul 06 '24 11:07 tbertels

I don't see any server restart in the stats, at least for the last 6 months: https://prometheus.openstreetmap.org/d/l4zgNUdMz/memcached?orgId=1&refresh=1m&from=now-6M&to=now

Also, the OP didn't provide any details on how frequently they have to log in again. There might be external factors, like cookies being removed by the browser or by some browser extension, etc.

mmd-osm avatar Jul 06 '24 12:07 mmd-osm

I thought everybody else also has to log in again at least once every three or four days. Maybe it's because I use various browsers on various devices. But why do I need to log in again after three or four days on the same device? Anyway, you're welcome to check the logs to see why user jidanni has to log in again so often.

jidanni avatar Jul 07 '24 00:07 jidanni

Which stat do you use to check if the server restarted? Aren't these sudden drops in memory usage symptoms of a server restart? Note that the dates are in the format month/day.

Screenshot (Copie d'écran_20240707_143049m)

tbertels avatar Jul 07 '24 12:07 tbertels

Ah, the link wasn't that helpful. There are about 11 memcached instances overall. However, for the 3 frontend servers, only 3 memcached instances (spike-06 ... spike-08) are relevant. Items in cache and memory usage are fairly stable for these three.

https://prometheus.openstreetmap.org/d/l4zgNUdMz/memcached?orgId=1&refresh=1m&from=now-6M&to=now&var-instance=spike-06&var-instance=spike-07&var-instance=spike-08

I think this should match the following config in chef: https://github.com/openstreetmap/chef/blob/45dc24b65b23a6c1dcc2f0ba2aa971563555c35e/roles/web.rb#L20

mmd-osm avatar Jul 07 '24 16:07 mmd-osm

A restart would indeed lose all sessions but as @mmd-osm says it's only those three machines that we're talking about here and they last restarted in November last year:

image

At that time it took nearly two months for the caches to fill up which suggests that it should take about that long for things to get expired unless there has been a significant increase in the cache usage since.

tomhughes avatar Jul 07 '24 17:07 tomhughes

The eviction rate has increased since November, but it hasn't consistently been more than double. Commands/second has remained the same.

pnorman avatar Jul 12 '24 03:07 pnorman

I logged back in 5 days ago: 1 day later my session was still active, but today I'm logged out. We can also see a dip today from ~100 million items in cache to ~66 million.

I suggest storing the sessions in the DB and using memcache only to speed up session checks for frequently used sessions.
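To illustrate the idea, here is a minimal Python sketch of a read-through cache, assuming plain dicts as stand-ins for the database table and for memcached. The class and method names are hypothetical and are not taken from the website code.

```python
class SessionStore:
    """Durable store (a dict standing in for a DB table) fronted by a cache."""

    def __init__(self):
        self.db = {}       # stand-in for a sessions table (source of truth)
        self.cache = {}    # stand-in for memcached
        self.db_reads = 0  # counts how often the slow path is taken

    def save(self, session_id, data):
        self.db[session_id] = data     # always persist to the DB
        self.cache[session_id] = data  # warm the cache as well

    def load(self, session_id):
        if session_id in self.cache:   # fast path: cache hit
            return self.cache[session_id]
        self.db_reads += 1
        data = self.db.get(session_id)  # slow path: fall back to the DB
        if data is not None:
            self.cache[session_id] = data  # repopulate for the next request
        return data

    def evict_from_cache(self, session_id):
        # Simulates a memcached eviction or restart for this key.
        self.cache.pop(session_id, None)


store = SessionStore()
store.save("abc", {"user": 1})
store.evict_from_cache("abc")
session = store.load("abc")  # still found, via the DB
```

With this layout, losing a cache entry costs one extra DB read instead of a logout.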

tbertels avatar Jul 12 '24 07:07 tbertels

One of the machines was rebooted yesterday while fighting the DDoS, so 1/3 of the cache entries were lost.

tomhughes avatar Jul 12 '24 07:07 tomhughes

I'm wondering how many of these entries originate from CGImap (key prefix would be "cgimap:"). For some reason, these entries have the expiration value set to 0 (unlimited). This doesn't make a whole lot of sense for rate limiting requests, where the exact timestamp would be known upfront at which time these entries become irrelevant.

mmd-osm avatar Jul 12 '24 13:07 mmd-osm

At least when testing locally, I've noticed that every anonymous user creates a rails session without expiry (that's the "0" in "1 0 73" below), whereas logged in users have an entry with 4-5 weeks expiration.

Anonymous user sessions:

/usr/share/memcached/scripts/memcached-tool localhost:11211 dump 
Dumping memcache contents
add rails:session:2::2d28d018bdda81f05bae57ba42ee200a7a14af6df74134bb93ee82f99bf7baab 1 0 73
{I"_csrf_token:EFI"096xa2ms9DVncEF7CBUeBJ0wP9VYJrKO6lzxqDomep74;F

Logged in user:

Expires at 1723288155 = Sat Aug 10 13:09:15 CEST 2024

add rails:session:2::2d28d018bdda81f05bae57ba42ee200a7a14af6df74134bb93ee82f99bf7baab 1 1723288155 200
{	I"_csrf_token:EFI"096xa2ms9DVncEF7CBUeBJ0wP9VYJrKO6lzxqDomep74;FI"	user;FiI"fingerprint;FI"E....
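For anyone who wants to check these values in bulk, here is a small Python sketch that parses the "add" lines of such a dump. It assumes the field order seen above (key, flags, expiry, size); the function name is made up for illustration.

```python
from datetime import datetime, timezone


def parse_add_line(line):
    """Parse 'add <key> <flags> <exptime> <bytes>' from a memcached-tool dump.

    Returns None for any other line (for example the serialized value lines).
    """
    parts = line.split()
    if len(parts) != 5 or parts[0] != "add":
        return None
    _, key, flags, exptime, size = parts
    exptime = int(exptime)
    # An expiry of 0 means the entry never expires on its own.
    expires_at = (
        datetime.fromtimestamp(exptime, tz=timezone.utc) if exptime else None
    )
    return {
        "key": key,
        "flags": int(flags),
        "exptime": exptime,
        "bytes": int(size),
        "expires_at": expires_at,
    }


KEY = "rails:session:2::2d28d018bdda81f05bae57ba42ee200a7a14af6df74134bb93ee82f99bf7baab"
anonymous = parse_add_line(f"add {KEY} 1 0 73")
logged_in = parse_add_line(f"add {KEY} 1 1723288155 200")
```

Here `anonymous["expires_at"]` is `None`, while `logged_in["expires_at"]` is 2024-08-10 11:09:15 UTC, i.e. the 13:09:15 CEST shown above.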

mmd-osm avatar Jul 13 '24 11:07 mmd-osm

Expiry shouldn't really matter that much because anything that isn't used just moves down the LRU list and gets discarded eventually when we need space for a new entry.

Logged in sessions (with "remember me" checked) do get an expiry of 28 days which matches the cookie expiry while other sessions (not logged in and logged in without "remember me" checked) actually don't have an expiry but issue a session cookie that expires when the browser is closed.

tomhughes avatar Jul 13 '24 11:07 tomhughes

First of all, I find it a bit difficult to reason about the logged in sessions based on Prometheus stats, in particular after how many days these entries would be discarded.

memcached has an LRU crawler which reclaims expired entries even before they reach the end of the LRU list. With a non-zero TTL, we might get rid of many "non-logged in user" entries early on, before they can evict "logged in user" entries.
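A toy Python simulation of this effect, as a sketch only: the cache below is a simplified model (integer clock, tiny capacity, a crawl on every write) and not memcached's actual implementation, and all names and numbers are invented for illustration. The logged-in session is deliberately never touched again, which is the worst case for LRU.

```python
from collections import OrderedDict


class TinyCache:
    """Toy LRU cache with optional per-item expiry; time is a plain integer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # key -> expiry time (None = never expires)

    def crawl(self, now):
        # Stand-in for the LRU crawler: drop entries whose expiry has passed.
        expired = [k for k, exp in self.items.items()
                   if exp is not None and exp <= now]
        for key in expired:
            del self.items[key]

    def set(self, key, now, ttl=0):
        self.crawl(now)
        if key in self.items:
            del self.items[key]
        elif len(self.items) >= self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry
        self.items[key] = (now + ttl) if ttl else None


def logged_in_survives(anonymous_ttl):
    cache = TinyCache(capacity=10)
    cache.set("user-session", now=0, ttl=1000)  # long-lived logged-in session
    for t in range(1, 51):                      # 50 anonymous visitors
        cache.set(f"anon-{t}", now=t, ttl=anonymous_ttl)
    return "user-session" in cache.items


without_ttl = logged_in_survives(anonymous_ttl=0)  # anonymous entries never expire
with_ttl = logged_in_survives(anonymous_ttl=5)     # anonymous entries expire quickly
```

In this model the logged-in session is evicted when anonymous entries have no TTL, and survives when they expire early enough that the cache never fills.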

mmd-osm avatar Jul 13 '24 13:07 mmd-osm

At the current growth rate, we will likely see some evictions in about 10 days (=21 days after last memcached restart).
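The estimate is a simple linear extrapolation. A minimal Python sketch follows; the item counts are made-up placeholders chosen only so that the result matches the 10-day figure, and the real values would come from the Prometheus dashboard.

```python
def days_until_full(current_items, capacity_items, items_per_day):
    """Linear estimate of the days left before the cache starts evicting."""
    if items_per_day <= 0:
        raise ValueError("growth rate must be positive")
    return max(0.0, (capacity_items - current_items) / items_per_day)


# Hypothetical numbers, for illustration only.
days_since_restart = 11
current_items = 55_000_000
capacity_items = 105_000_000

growth_per_day = current_items / days_since_restart
remaining = days_until_full(current_items, capacity_items, growth_per_day)
total = days_since_restart + remaining
```

This assumes a constant growth rate, which a traffic spike can easily break.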

@jidanni : did you notice any issues with lost login sessions in the last 8-9 days? If so, it can’t be memcached related…

mmd-osm avatar Jul 23 '24 16:07 mmd-osm

It's not that simple because only one machine was reset I think? So only keys which hash to that machine are currently exempt from being evicted.
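As a rough illustration of why a single restart only affects some keys, here is a Python sketch of client-side key sharding. It assumes a plain hash-modulo scheme for simplicity; real memcached clients typically use some form of consistent hashing, so the exact mapping here is not what the website's client computes.

```python
import hashlib

SERVERS = ["spike-06", "spike-07", "spike-08"]


def server_for(key, servers=SERVERS):
    """Map a cache key to exactly one server (simplified modulo sharding)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]


# Spread 3000 made-up session keys over the three servers.
keys = [f"rails:session:2::{i:064x}" for i in range(3000)]
counts = {server: 0 for server in SERVERS}
for key in keys:
    counts[server_for(key)] += 1
```

Each key lives on one server only, so restarting one machine empties roughly a third of the sessions and leaves the rest subject to the usual eviction pressure.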

tomhughes avatar Jul 23 '24 16:07 tomhughes

I think spike-06..08 were all restarted; the aggregated cached items count on Prometheus shows 0 entries about 10 days ago.

mmd-osm avatar Jul 23 '24 16:07 mmd-osm

@mmd-osm rather than using my misty memory, surely there must be some internal logs you can check regarding me (user: jidanni) that can give you precise details.

jidanni avatar Jul 24 '24 04:07 jidanni

We want to hear from you first hand, as you’ve also raised the issue. Misty memory is ok. If you say it hasn’t bothered you recently then that’s good enough for now.

What we see in the charts right now is that no entries are being removed. So chances are that your session is still around.

mmd-osm avatar Jul 24 '24 06:07 mmd-osm

Okay. I will remember next time to report each and every incident right here to the thread.

jidanni avatar Jul 24 '24 08:07 jidanni

Okay. Just had to log in again as you can see in your logs perhaps.

jidanni avatar Aug 05 '24 12:08 jidanni

Thank you for the feedback. This is not completely unexpected. Evictions started again on August 1st, even a bit sooner than estimated.

mmd-osm avatar Aug 05 '24 13:08 mmd-osm

On a laptop I hadn't used in five days: Had to login again to OSM. But didn't need to login again to GitHub to add this comment.

jidanni avatar Aug 31 '24 02:08 jidanni

At least 8 other users have reported the same issue in https://community.openstreetmap.org/t/osm-webseite-standiges-login-notig/120072

All different browsers, not only Firefox. I could also reproduce it today on my mobile.

spike-0[6-8] have been seeing some cache evictions again for the past few days:

image

image

Following up on my previous comment about getting rid of anonymous sessions as early as possible, we could check how the GitLab repo addressed the issue. They're having similar issues with Redis and unauthenticated users filling up the memory. Redis and Memcached implementations should be fairly similar; ['rack.session.options'][:expire_after] is also used by the memcached client.

Initially, GitLab added a special helper for this purpose: https://gitlab.com/gitlab-org/gitlab/-/blob/ee088fc0d53198016e245c515f28e03d8229e297/app/controllers/application_controller.rb#L29 and some PRs on the topic: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/88514/diffs

Helper: https://gitlab.com/gitlab-org/gitlab/-/blob/ee088fc0d53198016e245c515f28e03d8229e297/app/helpers/sessions_helper.rb#L17-41

Lately they seem to have moved it to its own Rack middleware to cover more scenarios: https://gitlab.com/gitlab-org/gitlab/-/commit/8c85364205ccb1f4602ab3543d10ff55295bd6cc

This might be worthwhile checking out.
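The general shape of such a policy can be sketched in a few lines of Python. This is not GitLab's code: the function name is hypothetical, the 2-hour value for anonymous sessions is an arbitrary placeholder, and only the 28-day "remember me" figure comes from this thread.

```python
from datetime import timedelta

REMEMBER_ME_TTL = timedelta(days=28)  # from this thread: matches the cookie expiry
ANONYMOUS_TTL = timedelta(hours=2)    # hypothetical placeholder value


def session_ttl(logged_in, remember_me=False):
    """Pick a server-side expiry for a session, or None to leave it unchanged."""
    if not logged_in:
        return ANONYMOUS_TTL    # short-lived, so it can be reclaimed early
    if remember_me:
        return REMEMBER_ME_TTL  # keep in line with the persistent cookie
    return None                 # logged in without "remember me": no change


anonymous_ttl = session_ttl(logged_in=False)
persistent_ttl = session_ttl(logged_in=True, remember_me=True)
```

Only the anonymous branch is new here; the other two branches reflect the behaviour described earlier in the thread.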

mmd-osm avatar Oct 13 '24 09:10 mmd-osm

I've adjusted the GitLab code a bit to work with the osm website: https://github.com/mmd-osm/openstreetmap-website/tree/patch/sessionexpiry

It's more of a proof of concept at this time, to demo the idea. I can create a PR to continue the discussion, if needed. It should also not interfere with session_persistence.rb and session_methods.rb, which define a cookie expiration for logged on users only.

For testing, I recommend checking the output of "memcached-tool localhost:11211 dump" after each activity, in particular the TTL value. That's the second-to-last value in each line starting with "add rails:session:2:..." (format: unix epoch).

/fyi: @AntonKhorev


Meanwhile, memcached has also been restarted or purged, so we're down to 0 evictions for the next few weeks.

mmd-osm avatar Oct 14 '24 17:10 mmd-osm

Today had to login again.

jidanni avatar Oct 19 '24 23:10 jidanni

Today had to login again too.

jidanni avatar Jan 27 '25 03:01 jidanni

Also today (Wed Jan 29 03:07:20 AM UTC 2025) I had to log in again. Yes, "Remember me" was checked last time, as I always do.

jidanni avatar Jan 29 '25 03:01 jidanni

I had to login again on 2025-02-26 and today (2025-02-29).

tbertels avatar Jan 29 '25 06:01 tbertels

I had to login again on 2025-02-26 and today (2025-02-29).

Me too, today 2025/2/3.

Hey wait, your clock is a month ahead.

Anyway,

Feature request

Most valued users list

The Most valued users list would be composed of, let's say people who you don't want to log off often. Who? Backers who have donated more than a million dollars, heads of state, influential professors, etc., and me, your first test guinea pig.

Every effort shall be made never to log them off, except as a once a year security exercise with a notice on their screens (thus turning getting logged off into a happy experience (staff cares about my security.))

They will have their own private memcache pool or whatever behind the scenes to assure them that OSM is running smoothly and their donations were put to good use. Of course they would never hear the word memcache. All they know is the site is up.

If the program is a success it would be quietly expanded behind the scenes to eventually encompass all users.

jidanni avatar Feb 03 '25 05:02 jidanni