Upgrading to 24.12 caused 5x memory consumption

Open obarash opened this issue 9 months ago • 8 comments

Environment

  • ejabberd version: 24.12
  • Erlang version: 14.2.1 (Erlang/OTP 26)
  • OS: Linux (Debian)
  • Installed from: docker:ecs

Bug description

After upgrading ejabberd from v23.01 to v24.12, we are seeing a major increase in memory consumption that makes the system unstable without additional hardware.

System Setup:

  • Cluster: 3 nodes initially, expanded to 6 nodes to mitigate the issue
  • Concurrent Users: ~3600 users total
  • Previous Version: 23.01
  • Current Version: 24.12
  • Hardware: Each node with sufficient CPU and RAM capacity
  • Database: AWS RDS (MySQL)

Observed Behavior:

| State | Number of Nodes | Total Memory Used | Memory per Node |
|---|---|---|---|
| v23.01 (before upgrade) | 3 | ~1.6 GB | |
| v24.12 (immediately after upgrade) | 3 | ~2 GB | |
| v24.12 (after 4 hours) | 3 | ~5 GB | |
| v24.12 (after scaling to 6 nodes) | 6 | ~9.4 GB | |

Steps Taken to Mitigate:

  • Scaled cluster from 3 nodes → 6 nodes
  • Performed rolling restarts to balance users across 6 nodes
  • Current state is somewhat stable, but total memory use has increased nearly 6x compared to the previous version (9.4 GB vs. 1.6 GB).
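As a rough check on the figures above, the per-node averages and growth factors can be worked out with a short script. This is only a sketch: it assumes the reported values are cluster-wide totals spread evenly across the nodes, which the report does not state, and `report` is a helper name introduced here for illustration.

```shell
#!/bin/sh
# Sketch: derive per-node averages and growth relative to the pre-upgrade total.
# Assumption: each figure is a cluster-wide total, evenly split across nodes.

report() {
    # $1 = label, $2 = number of nodes, $3 = total memory in GB
    awk -v label="$1" -v nodes="$2" -v total="$3" -v base=1.6 'BEGIN {
        printf "%-28s %.2f GB/node  %.1fx baseline\n", label, total / nodes, total / base
    }'
}

report "before upgrade"            3 1.6
report "immediately after upgrade" 3 2
report "after 4 hours"             3 5
report "after scaling to 6 nodes"  6 9.4
```

Under that assumption the last row works out to about 1.57 GB per node, which would mean the per-node footprint roughly tripled while the cluster total grew almost sixfold.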

Notes:

  • No significant change in number of concurrent users or traffic pattern.
  • No configuration changes except for the version upgrade.

We need guidance: is this a known issue, or is a configuration change required in v24.12? Maybe some core module has new behavior that we need to consider?

Here are some images to show the issues:

The change happened at 12:30:

Image

Right before the changes:

Image

After a few hours:

Image

After we added 3 more machines, and reset the existing ones:

Image

obarash, Jun 12 '25 14:06

Cleaned up config? SQL used? Details?

licaon-kter, Jun 12 '25 14:06

Regarding SQL, I mentioned that the DB used is AWS RDS; please tell me which specific details you need.

The previous ejabberd version was 23.01 (not 22.10); I updated the original content.

Config:

###
###              ejabberd configuration file
###
### The parameters used in this configuration file are explained at
###
###       https://docs.ejabberd.im/admin/configuration
###
### The configuration file is written in YAML.
### *******************************************************
### *******           !!! WARNING !!!               *******
### *******     YAML IS INDENTATION SENSITIVE       *******
### ******* MAKE SURE YOU INDENT SECTIONS CORRECTLY *******
### *******************************************************
### Refer to http://en.wikipedia.org/wiki/YAML for the brief description.
###

language: "en"

hosts:
  - obarash.com

loglevel: info
log_rotate_size: 1048576000
log_rotate_count: 7

certfiles:
  - "/home/ejabberd/conf/domain.pem"
ca_file: "/home/ejabberd/conf/cacert.pem"

sql_type: mysql
sql_server: "mysql"
sql_database: "ejabberd"
sql_username: "ejabberd"
sql_password: "passwsord"
sql_port: 3306
sql_pool_size: 20
sql_keepalive_interval: 1
sql_start_interval: 5
sql_ssl: false
sql_ssl_verify: false
sql_ssl_cafile: "/tmp/cacert.crt"
new_sql_schema: false
update_sql_schema: false
default_db: sql
default_ram_db: mnesia
auth_method: sql

allow_contrib_modules: true
cache_size: 20000
max_fsm_queue: 30000


listen:
  - port: 5222
    ip: "::"
    module: ejabberd_c2s
    protocol_options:
      - "no_sslv2"
      - "no_sslv3"
      - "no_tlsv1"
      - "no_tlsv1_1"
    ciphers: "ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA256:DHE-RSA-AES256-SHA:ECDHE-ECDSA-DES-CBC3-SHA:ECDHE-RSA-DES-CBC3-SHA:EDH-RSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:DES-CBC3-SHA:!DSS"
    starttls: true
    starttls_required: false
    tls_compression: false
    max_stanza_size: 262144
    shaper: c2s_shaper
    access: c2s
    use_proxy_protocol: false

  - port: 5269
    ip: "::"
    module: ejabberd_s2s_in
    max_stanza_size: 524288

  - port: 5443
    ip: "::"
    module: ejabberd_http
    request_handlers:
      /upload: mod_upload
      /ws: ejabberd_http_ws
      /bosh: mod_bosh
    captcha: false
    use_proxy_protocol: false
    tls: false

  - port: 5280
    ip: "::"
    module: ejabberd_http
    request_handlers:
      /admin: ejabberd_web_admin
      /api/v0: mod_http_api
      /api: mod_http_api
    captcha: false
    use_proxy_protocol: false
    tls: false

  - port: 1111
    ip: "::"
    module: ejabberd_http
    request_handlers:
      /metrics: mod_obarash_custom-1
    captcha: false
    use_proxy_protocol: false
    tls: false

s2s_use_starttls: optional


acl:
  local:
    user_regexp: ""
  loopback:
    ip:
      - 127.0.0.0/8
      - ::1/128
  admin:
    user:
      - "[email protected]"
      - "[email protected]"


access_rules:
  local:
    allow: local
  c2s:
    deny: blocked
    allow: all
  announce:
    allow: admin
  configure:
    allow: admin
  muc_create:
    allow: local
  pubsub_createnode:
    allow: local
  register:
    allow: local
  trusted_network:
    allow: loopback


api_permissions:
  "console commands":
    from:
      - ejabberd_ctl
    who: all
    what: "*"
  "admin access":
    who:
      # - ip: "172.19.170.0/24"
      # - ip: "172.19.140.0/24"
      # - ip: "172.18.64.0/23"
      - ip: "127.0.0.0/8"
      - access:
        - allow:
          - acl: loopback
          - acl: admin
    what:
      - "*"
      - "!stop"
      - "!start"
  "public commands":
    who:
      ip: 127.0.0.1/8
    what:
      - status
      - connected_users_number


shaper:
  normal:
    rate: 3000
    burst_size: 20000
  fast: 100000


shaper_rules:
  max_user_sessions: 10
  max_user_offline_messages:
    5000: admin
    500: all
  c2s_shaper:
    none: admin
    normal: all
  s2s_shaper: fast


acme:
  auto: false
  contact: "mailto:[email protected]"
  ca_url: "https://acme-v02.api.letsencrypt.org/directory"


modules:
  mod_adhoc: {}
  # mod_admin_update_sql: {}
  mod_admin_extra: {}
  mod_announce:
    access: announce

  mod_avatar: {}
  mod_blocking: {}
  mod_bosh: {}
  mod_caps:
    use_cache: true

  mod_carboncopy: {}
  mod_client_state: {}
  mod_configure: {}
  mod_disco: {}
  # mod_fail2ban: {}
  mod_http_api: 
    default_version: 0
  # mod_http_upload: {}
  mod_last: {}
  mod_mam:
    ## Mnesia is limited to 2GB, better to use an SQL backend
    ## For small servers SQLite is a good fit and is very easy
    ## to configure. Uncomment this when you have SQL configured:
    db_type: sql
    assume_mam_usage: true
    default: always
    user_mucsub_from_muc_archive: true

  # mod_mqtt: {}
  mod_muc:
    access:
      - allow
    access_admin:
      - allow: admin
    access_create: 
      - allow: admin
    access_persistent: muc_create
    access_mam:
      - allow
    default_room_options:
      allow_user_invites: false
      allow_subscription: true
      allow_change_subj: false
      allow_query_users: true
      allowpm: anyone
      mam: true
      members_by_default: true
      members_only: false
      logging: true
      persistent: true
      anonymous: false
      public: false
      presence_broadcast:
        - visitor
    history_size: 0
    max_users: 5000
    max_user_conferences: 5000
    preload_rooms: false

  mod_muc_admin: {
    subscribe_room_many_max_users: 4000
  }
  mod_offline:
    access_max_user_messages: max_user_offline_messages

  mod_ping:
    send_pings: true
    ping_interval: 60
    ping_ack_timeout: 30
    timeout_action: kill

  mod_privacy: {}
  # mod_private: {}
  mod_proxy65:
    access: local
    max_connections: 5

  mod_pubsub:
    access_createnode: pubsub_createnode
    ## reduces resource consumption, but XEP incompliant
    ignore_pep_from_offline: false
    ## XEP compliant, but increases resource consumption
    ## ignore_pep_from_offline: false
    last_item_cache: false
    max_items_node: 1000
    plugins:
      - flat
      - pep
    force_node_config:
      ## Change from "whitelist" to "open" to enable OMEMO support
      ## See https://github.com/processone/ejabberd/issues/2425
      "eu.siacs.conversations.axolotl.*":
        access_model: whitelist
      ## Avoid buggy clients to make their bookmarks public
      storage:bookmarks:
        access_model: whitelist
      obarash:roster:x:
        access_model: presence
        notification_type: normal

  mod_push: {}
  mod_push_keepalive: {}
  mod_register:
    ## Only accept registration requests from the "trusted"
    ## network (see access_rules section above).
    ## Think twice before enabling registration from any
    ## address. See the Jabber SPAM Manifesto for details:
    ## https://github.com/ge0rg/jabber-spam-fighting-manifesto
    ip_access: trusted_network

  mod_roster:
    versioning: true
    store_current_id: true
    db_type: sql
    cache_size: 80000

  mod_s2s_dialback: {}
  mod_shared_roster: {}
  mod_stream_mgmt:
    resend_on_timeout: true
    resume_timeout: 30
    max_ack_queue: 10000

  # mod_stun_disco: {}
  mod_vcard: {}
  mod_vcard_xupdate: {}
  mod_version:
    show_os: false

  mod_obarash_custom-1: {}
  mod_obarash_custom-2: {}
  mod_obarash_custom-3: {}
  mod_obarash_custom-4: {}
  mod_obarash_custom-5: {}
  mod_obarash_custom-6: {}
  mod_obarash_custom-7: {}
  mod_obarash_custom-8: {}
  mod_obarash_custom-9: {}
  mod_obarash_custom-10: {}
  mod_obarash_custom-11: {}
  mod_obarash_custom-12: {}
  mod_obarash_custom-13: {}


### Local Variables:
### mode: yaml
### End:
### vim: set filetype=yaml tabstop=8

obarash, Jun 13 '25 13:06

Check for spam-bot activity. Maybe a spammer is using your server to send outgoing spam.

member7me, Jun 13 '25 18:06

@member7me We have ruled out the spam-bot option: we checked all connected users and approved each one of them. In addition, this behavior started right after upgrading the server, which indicates that it is not related to any unwanted external activity. Our passwords are GUIDs and are changed frequently (actually, every time a user logs in).

relbraun, Jun 15 '25 11:06

Did you follow every version's upgrade notes, including the intermediate ones? Have you read https://docs.ejabberd.im/admin/upgrade/#specific-version-upgrade-notes ?

licaon-kter, Jun 15 '25 12:06

23.01 -> 24.12

> We need guidance: is this a known issue, or is a configuration change required in v24.12?

I don't remember any report similar to this. Make sure you followed all the upgrade notes.

> No configuration changes except for the version upgrade.

Ok, so you made the minimal configuration changes required to upgrade ejabberd, and did not enable new modules or options.

> Maybe some core module has new behavior that we need to consider?

Yes, that's probably the case. Taking a quick look at the roadmap, there were many improvements and changes that take effect automatically.

> No significant change in number of concurrent users or traffic pattern.

Are your users humans performing typical tasks (changing presence every few minutes, sending a message every few seconds, chatting in a few reasonably sized chatrooms, ...), or are they programs that may send a large number of presence changes or messages per second, or that may be in large chatrooms (more than 100)?

Are your users using well-known XMPP clients/libraries, or are they using less-known or custom-made clients/libraries that may trigger some edge case in ejabberd?

Check the ejabberd log files: do they show unusual behaviour, such as reconnections, error messages, or warnings?

It may be possible that, whatever the problem is, it's already solved in recent ejabberd, but I imagine you cannot set up a temporary server with just 1 node running 25.04 to test for a few minutes whether it behaves as in 23.01 or still consumes memory as in 24.12...

Let's assume some change in ejabberd drives the clients crazy, or the clients now trigger some edge case in ejabberd. And let's assume the consumption is concentrated in just 1 feature, 1 process, or 1 process type...

There are several ways to view the erlang processes (and their consumption) that live in an erlang node:

etop

ejabberdctl etop

observer_cli

ejabberdctl module_install ejabberd_observer_cli
ejabberdctl debug
ejabberd_observer_cli:start().

Then navigate in the console: press H then Enter, etc.

Sort the Erlang processes by their memory usage, or by "reductions".


Glossary:

the reduction counter is normally incremented by one for each function and BIF call

Built-In Functions (BIFs) are implemented in C code in the runtime system. BIFs do things that are difficult or impossible to implement in Erlang.

badlop, Jun 17 '25 13:06

@badlop thanks for your response. We are using the ejabberd/ecs Docker image, and neither ejabberdctl etop nor ejabberdctl module_install ejabberd_observer_cli is available. For the etop command we get this response:

Error! Failed to load module 'etop' because it cannot be found. Make sure that the module name is correct and
that its .beam file is in the code path.

And for ejabberdctl module_install ejabberd_observer_cli we get: Error: not_available.

Regarding user behavior: it has not changed from the previous version. We have a few thousand mobile users using the Smack client library for Android. They move from place to place, so they sometimes disconnect from the server when the network is unavailable. We have some groups with more than 100 members, but most groups have 10-20 members. The most important point is that this behavior was the same before we upgraded the server.

relbraun, Jun 19 '25 05:06

> Error! Failed to load module etop

The ecs container image does not include the observer library nor its etop module, which comes from Erlang/OTP.

Solution: you could switch to the ejabberd container image, as that one includes observer, and consequently also etop (if interested, you can check the image differences).

But wait, etop provides very little information anyway, so try the next idea, which works correctly with the ecs container image too:

> And for ejabberdctl module_install ejabberd_observer_cli we get: Error: not_available.

The ecs container image does not include the ejabberd-contrib git repository, and does not include git or mix required to download the dependencies.

This is a step by step solution:

  1. Tell ejabberd to download the ejabberd-contrib git repository:
$ podman exec ejabberd-ejabberd ejabberdctl modules_update_specs
  2. ejabberd_observer_cli depends on other libraries that ejabberd will attempt to download using git or mix... Let's install git in the container image:
$ podman exec --user root ejabberd-ejabberd apk add git

fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/community/x86_64/APKINDEX.tar.gz
(1/9) Installing ca-certificates (20241121-r1)
(2/9) Installing c-ares (1.27.0-r0)
(3/9) Installing libunistring (1.1-r2)
(4/9) Installing libidn2 (2.3.4-r4)
(5/9) Installing nghttp2-libs (1.58.0-r0)
(6/9) Installing libpsl (0.21.5-r0)
(7/9) Installing libcurl (8.12.1-r0)
(8/9) Installing pcre2 (10.42-r2)
(9/9) Installing git (2.43.6-r0)
Executing busybox-1.36.1-r19.trigger
Executing ca-certificates-20241121-r1.trigger
OK: 48 MiB in 72 packages
  3. Let's download dependencies, compile everything and install it:
$ podman exec ejabberd-ejabberd ejabberdctl module_install ejabberd_observer_cli

I'll download "observer_cli" using git because I can't use Mix to fetch from hex.pm:
Runtime terminating during boot ('cannot expand $RELEASE_LIB in bootfile')

Crash dump is being written to: /home/ejabberd/logs/erl_crash_20250619-215914.dump...done
I'll download "recon" using git because I can't use Mix to fetch from hex.pm:
Runtime terminating during boot ('cannot expand $RELEASE_LIB in bootfile')

Crash dump is being written to: /home/ejabberd/logs/erl_crash_20250619-215914.dump...done
Fetching dependency observer_cli: Cloning into 'observer_cli'...
Fetching dependency os_stats: Cloning into 'os_stats'...
Fetching dependency recon: Cloning into 'recon'...
Inlining: inline_size=24 inline_effort=150
Old inliner: threshold=0 functions=[{insert,2},{merge,2}]
Module ejabberd_observer_cli has been installed.
Now you can configure it in your ejabberd.yml
I'll download "observer_cli" using git because I can't use Mix to fetch from hex.pm:
Runtime terminating during boot ('cannot expand $RELEASE_LIB in bootfile')
  4. It showed a few error messages, but in reality all 29 files are correctly installed:
$ podman exec ejabberd-ejabberd ls .ejabberd-modules/ejabberd_observer_cli/ebin | wc -l
29
  5. Let's start an erlang shell attached to the running ejabberd node, and then start ejabberd_observer_cli:
$ podman exec -it ejabberd-ejabberd ejabberdctl debug
Erlang/OTP 26 [erts-14.2.1] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [jit]

Eshell V14.2.1 (press Ctrl+G to abort, type help(). for help)
(ejabberd@localhost)1> ejabberd_observer_cli:start().
  6. This clears the shell window and displays a few ejabberd statistics. Now press H then Enter to view the main erlang statistics, and let's hope you get some clue about what is consuming so much memory in your server.
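
The non-interactive steps above could be collected into one small script. This is only a sketch: it assumes podman and the container name ejabberd-ejabberd used in the commands above, and `CONTAINER`, `DRY_RUN`, `run` and `install_observer_cli` are names introduced here for illustration, not ejabberd settings. By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch only: replays the install commands from the steps above.
# CONTAINER and DRY_RUN are illustrative variables, not part of ejabberd.
CONTAINER="${CONTAINER:-ejabberd-ejabberd}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

install_observer_cli() {
    # Fetch the ejabberd-contrib module specs.
    run podman exec "$CONTAINER" ejabberdctl modules_update_specs
    # Install git as root so the dependencies can be cloned.
    run podman exec --user root "$CONTAINER" apk add git
    # Download, compile and install the module; the boot errors seen in the
    # transcript above did not prevent the installation there.
    run podman exec "$CONTAINER" ejabberdctl module_install ejabberd_observer_cli
    # List the installed files (29 entries in the run shown above).
    run podman exec "$CONTAINER" ls .ejabberd-modules/ejabberd_observer_cli/ebin
}

install_observer_cli
```

Run it with `DRY_RUN=0` to execute the commands, then attach with `ejabberdctl debug` and call `ejabberd_observer_cli:start().` as in the last two steps.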

badlop, Jun 19 '25 22:06