ejabberd icon indicating copy to clipboard operation
ejabberd copied to clipboard

mod_vcard_ldap and jpegPhoto enabled causing memory leak

Open timeos opened this issue 9 years ago • 21 comments
trafficstars

What version of ejabberd are you using?

ejabberd-16.04-0.x86_64

What operating system (version) are you using?

CentOS Linux release 7.2.1511 (Core)

How did you install ejabberd (source, package, distribution)?

official RPM package from your webpage

What did not work as expected? Are there error messages in the log? What was the unexpected behavior? What was the expected result?

once we keep avatars enabled/activated within mod_vcard_ldap, then after some time (half to one day) the beam.smp will be killed by schedulers oom-killer due huge amount of memory it consumes.

regards our environment:

  • one server environment
  • mnesia DB as backend
  • we use LDAP authentication with mod_shared_roster_ldap and with mod_vcard_ldap
  • hosts approx 120 users. each user has its own foto/avatar in LDAP. all photos are correctly working in other tools mapped with LDAP.

system resources: 2cores, 1,7GB RAM, 1.5GB SWAP

if photos are enabled, overal load starting to increase with RAM consumption. as you can see from attached kernel snip, beam.smp processes consumes all memory and SWAP.

if photos are disabled, the overal load is minimal: load average: 0.04, 0.03, 0.05, environment stable, RAM memory consumption with about 370MB totally.

Last time issue happen this weekend, when only few users were connected (up to 10 users).

just one more important info at the end: this issue is exactly the same like we had with using ejabberd version 2.1.13 (EPEL package for centos6). Having photos working was one of our reasons to move to the new version.

attaching snip from syslog. attaching configuration snip.

ejabberd-config-snip.txt ejabberd-kernel-log.txt

if anything else needed to provide, just let me know.

thanks

timeos avatar May 09 '16 07:05 timeos

I do not think it is a leak. It is just that parsing / managing photo uses more memory and cause Erlang to need to allocate more memory when photos are received and send. In that context, I think your need larger server.

mremond avatar May 09 '16 08:05 mremond

I cant believe that such small photos (up to few KBs) can cause such huge memory consumption. Even if we are talking about 120 users totally. Yes, I can allocate more memory, but how much can I expect it will consume?

timeos avatar May 09 '16 08:05 timeos

ok, I will do test this evening - I will allocate 5GB of RAM to this system and will monitor tomorrow how it will behave.

timeos avatar May 09 '16 09:05 timeos

ok, seems this is definitely some memory leak - today I have assigned 5GB RAM in total, enabled jpegPhoto LDAP VCARD attributes and start monitoring. At beginning I was able to open "user info" and get photo and other user details, now - after a few minutes - once I click in pidgin on "Get Info" on contact in rooster, I get "Information: Retrieving..." and beam.smp process starts to consume memory - each reload of "top/htop" (one second) I can see that process has increased memory consumption (+10MB each reload) and then (after next few seconds) it started to increase exponentially (+100, +250M, 1000M) and exhaust all memmory (up to 5GB). this is followed with CPU utilization (both assigned cores) and naturally with overal system load. It stops increasing once pidgin retrieve-timeout is reached - then I get contact info without avatar, but with all rest info.

timeos avatar May 09 '16 20:05 timeos

Not sure. Pattern you describe seems to match Erlang memory allocation pattern (usually double allocation from previous time when it needs memory. We would need testing / benchmark to see if there is real memory issue with that feature.

mremond avatar May 10 '16 07:05 mremond

yes, it could be correct allocation pattern, but this huge allocation happened just because I fetch info about one single user - I was seeing this memory allocation progress while I waited for user info ("Information: Retrieving..."). As I said we are talking about photos with up to few kBs of overall capacity. For example, now it is working fine (since morning) - It has allocated up to 900MB RAM just for beam.smp (in comparison to up to 300MB RAM without enabling avatars).

timeos avatar May 10 '16 13:05 timeos

while I wrote previous message, the memory consumption started again to increase with same pattern up to 3GB just for beam.smp and then it was freed again to ~900MB. seems such behaviors happen many times since morning, but now I have just started to have a look on htop.

timeos avatar May 10 '16 13:05 timeos

another-related issue: for example today, I am unable to fetch user info ("Information: Retrieving...") and one CPU Core is at 100% all the time. beam.smp is all the time running. no crashes or errors. nothing special in ejabberd.log (only connection/session-related messages). ejabberd-htop.txt

I think some process is busy with fetching user vcard. it is utilizing CPU and stucked in some state causing unability to get user vcard info successfully.

timeos avatar May 11 '16 07:05 timeos

not sure if it can help anyhow, but I have just did strace for the most utilized process of beam.smp. attaching here 30secs long strace. utilized-beam.smp.zip Strace for beam.smp process with zero utilization gives just: futex(0x7f47c0400210, FUTEX_WAIT_PRIVATE, 4294967295, NULL <detached ...>

timeos avatar May 11 '16 11:05 timeos

Now I get feedback from users, that some of their messages to another users within same domain are lost - they did not simply arrived to recipients. this again started after I enabled avatars in mod_vcard_ldap since yesterdays. this did not happen before. Per their feedback, it happen for up to 5 users since yesterday morning - once I re-enabled avatars for this testing purpose with increased memory for this service.

timeos avatar May 11 '16 12:05 timeos

Ok, one more report from me: since I disabled photos last afternoon in mod_vcard_ldap, system is constantly idle with load: load average: 0.00, 0.01, 0.05 and with constant memory consumption ~130MB RAM for beam.smp processes. Not even ocasional higher consumption. all beam.smp procesess are in S state (no multiple processes in "R"unning stat in list like before). everything is fast, gathering user info without even small delay. but yes, photos are now disabled. attaching resource utilization graphs for CPU, MEM and Load for both cases: with enabled and disabled photos in mod_vcard_ldap. I disabled photos at ~3PM 11th May. server-graphs.zip Attaching htop output as well: ejabberd-htop-stable.txt

timeos avatar May 12 '16 09:05 timeos

Hi all.

I can say I am experiencing very similar problems, just installed latest ejabberd on CentOS 7.2. Everything was stable and fast until mod_vcard_ldap was enabled. After this, same problems appeared as @timeos is having. Even trying to get vcard information for a single user is causing high cpu load by beam.smp.

My configuration is as simple as it can get. The XMPP service is in alpha testing in my company, so I only have a couple of users now.

Here's a part of my LDAP-related configuration:

auth_method: ldap
ldap_servers:
  - "id.osec.pl"
ldap_encrypt: tls
ldap_tls_verify: soft
ldap_port: 636
ldap_rootdn: "uid=xxxxxx,cn=users,cn=accounts,dc=osec,dc=pl"
ldap_password: "xxxxx"
ldap_base: "dc=osec,dc=pl"
...
modules:
  mod_shared_roster_ldap:
    ldap_base: "dc=osec,dc=pl"
    ldap_rfilter: "(ipaUniqueID=xxxx)"
    ldap_gfilter: "(&(objectClass=ipausergroup)(cn=%g))"
    ldap_groupdesc: "description"
    ldap_memberattr: "member"
    ldap_memberattr_format: "uid=%u,cn=users,cn=accounts,dc=osec,dc=pl"
    ldap_ufilter: "(&(objectClass=inetOrgPerson)(uid=%u))"
    ldap_userdesc: "displayName"
  mod_vcard_ldap: {}

rkujawa avatar Jun 17 '16 14:06 rkujawa

Commenting out mod_vcard_ldap fixed the load problem, but of course it means I can't access vCard information based on LDAP data...

rkujawa avatar Jun 17 '16 14:06 rkujawa

I've managed to fix the problem. In my case it was caused by a PNG image loaded into jpegPhoto LDAP attribute. Once removed, everything works correctly.

@timeos You might want to check if you perhaps have something else than JPEGs loaded as photos into your LDAP directory.

rkujawa avatar Jun 26 '16 08:06 rkujawa

On a side note, perhaps ejabberd could do some sanity checking on the jpegPhoto data and ignore it, if there's no proper JPEG.

rkujawa avatar Jun 26 '16 08:06 rkujawa

@Rado: seems this is really the case. In our DB we had two contacts with avatars in BMP format instead of JPEG. Now I have converted them, enabled back photos in mod_vcard_ldap and looks pretty stable so far. Many thanks for your valuable info! I am a bit surprised that ejabberd daemon/server is reading photo information - I thought that photo is just "transported" within XMPP to client and only client is looking/reading picture data in order to get that picture displayed.

Yes, I agree with implementing some check on ejabberd side to confirm if whether jpegPhoto is in correct format.

timeos avatar Jun 26 '16 12:06 timeos

Update: after two production days since user photo format correction in LDAP the IM server load is on minimal value (0.01) with stable memory consumption. Thus I can confirm this is really the cause of this issue. so if possible, proceed with sanity check creation in order to proactively avoid of occurrence such mistakes - as currently single photo with wrong format can cause critical load and stability issue on whole ejabberd server.

timeos avatar Jun 28 '16 11:06 timeos

Hi, I'm trying to reproduce the problem, and then I'll write a patch proposal. But as I am not an LDAP admin at all, I'm having trouble setting all the required. So, maybe you can provide me enough information to reproduce the problem without setting all the required:

Can you add this debug line to mod_vcard_ldap.erl, recompile that file, install it, restart ejabberd (or reload the module at runtime), then request vcard with a good photo, and another with a problematic photo, and show here the logged lines? (Of course, you can remove anything sensitive that the log may show)

diff --git a/src/mod_vcard_ldap.erl b/src/mod_vcard_ldap.erl
index f1f0764..6564025 100644
--- a/src/mod_vcard_ldap.erl
+++ b/src/mod_vcard_ldap.erl
@@ -95,6 +95,7 @@ get_vcard(LUser, LServer) ->
     VCardMap = State#state.vcard_map,
     case find_ldap_user(LUser, State) of
 	#eldap_entry{attributes = Attributes} ->
+	    ?INFO_MSG("get_vcard of user ~p:~n~p", [LUser, Attributes]),
 	    VCard = ldap_attributes_to_vcard(Attributes, VCardMap,
 					     {LUser, LServer}),
 	    {ok, [xmpp:encode(VCard)]};

badlop avatar Aug 08 '17 12:08 badlop

hello badlop unfortunately I cant test it but I can tell you that this could me more "image format" rather than LDAP itself. my understanding is following:

  • mod_vcard_ldap fetches data located in jpegPhoto LDAP attribute of specific user.
  • that photo data are most probably forwarded to different subsystem (mod_vcard directly?)
  • but this new subsystem expects JPEG picture format, but instead it get BMP format (maybe also other picture formats are doing same issue)
  • and then funny things starts to happen - picture is unable to be loaded by xmpp client, mod_vcard_ldap retries to get same picture from LDAP, beam process is in loop and CPU utilization is raising up.
  • this all is causing xmpp service unusable - messages from various users are not received by recipients, etc...

maybe I will have a time later in this month to reproduce, but actually I am unable to do it now. anyway thanks that you started to follow up on this, as this is very annoying. now we take care and check whether photo is in right format once user is uploading new avatar.

timeos avatar Aug 10 '17 19:08 timeos

@timeos: Any news about it?

Neustradamus avatar Jul 04 '21 23:07 Neustradamus

@rkujawa, @timeos: What do you think with current version? And about the @badlop patch...

Neustradamus avatar May 14 '23 13:05 Neustradamus