ejabberd icon indicating copy to clipboard operation
ejabberd copied to clipboard

LDAP request failed timed out

Open nunohorta opened this issue 9 years ago • 19 comments

I have ejabberd 15.07 running with LDAP authentication. This works fine for 95% of the times. Randomly a user is unable to authenticate. Looking at ejabberd.log the following log is displayed moments after it communicates with the server.

2015-10-21 14:20:47.227 [error] <0.7412.0>@eldap_pool:do_request:75 LDAP request failed: timed out 2015-10-21 14:20:47.227 [info] <0.7412.0>@ejabberd_c2s:wait_for_feature_request:756 ({socket_state,p1_tls,{tlssock,#Port<0.28105>,#Port<0.28107>},<0.7411.0>}) Accepted authentication for johndoe by ejabberd_auth_ldap from 31.112.96.168 2015-10-21 14:20:47.227 [debug] <0.7412.0>@ejabberd_c2s:send_text:1901 Send XML on stream = <<"<success xmlns='urn:ietf:params:xml:ns:xmpp-sasl'/>">> 2015-10-21 14:20:47.227 [debug] <0.7412.0>@ejabberd_socket:send:187 Error in p1_tls:send: {error,einval}

If I try again it will happen the same, over and over, while other users can log in just fine and the chat keeps working correctly. Waiting for longer periods, let's say 10 ou 20 minutes, or restart the server, will allow the user to log in again. Is there anything that I should change on my side to fix this issue?

nunohorta avatar Oct 21 '15 13:10 nunohorta

2015-10-21 14:20:47.227 [error] <0.7412.0>@eldap_pool:do_request:75 LDAP request failed: timed out

Did you check LDAP logs? Did you investigate the disconnections via tcpick or something? There is not enougth information for us to give any advice.

zinid avatar Oct 21 '15 13:10 zinid

I'll investigate those logs as soon as possible!

nunohorta avatar Oct 21 '15 14:10 nunohorta

sorry but I can´t find any relevant LDAP logs. I'll do a better investigation next time this occurs. It can be closed for now, thanks

nunohorta avatar Oct 25 '15 13:10 nunohorta

Hello,

I can confirm this behavior, especially under heavy load. The LDAP server is running on a different machine and there's a Cisco ASA between them. I highly suspect ejabberd to have a permanent connection to LDAP, Cisco ASA reset it after a while, ejabberd is unable to detect the broken link.

I found this blog post saying nearly the same: http://start.nwt.fhstp.ac.at/blog/?p=307

eLvErDe avatar Nov 11 '15 08:11 eLvErDe

Another one: http://lists.jabber.ru/pipermail/ejabberd/2006-September/002167.html

eLvErDe avatar Nov 11 '15 08:11 eLvErDe

Yeah I have the same scenario where LDAP is running on a different server. It's not happening since I created the issue, but it did happen 3 ou 4 times before

nunohorta avatar Nov 11 '15 09:11 nunohorta

According to ejabberd code, I think it now requests for TCP keepalive when establishing LDAP connection: https://github.com/processone/ejabberd/blob/master/src/eldap.erl#L1064

However, we may have a network setup issue here: Default Linux keepalive is 7200s (2 hours) and there's a Cisco ASA firewall between the ejabberd server and the LDAP server which seems to reset the connection after 1 hour of inactivity.

I chanded the server's keepalive parameters:

root@server:~# cat /etc/sysctl.d/98-local-keepalived.conf 
### Cisco ASA closes connection after 1hours inactivity
###
### ejabberd triggers keepalive but never closes and restarts LDAP
### connections pool
###
### Workaround by configuring keepalive shorter than Cisco timeout
# Start keepalive after 30min since last real packet
net.ipv4.tcp_keepalive_time = 1800
# Send keepalive packet every 1 minute
net.ipv4.tcp_keepalive_intvl    = 60
# Consider the connection being dead after 5 tries (5x1min)
net.ipv4.tcp_keepalive_probes   = 5

And I'm quite confident it'll fix the issue.

However, I don't think ejabberd behavior is sane. First I think it shouldn't expect for permanent connection. It should have a maximum lifetime setting and should handle the LDAP connection pool by closing too old connections and opening new ones. Second, it should be able to detect dead connection without the help of keepalive mechanism. What's the point requesting again and again and established connection that returns nothing ? Just close it, reopen and give another try.

I'm not a dev, so I may be saying bullshit.

eLvErDe avatar Nov 11 '15 09:11 eLvErDe

it's happening to me again...I'll try your solution to see if it fixes.

nunohorta avatar Nov 13 '15 11:11 nunohorta

Problem looks fixed for us with my workaround

eLvErDe avatar Nov 13 '15 12:11 eLvErDe

uh? Closed? Why?

There's obviously something wrong in ejabberd behavior. It cannot expect lifetime connection with no ability to recover from a failure imho.

Regards, Adam.

eLvErDe avatar Dec 01 '15 11:12 eLvErDe

Yeah it's still happening here(running 15.10). I have installed tcpick but I can't seem to get it working properly. If you guys could help me with tcpick so I can catch the logs that would be great. Every time this happens I have to restart the server and I simply cannot change what @eLvErDe has changed on this server. Update: I did the same changes as @eLvErDe but it keeps timing out.

nunohorta avatar Dec 02 '15 11:12 nunohorta

Yep same issue here. No Cisco in between though, just one VM running CentOS 7.2 x64 with eJabberd 14.07 from http://files.lucidsolutions.co.nz/linux/centos/ejabberd-el7/ejabberd-14.07-6.el7.centos.x86_64.rpm talking to another VM running MS SBS 2011 AD DS.

k-poole avatar Jul 19 '16 01:07 k-poole

It looks like I see this problem too, whole the way to 17.08 And as @eLvErDe pointed out: I do believe a client (LDAP or else) should never assume endless tcp session life time (especially when using a pool - see src/eldap_pool.erl). From what I've collected, it is recommended to implement a periodical no-op or send an abandon messageId=0 to implement a keep-alive. Unfortunately, my experience with Erlang does not allow me to propose a PR to implement this: :-(

btlogy avatar Nov 27 '17 14:11 btlogy

Any news on this? We've this problem several times a day :( Centos 7, ejabberd 18.01

zavinator avatar Mar 09 '18 09:03 zavinator

I request honestly a solution to this too...

Nearly all users experience an "invalid user / password", because the connection to both (two domaincontrollers) were interrupted. After a few retries (saved login-data) they're able to login (new connection from ejabberd to ldap).

Server: Debian LDAP: Windows AD (2012 r2) Clients: Gajim (Windows), Conversations (Android)

Bundesferkel avatar Jul 20 '18 09:07 Bundesferkel

Same problem for us. LDAP timeout even when low trafic. Ubuntu 16.04.5 LTS ejabberd 16.01-2 Active Directory 2012 R2 pfSense 2.3.4-p1 OS X clients with Apple Messages

stefanotoro avatar Jul 27 '18 13:07 stefanotoro

Today one of the three LDAP servers was not reachable, because it was upgraded. I assumed ejabberd would failover to the other two, but no, I get [error] <0.581.0>@eldap:connect_bind:1078 LDAP connection to ldap.server:636 failed: connection refused and (some) users cannot login. What is the point in having a list of LDAP servers if ejabberd does no failover if one of them is unavailable? ejabberd 18.06-1 debian stretch

photoninger avatar Dec 17 '18 15:12 photoninger

Any news about it?

Neustradamus avatar Jul 04 '21 23:07 Neustradamus

@rkujawa, @eLvErDe, @photoninger, @zavinator, @btlogy, @k-poole, @stefanotoro, @Bundesferkel: What do you think with current version?

Neustradamus avatar May 14 '23 13:05 Neustradamus