jdiameter
jdiameter copied to clipboard
DWR / DWA not synced causing issues with timeouts
When receiving DWA from a peer, the timeout event and the DWA event are coming almost at the same time, which causes the peer to be disconnected. Two threads are doing it at the same time, which we have a race condition. You can see the log that 2 threads are executing the Event, and the Timeout.
2016-11-02 23:50:46,475 DEBUG [PeerFSMImpl] (Thread-14) Handling event with type [DWA_EVENT] 2016-11-02 23:50:46,475 DEBUG [PeerFSMImpl] (Thread-14) Not performing validation to message since validator is DISABLED. 2016-11-02 23:50:46,475 DEBUG [PeerFSMImpl] (Thread-14) Placing event [Event{name:DWA_EVENT, key:aaa://10.225.20.206:16018, object:MessageImpl{commandCode=280, flags=0}}] into linked blocking queue with remaining capacity: [10000]. 2016-11-02 23:50:46,475 DEBUG [PeerFSMImpl] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_2-1) SPeer{Uri=aaa://10.225.20.206; State=OKAY; con=org.jdiameter.server.impl.io.sctp.SCTPServerConnection@162db65b; incCon{} } FSM switch state: OKAY -> SUSPECT 2016-11-02 23:50:46,477 DEBUG [PeerFSMImpl] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_10-9) Got Event [Event{name:DWA_EVENT, key:aaa://10.225.20.206:16018, object:MessageImpl{commandCode=280, flags=0}}] from Queue 2016-11-02 23:50:46,478 DEBUG [PeerFSMImpl] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_10-9) Process event [Event{name:DWA_EVENT, key:aaa://10.225.20.206:16018, object:MessageImpl{commandCode=280, flags=0}}]. Peer State is [OKAY] 2016-11-02 23:50:55,466 DEBUG [TimerTask] (SLEE-TimerFacility-thread-3) Task with id -6ee284d6:158250d0338:-7fbb is recurring, not removing it locally nor in the cluster 2016-11-02 23:50:55,466 DEBUG [TimerTask] (SLEE-TimerFacility-thread-3) Firing Timer with id -6ee284d6:158250d0338:-7fbb 2016-11-02 23:50:55,466 DEBUG [TimerFacilityTimerTask] (SLEE-TimerFacility-thread-3) Executing task with timer ID -6ee284d6:158250d0338:-7fbb 2016-11-02 23:50:55,466 DEBUG [TimerFacilityTimerTask] (SLEE-TimerFacility-thread-3) Delay till execution is 1 2016-11-02 23:50:55,466 DEBUG [TimerFacilityTimerTask] (SLEE-TimerFacility-thread-3) Remaining executions:2147483647 2016-11-02 23:51:15,519 DEBUG [PeerFSMImpl] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_8-7) Sending timeout event 2016-11-02 23:51:15,520 DEBUG [PeerFSMImpl] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_8-7) Handling event with type [TIMEOUT_EVENT] 2016-11-02 23:51:15,520 DEBUG [PeerFSMImpl] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_8-7) Not performing validation to message since validator is DISABLED. 2016-11-02 23:51:15,520 DEBUG [PeerFSMImpl] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_8-7) Placing event [Event{name:TIMEOUT_EVENT, key:null, object:null}] into linked blocking queue with remaining capacity: [10000]. 2016-11-02 23:51:15,520 DEBUG [PeerFSMImpl] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_8-7) Got Event [Event{name:TIMEOUT_EVENT, key:null, object:null}] from Queue 2016-11-02 23:51:15,520 DEBUG [PeerFSMImpl] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_8-7) Process event [Event{name:TIMEOUT_EVENT, key:null, object:null}]. Peer State is [SUSPECT] 2016-11-02 23:51:15,520 DEBUG [SCTPServerConnection] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_8-7) Disconnecting SCTP Server Connection aaa://10.225.20.206:16018
hi @SummaNetworks , as far as I understand the issue: current state of peer is : SUSPECT, then DWA_EVENT and TIMEOUT_EVENT are coming, thread 1 processes DWA_EVENT, thread 2 processes TIMEPOUT_EVENT and with current state is SUSPECT -> disconnect. Can you please confirm and correct me If i 'm wrong? Thanks.
Hi, I will try to explain :), first we have the DWR event, you can see on the log that this event changes the state to suspect in thread xxx_10-9 (log time 2016-11-02 23:50:46,475) then we have that event DWA is processed at time 2016-11-02 23:50:46,478, the state should be SUSPECT but in this case is OKEY, when the TIMEOUT event is processed (2016-11-02 23:51:15,520) is reading the state SUSPECT and it is closing the connection. In this situation the DWA after the DWR is responsible for changing the state from SUSPECT to OKEY so TIMEOUT reads the OKEY state and sends a new DWR.
hi @SummaNetworks I looked into source code:
//lock.lock();
try {
if (event != null) {
if (event instanceof FsmEvent && queueStat != null && queueStat.isEnabled()) {
timeSumm.inc(System.currentTimeMillis() - ((FsmEvent) event).getCreatedTime());
timeCount.inc();
}
logger.debug("Process event [{}]. Peer State is [{}]", event, state);
getStates()[state.ordinal()].processEvent(event);
}
if (timer != 0 && timer < System.currentTimeMillis()) {
// ZhixiaoLuo: add lock here to avoid 2 timeout events at the same time if 2 threads get into timer=0
// ZhixiaoLuo: use double check strategy to avoid locking most normal cases
lock.lock();
try {
if (timer != 0 && timer < System.currentTimeMillis()) {
timer = 0;
if (state != DOWN) { //without this check this event is fired in DOWN state.... it should not be.
logger.debug("Sending timeout event");
handleEvent(timeOutEvent); //FIXME: check why timer is not killed?
}
}
}
finally {
lock.unlock();
}
}
}
catch (Exception e) {
logger.debug("Error during processing FSM event", e);
}
finally {
// PCB removed lock
//lock.unlock();
}
Follow that the processEvent function is not synced and cause that issue, I cound not reproduce the issue , I tried to synced that function and check the impact on legacy tests, it is ok, no delay or failing tests.
//lock.lock();
try {
if (event != null) {
if (event instanceof FsmEvent && queueStat != null && queueStat.isEnabled()) {
timeSumm.inc(System.currentTimeMillis() - ((FsmEvent) event).getCreatedTime());
timeCount.inc();
}
logger.debug("Process event [{}]. Peer State is [{}]", event, state);
lock.lock();
getStates()[state.ordinal()].processEvent(event);
}
if (timer != 0 && timer < System.currentTimeMillis()) {
// ZhixiaoLuo: add lock here to avoid 2 timeout events at the same time if 2 threads get into timer=0
// ZhixiaoLuo: use double check strategy to avoid locking most normal cases
// lock.lock();
try {
if (timer != 0 && timer < System.currentTimeMillis()) {
timer = 0;
if (state != DOWN) { //without this check this event is fired in DOWN state.... it should not be.
logger.debug("Sending timeout event");
handleEvent(timeOutEvent); //FIXME: check why timer is not killed?
}
}
}
finally {
// lock.unlock();
}
}
}
catch (Exception e) {
logger.debug("Error during processing FSM event", e);
}
finally {
// PCB removed lock
lock.unlock();
}
About the issue, i just wonder why DWA is back so fast, before finish processing DWR? Thanks
Hi xphudin!!! to avoid the synchronization problem we configure the state machine to work only with 1 thread, but then we are facing the following situation, I will try to explain with the log information and I attach the corresponding pcap file
At 10:54:31,568 we receive a DWA_EVENT
2016-11-08 10:54:31,568 DEBUG SCTPTransportServer$ServerAssociationListener SCTP Server received a message of length: [72] 2016-11-08 10:54:31,568 DEBUG SCTPServerConnection Received message of size [72] 2016-11-08 10:54:31,568 DEBUG SCTPServerConnection In onEvent for connection [aaa://10.225.164.62:43395]. Getting lock 2016-11-08 10:54:31,568 DEBUG SCTPServerConnection Listeners for aaa://10.225.164.62:43395 2016-11-08 10:54:31,568 DEBUG SCTPServerConnection Listener [org.jdiameter.client.impl.controller.PeerImpl$1@6ae9c084] 2016-11-08 10:54:31,568 DEBUG SCTPServerConnection Event Queue for aaa://10.225.164.62:43395 2016-11-08 10:54:31,568 DEBUG PeerImpl Receive message type [280] to peer [aaa://10.225.164.62:43395] 2016-11-08 10:54:31,568 DEBUG PeerFSMImpl Handling event with type [DWA_EVENT] 2016-11-08 10:54:31,568 DEBUG PeerFSMImpl Not performing validation to message since validator is DISABLED. 2016-11-08 10:54:31,568 DEBUG PeerFSMImpl Placing event [Event{name:DWA_EVENT, key:aaa://10.225.164.62:43395, object:MessageImpl{commandCode=280, flags=0}}] into linked blocking queue with remaining capacity: [10000].
At 10:54:59,631 we send a DWR_EVENT over the SCTP connection:
2016-11-08 10:54:59,630 DEBUG [PeerImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=DOWN; con=org.jdiameter.server.impl.io.sctp.SCTPServerConnection@6ed51aca; incCon{aaa://10.225.164.62:43395=org.jdiameter.server.impl.io.sctp.SCTPServerConnection@6ed51aca} }_1-1) Send DWR message 2016-11-08 10:54:59,630 DEBUG [PeerImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=DOWN; con=org.jdiameter.server.impl.io.sctp.SCTPServerConnection@6ed51aca; incCon{aaa://10.225.164.62:43395=org.jdiameter.server.impl.io.sctp.SCTPServerConnection@6ed51aca} }_1-1) Calling connection to send message [MessageImpl{commandCode=280, flags=128}] to peer [aaa://vcmS6a] over the network 2016-11-08 10:54:59,630 DEBUG [SCTPTransportServer](FSM-SPeer{Uri=aaa://vcmS6a; State=DOWN; con=org.jdiameter.server.impl.io.sctp.SCTPServerConnection@6ed51aca; incCon{aaa://10.225.164.62:43395=org.jdiameter.server.impl.io.sctp.SCTPServerConnection@6ed51aca} }_1-1) About to send a byte buffer of size [100] over the SCTP 2016-11-08 10:54:59,630 DEBUG [SCTPTransportServer](FSM-SPeer{Uri=aaa://vcmS6a; State=DOWN; con=org.jdiameter.server.impl.io.sctp.SCTPServerConnection@6ed51aca; incCon{aaa://10.225.164.62:43395=org.jdiameter.server.impl.io.sctp.SCTPServerConnection@6ed51aca} }_1-1) Sent a byte buffer of size [100] over SCTP 2016-11-08 10:54:59,630 DEBUG [PeerImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=DOWN; con=org.jdiameter.server.impl.io.sctp.SCTPServerConnection@6ed51aca; incCon{aaa://10.225.164.62:43395=org.jdiameter.server.impl.io.sctp.SCTPServerConnection@6ed51aca} }_1-1) Connection sent message [MessageImpl{commandCode=280, flags=128}] to peer [aaa://vcmS6a] over the network 2016-11-08 10:54:59,631 DEBUG AssociationImpl Tx : Ass=null PayloadData [dataLength=100, complete=true, unordered=false, payloadProtocolId=0, streamNumber=9, data= Start: 0 (0x00) End: 99 (0x63) Length: 100 (0x64) 00: 01 00 00 64 80 00 01 18 00 00 00 00 43 81 6D 62 | ...d.... ....C.mb 10: 5D C0 00 14 00 00 01 08 40 00 00 23 68 73 73 2E | ]....... @..#hss. 20: 76 6D 77 61 72 65 2E 6F 62 75 2D 65 6E 66 69 65 | vmware.o bu-enfie 30: 6C 64 2E 74 65 73 74 00 00 00 01 28 40 00 00 1F | ld.test. ...(@... 40: 76 6D 77 61 72 65 2E 6F 62 75 2D 65 6E 66 69 65 | vmware.o bu-enfie 50: 6C 64 2E 74 65 73 74 00 00 00 01 16 40 00 00 0C | ld.test. ....@... 60: 43 81 6F 67 -- -- -- -- -- -- -- -- -- -- -- -- | C.og
At 10:54:59,635 we receive the DWA_EVENT 2016-11-08 10:54:59,635 DEBUG AssociationImpl Rx : Ass=null PayloadData [dataLength=72, complete=true, unordered=false, payloadProtocolId=0, streamNumber=10, data= Start: 0 (0x00) End: 71 (0x47) Length: 72 (0x48) 00: 01 00 00 48 00 00 01 18 00 00 00 00 43 81 6D 62 | ...H.... ....C.mb 10: 5D C0 00 14 00 00 01 08 40 00 00 0E 76 63 6D 53 | ]....... @...vcmS 20: 36 61 00 00 00 00 01 28 40 00 00 18 62 72 6F 63 | 6a.....( @...broc 30: 61 64 65 2E 74 75 63 68 61 6E 6B 61 00 00 01 0C | ade.tuch anka.... 40: 40 00 00 0C 00 00 07 D1 -- -- -- -- -- -- -- -- | @....... ] 2016-11-08 10:54:59,635 DEBUG SCTPTransportServer$ServerAssociationListener SCTP Server received a message of length: [72] 2016-11-08 10:54:59,635 DEBUG SCTPServerConnection Received message of size [72] 2016-11-08 10:54:59,635 DEBUG SCTPServerConnection In onEvent for connection [aaa://10.225.164.62:43395]. Getting lock 2016-11-08 10:54:59,636 DEBUG SCTPServerConnection Listeners for aaa://10.225.164.62:43395 2016-11-08 10:54:59,636 DEBUG SCTPServerConnection Listener [org.jdiameter.client.impl.controller.PeerImpl$1@6ae9c084] 2016-11-08 10:54:59,636 DEBUG SCTPServerConnection Event Queue for aaa://10.225.164.62:43395 2016-11-08 10:54:59,636 DEBUG PeerImpl Receive message type [280] to peer [aaa://10.225.164.62:43395] 2016-11-08 10:54:59,636 DEBUG PeerFSMImpl Handling event with type [DWA_EVENT] 2016-11-08 10:54:59,636 DEBUG PeerFSMImpl Not performing validation to message since validator is DISABLED. 2016-11-08 10:54:59,636 DEBUG PeerFSMImpl Placing event [Event{name:DWA_EVENT, key:aaa://10.225.164.62:43395, object:MessageImpl{commandCode=280, flags=0}}] into linked blocking queue with remaining capacity: [10000].
At 10:55:31:686 we send a DWR_EVENT but you can check on the pcap file that this DWR_EVENT doesn't appear, any clue?
2016-11-08 10:55:31,686 DEBUG [PeerFSMImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Sending timeout event 2016-11-08 10:55:31,687 DEBUG [PeerFSMImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Handling event with type [TIMEOUT_EVENT] 2016-11-08 10:55:31,687 DEBUG [PeerFSMImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Not performing validation to message since validator is DISABLED. 2016-11-08 10:55:31,687 DEBUG [PeerFSMImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Placing event [Event{name:TIMEOUT_EVENT, key:null, object:null}] into linked blocking queue with remaining capacity: [10000]. 2016-11-08 10:55:31,687 DEBUG [PeerFSMImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Got Event [Event{name:TIMEOUT_EVENT, key:null, object:null}] from Queue 2016-11-08 10:55:31,687 DEBUG [PeerFSMImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Process event [Event{name:TIMEOUT_EVENT, key:null, object:null}]. Peer State is [OKAY] 2016-11-08 10:55:31,687 DEBUG [PeerImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Send DWR message 2016-11-08 10:55:31,687 DEBUG [PeerImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Calling connection to send message [MessageImpl{commandCode=280, flags=128}] to peer [aaa://vcmS6a] over the network 2016-11-08 10:55:31,687 DEBUG [SCTPTransportServer](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) About to send a byte buffer of size [100] over the SCTP 2016-11-08 10:55:31,687 DEBUG [SCTPTransportServer](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Sent a byte buffer of size [100] over SCTP 2016-11-08 10:55:31,687 DEBUG [PeerImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Connection sent message [MessageImpl{commandCode=280, flags=128}] to peer [aaa://vcmS6a] over the network 2016-11-08 10:55:31,687 DEBUG AssociationImpl Tx : Ass=null PayloadData [dataLength=100, complete=true, unordered=false, payloadProtocolId=0, streamNumber=10, data= Start: 0 (0x00) End: 99 (0x63) Length: 100 (0x64) 00: 01 00 00 64 80 00 01 18 00 00 00 00 43 81 6D 63 | ...d.... ....C.mc 10: 5D C0 00 16 00 00 01 08 40 00 00 23 68 73 73 2E | ]....... @..#hss. 20: 76 6D 77 61 72 65 2E 6F 62 75 2D 65 6E 66 69 65 | vmware.o bu-enfie
So before "sending" this ghost DWR_EVENT we don't have a DWA_EVENT in response, so the timer tries to send a new one DWR_EVENT at 10:55:41,701 but as you can see on the pcap file this event also doesn't appear
2016-11-08 10:55:41,701 DEBUG [PeerFSMImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Sending timeout event 2016-11-08 10:55:41,702 DEBUG [PeerFSMImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Handling event with type [TIMEOUT_EVENT] 2016-11-08 10:55:41,702 DEBUG [PeerFSMImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Not performing validation to message since validator is DISABLED. 2016-11-08 10:55:41,702 DEBUG [PeerFSMImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Placing event [Event{name:TIMEOUT_EVENT, key:null, object:null}] into linked blocking queue with remaining capacity: [10000]. 2016-11-08 10:55:41,702 DEBUG [PeerFSMImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Got Event [Event{name:TIMEOUT_EVENT, key:null, object:null}] from Queue 2016-11-08 10:55:41,702 DEBUG [PeerFSMImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Process event [Event{name:TIMEOUT_EVENT, key:null, object:null}]. Peer State is [OKAY] 2016-11-08 10:55:41,702 DEBUG [PeerImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Send DWR message 2016-11-08 10:55:41,702 DEBUG [PeerImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Calling connection to send message [MessageImpl{commandCode=280, flags=128}] to peer [aaa://vcmS6a] over the network 2016-11-08 10:55:41,702 DEBUG [SCTPTransportServer](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) About to send a byte buffer of size [100] over the SCTP 2016-11-08 10:55:41,702 DEBUG [SCTPTransportServer](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Sent a byte buffer of size [100] over SCTP 2016-11-08 10:55:41,702 DEBUG [PeerImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Connection sent message [MessageImpl{commandCode=280, flags=128}] to peer [aaa://vcmS6a] over the network 2016-11-08 10:55:41,703 DEBUG [PeerFSMImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) SPeer{Uri=aaa://vcmS6a; State=OKAY; con=org.jdiameter.server.impl.io.sctp.SCTPServerConnection@6ed51aca; incCon{} } FSM switch state: OKAY -> SUSPECT 2016-11-08 10:55:41,703 DEBUG AssociationImpl Tx : Ass=null PayloadData [dataLength=100, complete=true, unordered=false, payloadProtocolId=0, streamNumber=10, data= Start: 0 (0x00) End: 99 (0x63) Length: 100 (0x64) 00: 01 00 00 64 80 00 01 18 00 00 00 00 43 81 6D 64 | ...d.... ....C.md 10: 5D C0 00 17 00 00 01 08 40 00 00 23 68 73 73 2E | ]....... @..#hss. 20: 76 6D 77 61 72 65 2E 6F 62 75 2D 65 6E 66 69 65 | vmware.o bu-enfie 30: 6C 64 2E 74 65 73 74 00 00 00 01 28 40 00 00 1F | ld.test. ...(@... 40: 76 6D 77 61 72 65 2E 6F 62 75 2D 65 6E 66 69 65 | vmware.o bu-enfie
as you can see, there are HEARTBEAT and HEARTBEAT_ACK on the connection, so the connection is alive, but there is no trace of the two DWR_EVENTs, any clue?
For this test we use only one thread to avoid the "synchronization" problem, but at the end the same problem, the connection SHUTDOWNs.
Thanks
hi @SummaNetworks Did you attach the pcap file ? Thanks
It only accepts zip files :), here it is DWR_DWA.pcap.zip
Unable to replicate, please let us know if it happens again and/or any steps to reproduce.