tigase-server
tigase-server copied to clipboard
Only 1 thread processes all connection requests and queue overflows
Components distribute the load to the different threads and queues according to the hash code calculation of a packet.
Session manager respectively checks not null values of packet.getPacketFrom, packet.getPacketTo and packet.getStanzaTo().getBareJID().
For the initial connection establishment process, starting from the STREAM_OPENED to the TLS_HANDSHAKE_COMPLETE commands, the same thread processes all connection attempts. The reason for that is hashCodeForPacket calculated according to the same value sess-man@HOSTNAME.
Example tigase.stats.sess_man_Processed_packets_thread_IN results in our test env after some period of time.

To resolve this bottleneck, shouldn't we use connection id for better thread distribution?
An example from my local machine. AbstractMessageReceiver.addPacket and addPacketNB logs while opening a stream.
QueueIdx is calculated by sess-man@localhost. (I have 10 CPU core in my machine).
Every connection request will end up in queues 6 and 46.
Instead, queueIdx could be calculated according to c2s@localhost/127.0.0.1_5222_127.0.0.1_53424.
This would speed up connection establishment process.
Better thread utilization
Less risk for queue overflow.
QueueIdx: 6 NAME: message-router TYPE: set ELEM_NAME: iq COMMAND: STREAM_OPENED PACKET_FROM: null PACKET_TO: null STANZA_FROM: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424 STANZA_TO: sess-man@localhostQueueIdx: 46 NAME: sess-man TYPE: set ELEM_NAME: iq COMMAND: STREAM_OPENED PACKET_FROM: null PACKET_TO: null STANZA_FROM: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424 STANZA_TO: sess-man@localhostQueueIdx: 16 NAME: message-router TYPE: result ELEM_NAME: iq COMMAND: null PACKET_FROM: sess-man@localhost PACKET_TO: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424 STANZA_FROM: sess-man@localhost STANZA_TO: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424QueueIdx: 16 NAME: c2s TYPE: result ELEM_NAME: iq COMMAND: null PACKET_FROM: sess-man@localhost PACKET_TO: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424 STANZA_FROM: sess-man@localhost STANZA_TO: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424QueueIdx: 6 NAME: message-router TYPE: get ELEM_NAME: iq COMMAND: GETFEATURES PACKET_FROM: null PACKET_TO: null STANZA_FROM: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424 STANZA_TO: sess-man@localhostQueueIdx: 46 NAME: sess-man TYPE: get ELEM_NAME: iq COMMAND: GETFEATURES PACKET_FROM: null PACKET_TO: null STANZA_FROM: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424 STANZA_TO: sess-man@localhostQueueIdx: 16 NAME: message-router TYPE: result ELEM_NAME: iq COMMAND: GETFEATURES PACKET_FROM: sess-man@localhost PACKET_TO: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424 STANZA_FROM: sess-man@localhost STANZA_TO: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424QueueIdx: 16 NAME: c2s TYPE: result ELEM_NAME: iq COMMAND: GETFEATURES PACKET_FROM: sess-man@localhost PACKET_TO: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424 STANZA_FROM: sess-man@localhost STANZA_TO: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424QueueIdx: 16 NAME: message-router TYPE: null ELEM_NAME: starttls PACKET_FROM: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424 PACKET_TO: sess-man@localhost STANZA_FROM: null STANZA_TO: nullQueueIdx: 96 NAME: sess-man TYPE: null ELEM_NAME: starttls PACKET_FROM: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424 PACKET_TO: sess-man@localhost STANZA_FROM: null STANZA_TO: nullQueueIdx: 39 NAME: message-router TYPE: set ELEM_NAME: iq COMMAND: STARTTLS PACKET_FROM: sess-man@localhost PACKET_TO: null STANZA_FROM: sess-man@localhost STANZA_TO: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424QueueIdx: 16 NAME: c2s TYPE: set ELEM_NAME: iq COMMAND: STARTTLS PACKET_FROM: sess-man@localhost PACKET_TO: null STANZA_FROM: sess-man@localhost STANZA_TO: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424QueueIdx: 6 NAME: message-router TYPE: set ELEM_NAME: iq COMMAND: TLS_HANDSHAKE_COMPLETE PACKET_FROM: null PACKET_TO: null STANZA_FROM: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424 STANZA_TO: sess-man@localhostQueueIdx: 46 NAME: sess-man TYPE: set ELEM_NAME: iq COMMAND: TLS_HANDSHAKE_COMPLETE PACKET_FROM: null PACKET_TO: null STANZA_FROM: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424 STANZA_TO: sess-man@localhostQueueIdx: 6 NAME: message-router TYPE: get ELEM_NAME: iq COMMAND: GETFEATURES PACKET_FROM: null PACKET_TO: null STANZA_FROM: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424 STANZA_TO: sess-man@localhostQueueIdx: 46 NAME: sess-man TYPE: get ELEM_NAME: iq COMMAND: GETFEATURES PACKET_FROM: null PACKET_TO: null STANZA_FROM: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424 STANZA_TO: sess-man@localhostQueueIdx: 16 NAME: c2s TYPE: result ELEM_NAME: iq COMMAND: GETFEATURES PACKET_FROM: sess-man@localhost PACKET_TO: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424 STANZA_FROM: sess-man@localhost STANZA_TO: c2s@localhost/127.0.0.1_5222_127.0.0.1_5342418.QueueIdx: 16 NAME: message-router TYPE: result ELEM_NAME: iq COMMAND: GETFEATURES PACKET_FROM: sess-man@localhost PACKET_TO: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424 STANZA_FROM: sess-man@localhost STANZA_TO: c2s@localhost/127.0.0.1_5222_127.0.0.1_53424
Tigase version: 8.1.0 JVM flavour and version : openjdk 11.0.7 2020-04-14 Operating system/distribution/version : Ubuntu 18.04.1 LTS INTEL(R) XEON(R) PLATINUM 8259CL CPU @ 2.50GHZ GNU/Linux - 1 CPU - 2 virtual CPU - 4.06G RAM
Thank you for analysing this and the comments. Indeed, in this case the distribution may be sub-optimal. What's more, in mobile environments stream opening may be more frequent than actual stanza exchanges due to frequent reconnections, which could further emphasise the issue. Idea behind hashCodeForPacket() is to preserve order of packets for particular user/session, and if session is not yet established, then using packet from seems like a good idea. Feel free to submit Pull Request with proposed changes, and we will gladly review and merge it.
Thanks @woj-tek I am preparing a pull request. Do you have a suggestion for verifying order of the packets? It would be great to know must have test scenarios and maybe a recommendation to automate the process.
We do have https://github.com/tigase/tigase-tts-ng (and also older, legacy and not maintained https://github.com/tigase/tigase-testsuite) which could catch those. Though, in this case, using packetFrom should still maintain correct ordering of packets as incoming ones from the same connection would be processed by the same thread (and queue)
In my tests, packetFrom was null initially. Please take a look at the logs that I shared. I am considering using packet.getStanzaFrom() or packet.getStanzaFrom().getResource() like 127.0.0.1_5222_127.0.0.1_53424
@woj-tek PR is ready for review PR 179
Simple test statistics:
2500 connections, plain connect/disconnect
before change:
c2s/Processed packets thread IN=[783, 804, 749, 737, 758, 779, 814, 791, 852, 864, 911, 840, 875, 909, 959, 965, 933, 931, 932, 923, 901, 887, 902, 963, 922, 868, 667, 808, 835, 779, 810, 821, 854, 795, 768, 794, 797, 794, 757, 783]
c2s/Processed packets thread OUT=[521, 531, 495, 496, 503, 520, 539, 522, 567, 580, 602, 558, 586, 603, 637, 639, 620, 621, 621, 612, 603, 585, 603, 639, 10579, 577, 425, 540, 557, 522, 538, 546, 567, 530, 513, 530, 532, 531, 504, 520]
c2s/Processed packets thread (outliers) IN=mean: 840,00, deviation: 71,20, outliers: [in_26-c2s:667:x0,79]
c2s/Processed packets thread (outliers) OUT=mean: 807,00, deviation: 1565,35, outliers: [out_24-c2s:10579:x13,11]
sess-man/Processed packets thread IN=[278, 315, 297, 298, 314, 340, 359, 324, 315, 290, 287, 252, 270, 270, 270, 252, 252, 288, 315, 324, 306, 279, 288, 306, 315, 298, 270, 234, 234, 243, 278, 313, 342, 305, 298, 314, 333, 342, 306, 287, 243, 216, 198, 198, 189, 180, 180, 198, 252, 10257, 315, 306, 2154, 333, 367, 387, 368, 333, 306, 288, 297, 306, 315, 333, 297, 279, 2000, 306, 323, 279, 260, 233, 225, 225, 215, 216, 199, 189, 198, 233]
sess-man/Processed packets thread OUT=[326, 372, 349, 340, 367, 395, 421, 382, 367, 335, 340, 297, 318, 2787, 314, 297, 296, 336, 371, 381, 353, 330, 336, 359, 370, 351, 321, 274, 274, 282, 326, 364, 401, 355, 346, 362, 384, 395, 358, 335, 283, 255, 235, 232, 223, 210, 213, 235, 296, 337, 370, 357, 2841, 391, 432, 455, 430, 388, 354, 338, 347, 362, 365, 391, 348, 325, 210, 354, 375, 323, 304, 274, 264, 263, 251, 255, 7683, 222, 231, 274]
sess-man/Processed packets thread (outliers) IN=mean: 449,00, deviation: 1139,54, outliers: [in_49-sess-man:10257:x22,84]
sess-man/Processed packets thread (outliers) OUT=mean: 482,00, deviation: 900,08, outliers: [out_13-sess-man:2787:x5,78, out_52-sess-man:2841:x5,89, out_76-sess-man:7683:x15,94]
message-router/Processed packets thread IN=[1130, 1158, 1079, 1068, 1093, 1125, 1173, 1139, 1230, 11219, 1312, 1212, 1266, 3781, 1383, 1391, 1346, 1345, 1346, 1331, 1303, 1277, 1304, 1389, 1330, 1253, 956, 1168, 1206, 1127, 1168, 1184, 1232, 1148, 1110, 1147, 16027, 1148, 1093, 1129]
message-router/Processed packets thread OUT=[2470]
message-router/Processed packets thread (outliers) IN=mean: 1895,00, deviation: 2774,61, outliers: [in_9-message-router:11219:x5,92, in_36-message-router:16027:x8,46]
message-router/Processed packets thread (outliers) OUT=mean: 2470,00, deviation: 0,00
with proposed change
c2s/Processed packets thread IN=[862, 905, 943, 916, 912, 922, 941, 920, 930, 929, 968, 920, 879, 873, 834, 878, 789, 795, 776, 822, 779, 768, 755, 769, 824, 785, 774, 822, 846, 838, 820, 791, 816, 873, 848, 864, 833, 888, 838, 847]
c2s/Processed packets thread OUT=[820, 858, 896, 870, 870, 871, 897, 871, 885, 884, 923, 871, 846, 833, 794, 832, 754, 754, 741, 780, 741, 728, 715, 728, 780, 741, 740, 780, 807, 794, 780, 755, 781, 832, 806, 819, 793, 845, 794, 806]
c2s/Processed packets thread (outliers) IN=mean: 852,00, deviation: 57,48, outliers: [in_10-c2s:968:x1,14]
c2s/Processed packets thread (outliers) OUT=mean: 810,00, deviation: 54,82, outliers: [out_10-c2s:923:x1,14]
sess-man/Processed packets thread IN=[403, 442, 441, 428, 441, 455, 468, 429, 391, 351, 351, 338, 338, 338, 338, 364, 364, 429, 481, 533, 520, 507, 481, 494, 494, 429, 391, 338, 299, 286, 286, 287, 299, 312, 338, 377, 403, 455, 442, 416, 417, 416, 455, 442, 429, 416, 429, 442, 494, 533, 572, 533, 2107, 495, 456, 468, 390, 325, 260, 247, 221, 221, 234, 234, 286, 312, 2463, 442, 508, 508, 494, 468, 482, 520, 468, 442, 390, 390, 352, 390]
sess-man/Processed packets thread OUT=[520, 549, 554, 547, 561, 562, 567, 527, 480, 441, 447, 443, 436, 432, 423, 445, 449, 504, 544, 577, 564, 546, 523, 543, 551, 499, 474, 428, 410, 421, 424, 427, 435, 438, 459, 485, 504, 533, 506, 470, 465, 455, 490, 478, 472, 460, 477, 509, 559, 601, 632, 608, 581, 578, 547, 562, 495, 438, 380, 381, 371, 375, 380, 376, 412, 427, 460, 517, 547, 533, 515, 480, 493, 526, 493, 484, 442, 462, 457, 500]
sess-man/Processed packets thread (outliers) IN=mean: 451,00, deviation: 306,34, outliers: [in_52-sess-man:2107:x4,67, in_66-sess-man:2463:x5,46]
sess-man/Processed packets thread (outliers) OUT=mean: 488,00, deviation: 60,39, outliers: [out_50-sess-man:632:x1,30]
message-router/Processed packets thread IN=[1927, 1981, 2064, 2021, 2029, 2017, 2064, 2034, 2049, 2051, 2130, 2056, 1994, 1973, 1892, 1967, 1828, 1826, 1793, 1866, 1808, 1779, 1746, 1777, 1871, 1794, 1800, 1847, 1883, 1870, 1837, 1776, 1821, 1908, 1874, 1906, 1856, 1960, 1881, 1899]
message-router/Processed packets thread OUT=[2481]
message-router/Processed packets thread (outliers) IN=mean: 1911,00, deviation: 100,42, outliers: [in_10-message-router:2130:x1,11]
message-router/Processed packets thread (outliers) OUT=mean: 2481,00, deviation: 0,00
2500 connections, 10 messages send:
before changes:
c2s/Processed packets thread IN=[1207, 1209, 1234, 1151, 1156, 1153, 1126, 1228, 1107, 1142, 1178, 1283, 1259, 1188, 1288, 1270, 1326, 1241, 1319, 1295, 1325, 1304, 1283, 1332, 1405, 1437, 983, 1310, 1247, 1375, 1281, 1267, 1259, 1404, 1337, 1304, 1280, 1267, 1270, 1166]
c2s/Processed packets thread OUT=[1601, 1607, 1628, 1529, 1530, 1506, 1479, 1603, 1462, 1492, 1546, 1661, 1597, 1533, 1649, 1634, 1699, 1603, 1695, 1663, 1694, 1665, 1633, 1682, 11733, 1815, 1197, 1675, 1594, 1752, 1633, 1636, 1616, 1812, 1743, 1704, 1667, 1667, 1662, 1548]
c2s/Processed packets thread (outliers) IN=mean: 1254,00, deviation: 88,55, outliers: [in_25-c2s:1437:x1,15, in_26-c2s:983:x0,78]
c2s/Processed packets thread (outliers) OUT=mean: 1871,00, deviation: 1582,66, outliers: [out_24-c2s:11733:x6,27]
sess-man/Processed packets thread IN=[1039, 1115, 1103, 952, 958, 973, 998, 1006, 885, 812, 690, 721, 699, 808, 837, 788, 835, 914, 1101, 1182, 1273, 1323, 1245, 1255, 1282, 1406, 1208, 1104, 926, 930, 904, 893, 861, 972, 874, 879, 876, 965, 1021, 919, 891, 825, 869, 924, 922, 888, 846, 960, 941, 11022, 1204, 1304, 2507, 1076, 1158, 1192, 1217, 1033, 990, 818, 775, 697, 732, 760, 811, 757, 1537, 928, 1009, 1179, 1087, 1114, 1115, 1185, 1214, 1191, 1109, 1049, 980, 990]
sess-man/Processed packets thread OUT=[708, 749, 757, 641, 662, 689, 701, 706, 626, 597, 502, 534, 520, 3073, 631, 591, 630, 691, 814, 861, 933, 967, 907, 917, 924, 1015, 867, 790, 674, 678, 654, 643, 620, 693, 613, 628, 617, 669, 725, 640, 630, 592, 620, 665, 655, 18363, 613, 693, 665, 752, 838, 921, 3394, 763, 820, 842, 860, 711, 712, 585, 560, 509, 543, 571, 613, 578, 330, 694, 740, 865, 808, 809, 819, 852, 865, 841, 8231, 744, 686, 695]
sess-man/Processed packets thread (outliers) IN=mean: 1141,00, deviation: 1137,95, outliers: [in_49-sess-man:11022:x9,66]
sess-man/Processed packets thread (outliers) OUT=mean: 1086,00, deviation: 2153,06, outliers: [out_45-sess-man:18363:x16,91, out_76-sess-man:8231:x7,58]
message-archive/Processed packets thread IN=[0, 0, 0, 0, 0, 17634, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
message-archive/Processed packets thread OUT=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
message-archive/Processed packets thread (outliers) IN=mean: 440,00, deviation: 2753,11, outliers: [in_5-message-archive:17634:x40,08]
message-archive/Processed packets thread (outliers) OUT=mean: 0,00, deviation: 0,00
message-router/Processed packets thread IN=[3355, 3358, 3446, 3268, 3287, 3292, 3245, 3453, 3200, 13267, 3334, 3564, 3461, 5832, 3544, 3499, 3638, 3429, 3701, 3539, 3642, 3578, 3537, 3603, 3731, 3851, 2846, 3609, 3442, 3740, 3557, 3542, 3509, 3797, 3655, 3634, 18347, 3527, 3504, 3336]
message-router/Processed packets thread OUT=[2471]
message-router/Processed packets thread (outliers) IN=mean: 4167,00, deviation: 2760,29, outliers: [in_9-message-router:13267:x3,18, in_36-message-router:18347:x4,40]
message-router/Processed packets thread (outliers) OUT=mean: 2471,00, deviation: 0,00
With included changes
c2s/Processed packets thread IN=[1178, 1171, 1245, 1191, 1173, 1218, 1222, 1320, 1336, 1296, 1358, 1524, 1475, 1545, 1598, 1620, 1771, 1728, 1591, 1655, 1634, 1555, 1544, 1518, 1447, 1369, 919, 1177, 1264, 1300, 1238, 1278, 1192, 1180, 1149, 1220, 1278, 1223, 1281, 1204]
c2s/Processed packets thread OUT=[1746, 1755, 1778, 1771, 1743, 1819, 1809, 1886, 1935, 1821, 1991, 2131, 2101, 2158, 2224, 2305, 2382, 2305, 2221, 2216, 2223, 2200, 2181, 2171, 2127, 2010, 1549, 1813, 1800, 1908, 1880, 1895, 1864, 1789, 1856, 1835, 1905, 1809, 1894, 1820]
c2s/Processed packets thread (outliers) IN=mean: 1354,00, deviation: 190,79, outliers: [in_16-c2s:1771:x1,31, in_26-c2s:919:x0,68]
c2s/Processed packets thread (outliers) OUT=mean: 1965,00, deviation: 198,37, outliers: [out_16-c2s:2382:x1,21, out_26-c2s:1549:x0,79]
sess-man/Processed packets thread IN=[1067, 1097, 1174, 1203, 1162, 1256, 1236, 1263, 1253, 1153, 1231, 1263, 1231, 1192, 1197, 1217, 1196, 1180, 1138, 1138, 1107, 1092, 1066, 1115, 1091, 1094, 1122, 1068, 1090, 1185, 1211, 1232, 1224, 1171, 1245, 1160, 1221, 1112, 1145, 1056, 984, 967, 927, 897, 912, 884, 885, 945, 990, 1003, 1094, 1195, 2510, 1265, 1318, 1367, 1473, 1438, 1390, 1391, 1418, 1401, 1392, 1358, 1344, 1230, 2002, 1067, 1040, 1044, 964, 940, 912, 897, 874, 931, 934, 958, 1040, 1061]
sess-man/Processed packets thread OUT=[1191, 1206, 1262, 1287, 1275, 1292, 1214, 1205, 1144, 1086, 1129, 1176, 1143, 1081, 1060, 1072, 1098, 1086, 1036, 1015, 990, 961, 965, 1025, 1028, 1038, 1074, 1063, 1125, 1234, 1257, 1273, 1224, 1206, 1231, 1178, 1171, 1030, 1017, 921, 850, 839, 841, 832, 844, 832, 868, 967, 1026, 1059, 1104, 1159, 1164, 1193, 1252, 1285, 1326, 1304, 1279, 1309, 1352, 1350, 1310, 1268, 1238, 1205, 989, 1042, 957, 906, 813, 763, 753, 768, 781, 841, 878, 962, 1095, 1173]
sess-man/Processed packets thread (outliers) IN=mean: 1166,00, deviation: 232,30, outliers: [in_52-sess-man:2510:x2,15, in_66-sess-man:2002:x1,72]
sess-man/Processed packets thread (outliers) OUT=mean: 1085,00, deviation: 162,69, outliers: [out_72-sess-man:753:x0,69]
message-archive/Processed packets thread IN=[0, 0, 0, 0, 0, 15360, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
message-archive/Processed packets thread OUT=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
message-archive/Processed packets thread (outliers) IN=mean: 384,00, deviation: 2398,08, outliers: [in_5-message-archive:15360:x40,00]
message-archive/Processed packets thread (outliers) OUT=mean: 0,00, deviation: 0,00
message-router/Processed packets thread IN=[3907, 3920, 4005, 4016, 3988, 4067, 4012, 4184, 4229, 4090, 4350, 4598, 4539, 4557, 4662, 4788, 4936, 4825, 4664, 4668, 4696, 4640, 4582, 4592, 4518, 4381, 3737, 4040, 4002, 4170, 4068, 4045, 3952, 3875, 3984, 3972, 4072, 3918, 4128, 4036]
message-router/Processed packets thread OUT=[2465]
message-router/Processed packets thread (outliers) IN=mean: 4260,00, deviation: 321,68, outliers: [in_16-message-router:4936:x1,16]
message-router/Processed packets thread (outliers) OUT=mean: 2465,00, deviation: 0,00
@woj-tek I fixed the unit tests. If you wish,I can improve readability of the tests by changing values with more readable values too.
As your test results states, thread distribution is better. But I notice increased number of queue overflows. I am still analyzing but I think we should be more skeptical about this change.
@karapirinc Yes, please
Merged
After merging TTS tests started to fail with missing packets, which would indicated concurrency issues. For how I reverted the change.
@woj-tek Could you please share reference to the failed tests?