cacti icon indicating copy to clipboard operation
cacti copied to clipboard

[1.2.22] - When remote poller is in offline mode GUI inaccesible and poller times out

Open bmfmancini opened this issue 3 years ago • 14 comments

We have been performing testing on 1.2.22

During failure testing we found that if you fail the master poller the remote poller GUI is unusable and spine times out

spine spends a lot of time spawning script server processes this behaviour is only seen in offline mode

2022-08-18 14:23:37 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[1] PHP Script Server Child FORK Success
2022-08-18 14:23:55 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[1] Confirmed PHP Script Server running using readfd[17], writefd[16]
2022-08-18 14:23:55 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[2] PHP Script Server Routine Starting
2022-08-18 14:23:55 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[2] PHP Script Server About to FORK Child Process
2022-08-18 14:23:55 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[2] PHP Script Server Child FORK Success
2022-08-18 14:24:13 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[2] Confirmed PHP Script Server running using readfd[19], writefd[18]
2022-08-18 14:24:13 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[3] PHP Script Server Routine Starting
2022-08-18 14:24:13 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[3] PHP Script Server About to FORK Child Process
2022-08-18 14:24:13 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[3] PHP Script Server Child FORK Success
2022-08-18 14:24:31 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[3] Confirmed PHP Script Server running using readfd[21], writefd[20]
2022-08-18 14:24:31 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[4] PHP Script Server Routine Starting
2022-08-18 14:24:31 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[4] PHP Script Server About to FORK Child Process
2022-08-18 14:24:31 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[4] PHP Script Server Child FORK Success
2022-08-18 14:24:49 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[4] Confirmed PHP Script Server running using readfd[23], writefd[22]
2022-08-18 14:24:49 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[5] PHP Script Server Routine Starting
2022-08-18 14:24:49 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[5] PHP Script Server About to FORK Child Process

I checked via packet capture for excessive retries to the primary poller but did not find any

In the logs I am seeing the remote trying to push data to main constantly

2022-08-18 14:37:36 - CMDPHP SQL Backtrace: (/poller.php[759]:poller_push_data_to_main(), /lib/poller.php[1957]:poller_push_table(), /lib/poller.php[1995]:db_execute(), /lib/database.php[272]:db_execute_prepared())
2022-08-18 14:37:36 - CMDPHP ERROR: A DB Exec Failed!, Error: MySQL server has gone away
2022-08-18 14:37:36 - CMDPHP SQL Backtrace: (/poller.php[759]:poller_push_data_to_main(), /lib/poller.php[1957]:poller_push_table(), /lib/poller.php[1995]:db_execute(), /lib/database.php[272]:db_execute_prepared())
2022-08-18 14:37:36 - CMDPHP ERROR: A DB Exec Failed!, Error: MySQL server has gone away
2022-08-18 14:37:36 - CMDPHP SQL Backtrace: (/poller.php[759]:poller_push_data_to_main(), /lib/poller.php[1957]:poller_push_table(), /lib/poller.php[1995]:db_execute(), /lib/database.php[272]:db_execute_prepared())
2022-08-18 14:37:36 - CMDPHP ERROR: A DB Exec Failed!, Error: MySQL server has gone away
2022-08-18 14:37:36 - CMDPHP SQL Backtrace: (/poller.php[759]:poller_push_data_to_main(), /lib/poller.php[1946]:poller_push_table(), /lib/poller.php[1995]:db_execute(), /lib/database.php[272]:db_execute_prepared())
2022-08-18 14:37:36 - CMDPHP ERROR: A DB Exec Failed!, Error: MySQL server has gone away
2022-08-18 14:37:36 - CMDPHP SQL Backtrace: (/poller.php[729]:db_fetch_cell_prepared(), /lib/database.php[484]:db_execute_prepared())
2022-08-18 14:37:36 - CMDPHP ERROR: A DB Cell Failed!, Error: MySQL server has gone away

bmfmancini avatar Aug 18 '22 18:08 bmfmancini

You are running out of connections.

TheWitness avatar Aug 19 '22 11:08 TheWitness

Let me check I didnt see any too many connections logs

bmfmancini avatar Aug 19 '22 11:08 bmfmancini

connections are fine

MariaDB [cacti]> show status like '%max_used%';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| Max_used_connections | 242   |
+----------------------+-------+
1 row in set (0.00 sec)

MariaDB [cacti]> show variables where variable_name =  'max_connections';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 5000  |
+-----------------+-------+
1 row in set (0.00 sec)

bmfmancini avatar Aug 19 '22 12:08 bmfmancini

Ok interesting find if you fail the MariaDB process on the main poller this issue is not seen the remote poller is accessible and polling completes without issues

when the primary server is not reachable when the network connection is down that's when the remote starts acting up the local GUI is not accessible and the logs show polling timing out but the boost table has valid rows

when the network is restored to the primary recovery does kick in and valid records are passed to the main's boost table

bmfmancini avatar Aug 19 '22 13:08 bmfmancini

You should reduce your timeout and retry number. The default's are too high.

TheWitness avatar Aug 21 '22 13:08 TheWitness

Yea I thought the same

I'll try that on Monday

bmfmancini avatar Aug 21 '22 13:08 bmfmancini

Tested the script server timeout dropped it from the default 40 seconds to 20 no difference

bmfmancini avatar Aug 22 '22 13:08 bmfmancini

Brought down retry for spine from 5 to 3 same result

bmfmancini avatar Aug 22 '22 14:08 bmfmancini

I used the same process as you with 1 remote and can not repeat your problem. This might be a plugin issue. What plugins are you running?

TheWitness avatar Aug 24 '22 01:08 TheWitness

I think it's only thold and syslog I'll check in the morning

On Tue., Aug. 23, 2022, 9:03 p.m. TheWitness, @.***> wrote:

I used the same process as you with 1 remote and can not repeat your problem. This might be a plugin issue. What plugins are you running?

— Reply to this email directly, view it on GitHub https://github.com/Cacti/cacti/issues/4896#issuecomment-1225045215, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADGEXTB27CP2VBGQMCZKSSTV2VYHZANCNFSM566DOYMA . You are receiving this because you authored the thread.Message ID: @.***>

bmfmancini avatar Aug 24 '22 01:08 bmfmancini

@TheWitness and I spoke about this to create a setting for main server timeout on the remote pollers using a PDO timeout

example

$DBH = new PDO(
    "mysql:host=$host;dbname=$dbname", 
    $username, 
    $password,
    array(
        PDO::ATTR_TIMEOUT => 5, // in seconds
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION
    )
);

The reason for this is the MySQL/MariaDB client take longer to timeout when the server does not respond if the primary DB is simply down but the server is up the response is connection refused which is immediate while the timeout will wait longer and attempt to retry

bmfmancini avatar Aug 24 '22 15:08 bmfmancini

Also found the following

when the primary server is offline you will see this in the error_log of httpd

[Wed Aug 24 10:48:10.762636 2022] [php7:error] [pid 18633] [client :34869] PHP Fatal error:  Allowed memory size of 576716800 bytes exhausted (tried to allocate 262144 bytes) in /var/www/html/cacti/lib/database.php on line 394
[Wed Aug 24 12:23:30.729189 2022] [php7:error] [pid 7519] [client :38877] PHP Fatal error:  Allowed memory size of 576716800 bytes exhausted (tried to allocate 262144 bytes) in /var/www/html/cacti/lib/database.php on line 548
[Wed Aug 24 12:23:30.738659 2022] [php7:error] [pid 4446] [client :38763] PHP Fatal error:  Allowed memory size of 576716800 bytes exhausted (tried to allocate 262144 bytes) in /var/www/html/cacti/lib/database.php on line 548
[Wed Aug 24 12:23:30.748958 2022] [php7:error] [pid 6597] [client :38782] PHP Fatal error:  Allowed memory size of 576716800 bytes exhausted (tried to allocate 262144 bytes) in /var/www/html/cacti/lib/database.php on line 548
[Wed Aug 24 12:23:30.756671 2022] [php7:error] [pid 7965] [client :38736] PHP Fatal error:  Allowed memory size of 576716800 bytes exhausted (tried to allocate 262144 bytes) in /var/www/html/cacti/lib/database.php on line 548
[Wed Aug 24 12:23:30.769023 2022] [php7:error] [pid 10217] [client :38753] PHP Fatal error:  Allowed memory size of 576716800 bytes exhausted (tried to allocate 262144 bytes) in /var/www/html/cacti/lib/database.php on line 548
[Wed Aug 24 12:23:30.878296 2022] [php7:error] [pid 1652] [client :38732] PHP Fatal error:  Allowed memory size of 576716800 bytes exhausted (tried to allocate 262144 bytes) in /var/www/html/cacti/lib/database.php on line 548
[Wed Aug 24 12:23:30.990902 2022] [php7:error] [pid 4440] [client :38948] PHP Fatal error:  Allowed memory size of 576716800 bytes exhausted (tried to allocate 262144 bytes) in /var/www/html/cacti/lib/database.php on line 548
[Wed Aug 24 12:23:31.041406 2022] [php7:error] [pid 1425] [client :38745] PHP Fatal error:  Allowed memory size of 576716800 bytes exhausted (tried to allocate 262144 bytes) in /var/www/html/cacti/lib/database.php on line 548

Function at line 548

function db_fetch_row_prepared($sql, $params = array(), $log = true, $db_conn = false) {
        global $config;

        if (!empty($config['DEBUG_SQL_FLOW'])) {
                db_echo_sql('db_fetch_row_prepared(\'' . clean_up_lines($sql) . '\', $params = (\'' . implode('\', \'', $params) . '\'), $log = ' . $log . ', $db_conn = ' . ($db_conn ? 'true' : 'false') .')' . "\n");
        }

        return db_execute_prepared($sql, $params, $log, $db_conn, 'Row', false, 'db_fetch_row_return');
}

bmfmancini avatar Aug 24 '22 16:08 bmfmancini

I think we need to have a setting in the GUI to keep the poller offline from the GUI perspective until it comes back up. Something that we need to save in the session. I think this is the only good way to solve this issue.

TheWitness avatar Aug 25 '22 11:08 TheWitness

It's likely I'll be offline for a few days. We'll see.

TheWitness avatar Aug 25 '22 11:08 TheWitness

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Oct 25 '22 00:10 github-actions[bot]

Bump to keep the ticket open

On Mon., Oct. 24, 2022, 8:05 p.m. github-actions[bot], < @.***> wrote:

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

— Reply to this email directly, view it on GitHub https://github.com/Cacti/cacti/issues/4896#issuecomment-1289798591, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADGEXTEEL6WUXNIAYFP45WLWE4P3JANCNFSM566DOYMA . You are receiving this because you authored the thread.Message ID: @.***>

bmfmancini avatar Oct 25 '22 00:10 bmfmancini