ambari-cassandra-service

WebUI shows no nodes live when they're actually up and pass health checks

Open seglo opened this issue 8 years ago • 14 comments

I was able to get the plugin working. I'm using this on CentOS, and I had to install the DataStax repo for yum before anything would work (can this be automated?), but my main issue now is that the UI is reporting inconsistent information.

The health checks for the "Cluster Nodes" are working (why are they called this? Shouldn't the name be more descriptive, like "C* Nodes"?), but the Ambari UI shows the following:

ambari-cassandra (ignore the 4 warning alerts, they're not related to Cassandra)

When I run a nodetool status you can see all my nodes are up:

Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.147.0.23  87.84 KB   256     51.4%             300c7c50-e1ca-4979-8fc4-0d7bf48e766b  RAC1
UN  10.147.0.22  192.27 KB  256     52.0%             521ffe0d-4a32-4e29-8862-d9297c53e8d2  RAC1
UN  10.147.0.21  234.35 KB  256     48.9%             3c1f75d3-c111-45f0-85bc-cc0a795c5cad  RAC1
UN  10.147.0.24  241.48 KB  256     47.7%             24b59f0b-24d4-4322-900c-4657f37e05af  RAC1

seglo avatar Apr 26 '16 12:04 seglo

I've just redeployed a cluster and the issue remains. Any suggestions?

seglo avatar May 05 '16 13:05 seglo

This should not happen. You can check Ambari-agent logs and server logs if there are any exceptions.

ajak6 avatar May 05 '16 15:05 ajak6

I get these errors for my 3 C* nodes in ambari-agent.log.

2016-05-05 16:04:58,281 [CRITICAL] [Cassandra] [Cassandra_service] (Cassandra Service Process) Connection failed: [Errno 111] Connection refused to ip-10-147-0-23.ec2.internal:7000
2016-05-05 16:05:01,265 [CRITICAL] [Cassandra] [Cassandra_service] (Cassandra Service Process) Connection failed: [Errno 111] Connection refused to ip-10-147-0-22.ec2.internal:7000
2016-05-05 16:05:03,661 [CRITICAL] [Cassandra] [Cassandra_service] (Cassandra Service Process) Connection failed: [Errno 111] Connection refused to ip-10-147-0-21.ec2.internal:7000

Yet I can connect to these host:port pairs from the machine ambari-server is installed on.

[centos@ip-10-147-0-10 ambari-server]$ telnet ip-10-147-0-21.ec2.internal 7000
Trying 10.147.0.21...
Connected to ip-10-147-0-21.ec2.internal.

I also have no problem running CQLSH and connecting to the cluster.
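
The telnet test above can also be scripted, which makes it easier to repeat from each agent host as well as from the ambari-server machine. This is a minimal sketch in Python 3; `port_is_open` is a hypothetical helper, not part of the plugin or of Ambari.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        # create_connection resolves the host name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

For example, `port_is_open("ip-10-147-0-21.ec2.internal", 7000)` performs the same check as the telnet command, using the host name and port from the log lines above.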

seglo avatar May 05 '16 16:05 seglo

Were you able to resolve the issue?

ajak6 avatar May 26 '16 17:05 ajak6

No I haven't. I was going to look into it some more soon. Has anyone else reported this problem?

seglo avatar May 26 '16 17:05 seglo

What's really strange about this is that the heartbeats seem to be working fine and Cassandra is indeed running (notice it says "No Alerts"), but this summary window says 0/3 nodes are live. What part of the plugin code is responsible for indicating whether a Cluster Node is live on this view?

ambari-cassanda-cluster-nodes

seglo avatar May 27 '16 20:05 seglo

Probably a symptom of the same problem. When I go into a specific host it shows the Cassandra service as not started, even though it's running.

ambari-cassandra-server-not-started

seglo avatar May 27 '16 21:05 seglo

This might be an issue with the status function. Can you please confirm if there are no exceptions being thrown here?

The recommended way for defining the status function is as follows: Run some command to check if the component is running.

  • If the component is running, do not throw any errors; the command should exit with return code 0.
  • If the component is not running, raise ComponentIsNotRunning exception.
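
The contract described above can be sketched outside of Ambari as follows. This is a minimal, self-contained illustration in Python 3, not the plugin's code: `ComponentIsNotRunning` here is a local stand-in for the exception of the same name in Ambari's `resource_management` library, and `check_status` is a hypothetical helper.

```python
import subprocess
import sys


class ComponentIsNotRunning(Exception):
    """Local stand-in for Ambari's exception of the same name."""


def check_status(command):
    """Run `command` (a list of arguments) and translate its exit code.

    Returns None when the command exits with 0 and raises
    ComponentIsNotRunning for any other exit code.
    """
    result = subprocess.run(command, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    if result.returncode != 0:
        raise ComponentIsNotRunning(
            "status command exited with %d" % result.returncode)


if __name__ == "__main__":
    # Placeholder command that always exits 0; a real check would run the
    # service's own status command instead.
    check_status([sys.executable, "-c", "raise SystemExit(0)"])
    print("component is running")
```

The point of the sketch is that the exit code alone decides the outcome, so a status command that returns a non-zero code for a healthy service will make the component look stopped.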

mithmatt avatar May 30 '16 18:05 mithmatt

@mithmatt I'll add some exception handling and confirm the return code.

Earlier I did actually stick a debug statement in the status function, but it never appeared to be executed.

seglo avatar May 31 '16 01:05 seglo

The status function in the Python file is executed by Ambari for the heartbeat. I tried reinstalling the service and I don't see the issue. screen shot 2016-06-02 at 11 53 45 am

What OS version are you using? What is the HDP stack version you are using? What is the Ambari version? Try changing the status method in cassandra_master.py to check the pid file by passing the path of the pid file to the check_process_status method.
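
The idea behind a pid-file check can be sketched as follows. This is a minimal, POSIX-only illustration in Python 3 and does not reproduce Ambari's `check_process_status`: `check_pid_file` is a hypothetical helper, `ComponentIsNotRunning` is a local stand-in for Ambari's exception, and the pid file path is left to the caller because it depends on the installation.

```python
import os


class ComponentIsNotRunning(Exception):
    """Local stand-in for Ambari's exception of the same name."""


def check_pid_file(pid_file):
    """Raise ComponentIsNotRunning unless pid_file names a live process."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        # Missing or unreadable file, or contents that are not a number.
        raise ComponentIsNotRunning("no usable pid in %s" % pid_file)
    try:
        # Signal 0 only checks that the process exists; nothing is delivered.
        os.kill(pid, 0)
    except ProcessLookupError:
        raise ComponentIsNotRunning("process %d is not running" % pid)
    except PermissionError:
        # The process exists but is owned by another user.
        pass
```

Unlike running `service cassandra status`, this approach does not depend on the exit code of an init script, which is why it can behave differently on SysV and systemd hosts.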

ajak6 avatar Jun 02 '16 19:06 ajak6

For some reason service cassandra status was returning an exit code of 3 even though the service was running successfully.

I'm running CentOS 7, so I'm using systemd. The equivalent systemd command returned an exit code of 0. When I updated the status command in cassandra_master.py to systemctl status cassandra, the "warning" icon flipped to an "ok".

[centos@ip-10-147-0-21 ~]$ ./saferuncommand.sh sudo systemctl status cassandra
● cassandra.service - SYSV: Starts and stops Cassandra
   Loaded: loaded (/etc/rc.d/init.d/cassandra)
   Active: active (exited) since Thu 2016-06-02 17:23:02 UTC; 1 day 2h ago
     Docs: man:systemd-sysv-generator(8)
  Process: 32132 ExecStop=/etc/rc.d/init.d/cassandra stop (code=exited, status=1/FAILURE)
  Process: 32182 ExecStart=/etc/rc.d/init.d/cassandra start (code=exited, status=0/SUCCESS)

Jun 02 17:23:02 ip-10-147-0-21 systemd[1]: Starting SYSV: Starts and stops Cassandra...
Jun 02 17:23:02 ip-10-147-0-21 su[32189]: (to cassandra) root on none
Jun 02 17:23:02 ip-10-147-0-21 systemd[1]: Started SYSV: Starts and stops Cassandra.
Jun 02 17:23:02 ip-10-147-0-21 cassandra[32182]: Starting Cassandra: OK

0

[centos@ip-10-147-0-21 ~]$ ./saferuncommand.sh sudo service cassandra status
● cassandra.service - SYSV: Starts and stops Cassandra
   Loaded: loaded (/etc/rc.d/init.d/cassandra)
   Active: active (exited) since Thu 2016-06-02 17:23:02 UTC; 1 day 2h ago
     Docs: man:systemd-sysv-generator(8)
  Process: 32132 ExecStop=/etc/rc.d/init.d/cassandra stop (code=exited, status=1/FAILURE)
  Process: 32182 ExecStart=/etc/rc.d/init.d/cassandra start (code=exited, status=0/SUCCESS)

Jun 02 17:23:02 ip-10-147-0-21 systemd[1]: Starting SYSV: Starts and stops Cassandra...
Jun 02 17:23:02 ip-10-147-0-21 su[32189]: (to cassandra) root on none
Jun 02 17:23:02 ip-10-147-0-21 systemd[1]: Started SYSV: Starts and stops Cassandra.
Jun 02 17:23:02 ip-10-147-0-21 cassandra[32182]: Starting Cassandra: OK

3

seglo avatar Jun 03 '16 20:06 seglo

Yes, for CentOS it's good to use systemctl. If it is resolved, please close the issue.

ajak6 avatar Jun 03 '16 20:06 ajak6

Would you accept a PR that switches based on whether systemctl is present?

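    def status(self, env):
        import params
        env.set_params(params)
        # Prefer systemctl where it is present (e.g. CentOS 7);
        # otherwise fall back to the SysV service command.
        status_cmd = format("""
            if hash systemctl 2>/dev/null; then
              systemctl status cassandra
            else
              service cassandra status
            fi""")
        # Execute raises if the command exits with a non-zero code.
        Execute(status_cmd)
        print('Status of the Master')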
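
The same switch can be expressed in Python rather than in shell. This is a minimal sketch assuming Python 3 (for `shutil.which`); `status_command` is a hypothetical helper, and the `which` parameter exists only so the choice can be exercised without depending on what is installed locally.

```python
import shutil


def status_command(service="cassandra", which=shutil.which):
    """Return the argument list for the status command to run.

    Uses systemctl when it is found on PATH and falls back to the
    SysV `service` command otherwise.
    """
    if which("systemctl"):
        return ["systemctl", "status", service]
    return ["service", service, "status"]


if __name__ == "__main__":
    # Prints whichever command would be chosen on the current machine.
    print(" ".join(status_command()))
```

Returning an argument list keeps the decision separate from the execution, so the chosen command can be logged or tested before it is run.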

seglo avatar Jun 03 '16 20:06 seglo

@seglo's solution worked for me.

I had the same issue on the same OS (CentOS).

capture

ghost avatar Sep 07 '16 11:09 ghost