Hi!
Thanks for this great tool!
I'm having some issues getting the exporter working. I'm using the 1.8.0 binary. What versions of Aerospike are compatible? I'm using community edition 3.6.3.
2020/04/28 23:25:41 starting asprom. listening on :9145
2020/04/28 23:25:45 latency: missing measurements line
2020/04/28 23:25:50 latency: missing measurements line
2020/04/28 23:25:55 latency: missing measurements line
$ asinfo -l -v latency:
reads:23:29:34-GMT,ops/sec,>1ms,>8ms,>64ms
23:29:44,0.0,0.00,0.00,0.00
writes_master:23:29:34-GMT,ops/sec,>1ms,>8ms,>64ms
23:29:44,0.0,0.00,0.00,0.00
proxy:23:29:34-GMT,ops/sec,>1ms,>8ms,>64ms
23:29:44,0.0,0.00,0.00,0.00
udf:23:29:34-GMT,ops/sec,>1ms,>8ms,>64ms
23:29:44,0.0,0.00,0.00,0.00
query:23:29:34-GMT,ops/sec,>1ms,>8ms,>64ms
23:29:44,0.0,0.00,0.00,0.00
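For context on what that error means: the exporter expects each latency header line ("name:timestamp,columns") to be immediately followed by a measurements line, and complains with "missing measurements line" when that pairing breaks. Here's a rough illustration of that pairing check (not asprom's actual code) using the asinfo output above, plus one deliberately unpaired header:

```shell
# Sketch of the header/measurements pairing expected in "latency:" output.
# Header lines carry a metric name and a GMT timestamp; each must be
# followed by a measurements line. The last header below is deliberately
# left unpaired to trigger the error.
cat > /tmp/latency.txt <<'EOF'
reads:23:29:34-GMT,ops/sec,>1ms,>8ms,>64ms
23:29:44,0.0,0.00,0.00,0.00
writes_master:23:29:34-GMT,ops/sec,>1ms,>8ms,>64ms
23:29:44,0.0,0.00,0.00,0.00
udf:23:29:34-GMT,ops/sec,>1ms,>8ms,>64ms
EOF
awk -F: '
  /GMT/ { if (pending) print "missing measurements line after " name
          pending = 1; name = $1; next }
        { pending = 0 }
  END   { if (pending) print "missing measurements line after " name }
' /tmp/latency.txt
```

So if the server returns headers without measurement rows (or an empty measurements block), you'd see exactly the log lines pasted above.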
It looks like the node is not up?
# HELP aerospike_node_scrapes_total Total number of times Aerospike was scraped for metrics.
# TYPE aerospike_node_scrapes_total counter
aerospike_node_scrapes_total 7
# HELP aerospike_node_up Is this node up
# TYPE aerospike_node_up gauge
aerospike_node_up 0
Any reason you're using such an old version?
We're using 3.14.1.1 in production, and 3.6.3 in our staging environment. Haven't upgraded to version 4 because the community edition has a limit of 2 namespaces.
Fair enough :-) I'll try to take a look at that version of the CE and the exporter later. I'll let you know.
@rodmccutcheon Can you describe your setup a bit? Because there doesn't seem to be any inherent incompatibility between the two versions.
What I did to test:
#term1
❯ docker run --name aerospike -p 9145:9145 aerospike/aerospike-server:3.6.3
#term2
❯ docker cp asprom aerospike:/tmp
❯ docker exec -it aerospike /bin/bash
root@10a6fd378e63:/# /tmp/asprom
2020/04/29 00:38:15 starting asprom. listening on :9145
#term3
❯ curl localhost:9145/metrics
# HELP aerospike_node_batch_index_complete batch index complete
# TYPE aerospike_node_batch_index_complete gauge
aerospike_node_batch_index_complete 0
# HELP aerospike_node_batch_index_initiate batch index initiate
# TYPE aerospike_node_batch_index_initiate gauge
aerospike_node_batch_index_initiate 0
# HELP aerospike_node_batch_index_timeout batch index timeout
# TYPE aerospike_node_batch_index_timeout gauge
aerospike_node_batch_index_timeout 0
# HELP aerospike_node_batch_index_unused_buffers batch index unused buffers
# TYPE aerospike_node_batch_index_unused_buffers gauge
aerospike_node_batch_index_unused_buffers 0
# HELP aerospike_node_batch_initiate batch initiate
# TYPE aerospike_node_batch_initiate gauge
aerospike_node_batch_initiate 0
# HELP aerospike_node_batch_queue batch queue
# TYPE aerospike_node_batch_queue gauge
aerospike_node_batch_queue 0
# HELP aerospike_node_batch_timeout batch timeout
# TYPE aerospike_node_batch_timeout gauge
aerospike_node_batch_timeout 0
# HELP aerospike_node_client_connections client connections
# TYPE aerospike_node_client_connections gauge
aerospike_node_client_connections 1
# HELP aerospike_node_cluster_size cluster size
# TYPE aerospike_node_cluster_size gauge
aerospike_node_cluster_size 1
# HELP aerospike_node_delete_queue delete queue
# TYPE aerospike_node_delete_queue gauge
aerospike_node_delete_queue 0
# HELP aerospike_node_heartbeat_received_foreign heartbeat received foreign
# TYPE aerospike_node_heartbeat_received_foreign counter
aerospike_node_heartbeat_received_foreign 0
# HELP aerospike_node_heartbeat_received_self heartbeat received self
# TYPE aerospike_node_heartbeat_received_self counter
aerospike_node_heartbeat_received_self 0
# HELP aerospike_node_info_queue info queue
# TYPE aerospike_node_info_queue gauge
aerospike_node_info_queue 0
# HELP aerospike_node_objects objects
# TYPE aerospike_node_objects gauge
aerospike_node_objects 0
# HELP aerospike_node_query_long_running query long running
# TYPE aerospike_node_query_long_running gauge
aerospike_node_query_long_running 0
# HELP aerospike_node_query_short_running query short running
# TYPE aerospike_node_query_short_running gauge
aerospike_node_query_short_running 0
# HELP aerospike_node_reaped_fds reaped fds
# TYPE aerospike_node_reaped_fds counter
aerospike_node_reaped_fds 0
# HELP aerospike_node_scans_active scans active
# TYPE aerospike_node_scans_active gauge
aerospike_node_scans_active 0
# HELP aerospike_node_scrapes_total Total number of times Aerospike was scraped for metrics.
# TYPE aerospike_node_scrapes_total counter
aerospike_node_scrapes_total 1
# HELP aerospike_node_sindex_gc_garbage_cleaned sindex gc garbage cleaned
# TYPE aerospike_node_sindex_gc_garbage_cleaned gauge
aerospike_node_sindex_gc_garbage_cleaned 0
# HELP aerospike_node_sindex_gc_garbage_found sindex gc garbage found
# TYPE aerospike_node_sindex_gc_garbage_found gauge
aerospike_node_sindex_gc_garbage_found 0
# HELP aerospike_node_sindex_gc_list_creation_time sindex gc list creation time
# TYPE aerospike_node_sindex_gc_list_creation_time gauge
aerospike_node_sindex_gc_list_creation_time 0
# HELP aerospike_node_sindex_gc_list_deletion_time sindex gc list deletion time
# TYPE aerospike_node_sindex_gc_list_deletion_time gauge
aerospike_node_sindex_gc_list_deletion_time 0
# HELP aerospike_node_sindex_gc_locktimedout sindex gc locktimedout
# TYPE aerospike_node_sindex_gc_locktimedout gauge
aerospike_node_sindex_gc_locktimedout 0
# HELP aerospike_node_sindex_gc_objects_validated sindex gc objects validated
# TYPE aerospike_node_sindex_gc_objects_validated gauge
aerospike_node_sindex_gc_objects_validated 0
# HELP aerospike_node_sindex_ucgarbage_found sindex ucgarbage found
# TYPE aerospike_node_sindex_ucgarbage_found gauge
aerospike_node_sindex_ucgarbage_found 0
# HELP aerospike_node_system_free_mem_pct system free mem pct
# TYPE aerospike_node_system_free_mem_pct gauge
aerospike_node_system_free_mem_pct 85
# HELP aerospike_node_up Is this node up
# TYPE aerospike_node_up gauge
aerospike_node_up 1
# HELP aerospike_node_uptime uptime
# TYPE aerospike_node_uptime counter
aerospike_node_uptime 21
# HELP aerospike_ns_evict_tenths_pct evict tenths pct
# TYPE aerospike_ns_evict_tenths_pct gauge
aerospike_ns_evict_tenths_pct{namespace="test"} 5
# HELP aerospike_ns_high_water_disk_pct high water disk pct
# TYPE aerospike_ns_high_water_disk_pct gauge
aerospike_ns_high_water_disk_pct{namespace="test"} 50
# HELP aerospike_ns_high_water_memory_pct high water memory pct
# TYPE aerospike_ns_high_water_memory_pct gauge
aerospike_ns_high_water_memory_pct{namespace="test"} 60
# HELP aerospike_ns_memory_size memory size
# TYPE aerospike_ns_memory_size gauge
aerospike_ns_memory_size{namespace="test"} 1.073741824e+09
# HELP aerospike_ns_objects objects
# TYPE aerospike_ns_objects gauge
aerospike_ns_objects{namespace="test"} 0
# HELP aerospike_ns_stop_writes_pct stop writes pct
# TYPE aerospike_ns_stop_writes_pct gauge
aerospike_ns_stop_writes_pct{namespace="test"} 90
Thanks very much!
It's basically just installed on an EC2 instance. I don't believe we have a username/password since it's the community edition. Could it be the node IP address? Anything else I can run to give you more info?
Everything runs on the same node? I run on EC2 (EE) for my work and we've never had a real problem. Yeah, it's not an authentication issue; an authentication failure would produce a different error.
Any significant change you made to the config? Maybe a problem with the asprom binary you're using? (I just compiled mine from source. Where did you get yours so I can try?) Anything in aerospike logs?
Also, can you log into asadm and run "summary" and "info", just to make sure everything looks good on the surface?
I think you're on to something - we're running a single node setup, but I think previously we ran a 2nd node (now turned off).
Here's the network settings:
network {
service {
address any
port 3000
}
heartbeat {
mode mesh
port 3002 # Heartbeat port for this node.
# List one or more other nodes, one ip-address & port per line:
mesh-seed-address-port 10.10.10.10 3002
# mesh-seed-address-port 10.10.10.11 3002
# mesh-seed-address-port 10.10.10.12 3002
# mesh-seed-address-port 10.10.10.13 3002
# mesh-seed-address-port 10.10.10.14 3002
interval 250
timeout 10
}
fabric {
port 3001
}
info {
# Aerospike database configuration file for deployments using mesh heartbeats.
port 3003
}
}
$ tail -f /var/log/aerospike/aerospike.log
Apr 29 2020 00:48:35 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.10.10.10:3002: timed out
Apr 29 2020 00:49:02 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.10.10.10:3002: timed out
Apr 29 2020 00:49:28 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.10.10.10:3002: timed out
Is 10.10.10.10 the node that's running or was that supposed to be the second one?
[EDIT] If it's the second one, your node probably just won't come up, since it's trying to reach an existing cluster before starting properly.
I think it's the second one. I tried commenting it out, but still same error and I don't see anything in the aerospike logs.
Assuming you have the default log configuration:
logging {
# Log file must be an absolute path.
file /var/log/aerospike/aerospike.log {
context any info
}
# Send log messages to stdout
console {
context any info
}
}
Would you mind stopping aerospike, clearing that log file (> /var/log/aerospike/aerospike.log), restarting it, and pasting the log here? (feel free to remove anything you deem private).
Can you also run the asadm commands I asked about above?
$ asadm
Aerospike Interactive Shell, version 0.0.13
Found 1 nodes
Online: <IP>:3000
Admin> summary
ERR: Do not understand 'summary'
Admin> info
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node Build Cluster Cluster Cluster Free Free Migrates Principal Objects Uptime
. . Size Visibility Integrity Disk% Mem% (tx,rx,q) . . .
i 3.6.3 1 True True 99 99 (0,0,0) i 275.563 K 00:02:50
Number of rows: 1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Network Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node Node Fqdn Ip Client Current HB HB
. Id . . Conns Time Self Foreign
i *BB99F9EC62DBE02 <DNS>:3000 <IP>:3000 6 325819903 0 0
Number of rows: 1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Node Namespace Avail% Evictions Master Replica Repl Stop Disk Disk HWM Mem Mem HWM Stop
. . . . Objects Objects Factor Writes Used Used% Disk% Used Used% Mem% Writes%
i DAILY 99 0 11.229 K 0.000 1 false 4.340 MB 1 50 6.769 MB 1 60 90
i FIVE_MINUTES 97 0 200.471 K 0.000 1 false 78.228 MB 2 50 61.195 MB 2 60 90
i HOURLY 99 0 35.587 K 0.000 1 false 13.520 MB 1 50 14.354 MB 1 60 90
i LIVE 99 0 23.829 K 0.000 1 false 8.726 MB 1 50 10.332 MB 1 60 90
i MONTHLY 99 0 2.388 K 0.000 1 false 1023.000 KB 1 50 789.572 KB 1 60 90
i WEEKLY 99 0 2.059 K 0.000 1 false 792.875 KB 1 50 623.619 KB 1 60 90
i bar N/E 0 0.000 0.000 1 false N/E N/E 50 0.000 B 0 60 90
i test N/E 0 0.000 0.000 1 false N/E N/E 50 72.000 KB 1 60 90
Number of rows: 8
I set logging to INFO and I see a lot of these messages:
Apr 29 2020 01:33:13 GMT: INFO (query): (thr_query.c::2632) Query on non-existent set 35
FYI, we had a lot (1000+) of secondary indexes created (with many on non-existent sets... long story!). I would delete them all if I didn't have to do it one-by-one - do you happen to know a way to drop them all in one go? Could that be causing an issue?
These tools are super old. Where did you get them? Can you try and install a more recent version? https://www.aerospike.com/docs/operations/install/tools/index.html
Yeah I'm starting to wonder if you're not just waiting on a cold restart to happen or something like that.
The easiest way to get rid of the secondary indexes would probably be to do a backup, wipe the whole cluster, and restore the data without the secondary indexes, re-creating only the ones you want.
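If the one-by-one route is unavoidable, it can at least be scripted. Here's a hedged sketch that turns a sindex listing into one delete command per index. The listing format (semicolon-separated records of colon-separated key=value pairs) and the sindex-delete syntax are assumptions based on the 3.x tools, so verify against your cluster first, and note that indexes created on a set may also need set= in the delete command. It only echoes the commands:

```shell
# Dry-run sketch: turn an "asinfo -v sindex" listing into one
# sindex-delete command per index. The sample listing below is
# hypothetical; replace it with: asinfo -v sindex > /tmp/sindex.txt
cat > /tmp/sindex.txt <<'EOF'
ns=test:set=demo:indexname=idx_a:bin=ts:type=NUMERIC;ns=test:set=gone:indexname=idx_b:bin=ts:type=NUMERIC
EOF
tr ';' '\n' < /tmp/sindex.txt | awk -F'[:=]' '
  { ns = ""; idx = ""
    for (i = 1; i < NF; i += 2) {       # walk the key=value pairs
      if ($i == "ns") ns = $(i + 1)
      if ($i == "indexname") idx = $(i + 1)
    }
    if (ns != "" && idx != "")          # pipe this output to sh to actually delete
      print "asinfo -v \"sindex-delete:ns=" ns ";indexname=" idx "\""
  }'
```

Double-check the generated commands on a throwaway index before piping the whole list to sh.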
I'm not familiar with that specific log message, but it looks like one of your clients is trying to read from that set that doesn't exist.
I'm kinda stumped tbh...
Especially with empty logs. Are you using systemctl? Sometimes a few things don't make it to the log file and are only in journalctl -u aerospike.
[EDIT] Inside of asadm you can try and run health and see if anything weird stands out.
[EDIT2] What happens if you try to run asinfo -v status?
Ok upgraded the tools to the latest. Is it worth upgrading Aerospike itself? Do you know where I can get the latest 3.x download?
$ asinfo -v status
ok
Admin> summary
Cluster
=======
1. Server Version : C-3.6.3
2. OS Version : Amazon Linux AMI 2015.09 (4.1.10-16.27.amzn1.x86_64)
3. Cluster Size : 1
4. Devices : Total 6, per-node 6
5. Memory : Total 20.000 GB, 1.15% used (235.520 MB), 98.85% available (19.770 GB)
6. Disk : Total 13.000 GB, 0.80% used (106.381 MB), 98.38% available contiguous space (12.790 GB)
7. Usage (Unique Data): 0.000 B in-memory, 86.661 MB on-disk
8. Active Namespaces : 6 of 8
9. Features : Aggregation, KVS, Query, SINDEX, Scan
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespaces~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Namespace Devices Memory Disk Replication Master Usage Usage
. (Total,Per-Node) (Total,Used%,Avail%) (Total,Used%,Avail%) Factor Objects (Unique-Data) (Unique-Data)
. . . . . . In-Memory On-Disk
DAILY (1, 1) (2.000 GB, 1.00, 99.00) (1.000 GB, 0.42, 99.00) 1 11.192 K 0.000 B 3.522 MB
FIVE_MINUTES (1, 1) (4.000 GB, 2.00, 98.00) (4.000 GB, 1.91, 97.00) 1 200.495 K 0.000 B 63.863 MB
HOURLY (1, 1) (4.000 GB, 1.00, 99.00) (2.000 GB, 0.65, 99.00) 1 35.276 K 0.000 B 10.866 MB
LIVE (1, 1) (4.000 GB, 1.00, 99.00) (4.000 GB, 0.21, 99.00) 1 23.613 K 0.000 B 6.958 MB
MONTHLY (1, 1) (2.000 GB, 1.00, 99.00) (1.000 GB, 0.10, 99.00) 1 2.388 K 0.000 B 844.195 KB
WEEKLY (1, 1) (2.000 GB, 1.00, 99.00) (1.000 GB, 0.08, 99.00) 1 2.059 K 0.000 B 641.427 KB
bar (0, 0) (1.000 GB, 0.00, 100.00) (0.000 B, 0.00, 0.00) 1 0.000 0.000 B 0.000 B
test (0, 0) (1.000 GB, 1.00, 99.00) (0.000 B, 0.00, 0.00) 1 0.000 0.000 B 0.000 B
Number of rows: 8
Admin> health
INFO: Collecting 1 collectinfo snapshot. Use -n to set number of snapshots.
INFO: Snapshot 1
____________________________________Summary_____________________________________
Total: 113
Passed: 55
Failed: 5
Skipped: 53
_______________________________ PASS: count(55) ________________________________
ANOMALY: Service errors count anomaly check.
ANOMALY: Key Busy errors count anomaly check.
LIMITS: System open file descriptor limit check.
LIMITS: System swap check.
OPERATIONS: System OOM kill check.
OPERATIONS: System process blocking Check.
OPERATIONS: OS version check.
OPERATIONS: CPU config check.
OPERATIONS: Sysctl config check.
OPERATIONS: Firewall Check.
OPERATIONS: Service config runtime and conf file difference check.
OPERATIONS: Namespace config runtime and conf file difference check.
OPERATIONS: Migration thread configuration check.
OPERATIONS: Non-default namespace device high water mark check.
OPERATIONS: Non-default namespace device low water mark check.
OPERATIONS: Set delete status check.
OPERATIONS: Node read errors count check
OPERATIONS: Non-default namespace defrag-sleep check.
ANOMALY: CPU utilization anomaly check.
ANOMALY: Resident memory utilization anomaly.
ANOMALY: Set object count anomaly check.
LIMITS: Namespace available bin names check.
LIMITS: Namespace per node memory limit check.
OPERATIONS: Client connections check.
OPERATIONS: Namespace disk available pct check.
OPERATIONS: Service configurations difference check.
OPERATIONS: Device IO scheduler check.
OPERATIONS: Namespace device size configuration difference check.
OPERATIONS: Defrag low water mark misconfiguration check.
OPERATIONS: Namespace single node failure disk config check.
OPERATIONS: Namespace single node failure memory config check.
OPERATIONS: Namespaces per node count check.
OPERATIONS: Namespace configurations difference check.
OPERATIONS: Namespace HWM breach check.
OPERATIONS: Set eviction configuration difference check.
OPERATIONS: Set xdr configuration difference check.
OPERATIONS: XDR configurations difference check.
OPERATIONS: Cluster size check.
OPERATIONS: Services list discrepancy test.
PERFORMANCE: Fragmented Blocks check.
LIMITS: System memory percentage check.
OPERATIONS: Critical Namespace memory available pct check.
OPERATIONS: Critical Namespace disk available pct check.
OPERATIONS: Duplicate device/file check.
OPERATIONS: Namespace order check.
OPERATIONS: Cluster integrity fault check.
OPERATIONS: Cluster Key difference check.
OPERATIONS: Cluster stability check.
OPERATIONS: Critical cluster size check.
OPERATIONS: Paxos single replica limit check
OPERATIONS: UDF sync (file not matching) check.
OPERATIONS: UDF sync (availability on all node) check.
OPERATIONS: SINDEX sync state check.
OPERATIONS: SINDEX metadata sync (availability on all node) check.
PERFORMANCE: CPU utilization check.
________________________________ FAIL: count(5) ________________________________
INFO
LIMITS: Aerospike runtime memory configured < 5G.
OPERATIONS: ENA not enabled.
OPERATIONS: Non-default namespace replication-factor configuration.
WARNING
LIMITS: Namespace memory misconfiguration.
OPERATIONS: Old Amazon Linux AMI.
________________________________________________________________________________
INFO: Please use -v option for more details on failure.
I doubt upgrading would fix your issue. And I don't see an easy way to download something older than 4.0 (outside of Docker); I think they removed them from public access. I have a few tarballs from more recent versions than yours, but they're for Ubuntu. Not sure how useful that is to you.
I can't see anything obviously wrong jumping out at me... Have you tried compiling asprom from source? I don't think you told me where/how you got your binary.
I just downloaded the 1.8.0 binary from here: https://github.com/alicebob/asprom/releases. I'll have a go at compiling 1.9.0 from source.
Yeah that binary works for me... So that's probably not your problem. I'll try to see if I can think of anything else... If you have a way for me to connect to that host I'm willing to investigate manually but I understand that's probably not possible...
If you have a way, I would try to start a node with no data (not connecting to your existing node obviously), see if you can get asprom working there. If not, maybe there is a bug in the aerospike binary you're using. In which case we can try to see if we can get another version for you. But that's about all the ideas I have left.
I just tried 1.9.0 in our production environment instead (Aerospike 3.14.1.1) and it works!!! That was the end goal anyway. I can try and get it working in staging later - maybe I'll try updating that to the same Aerospike binary.
Thanks so much for your help!
Ok weird... So then what's left is either a bug in the binary you're running, or something with the environment (the secondary indexes you were talking about, these kind of things). Well glad you fixed it :-)
If you want to clean your secondary indexes problem, I still recommend the backup/restore route. It will also let you start clean from any other weird leftovers.
FWIW I just hit the same scenario in my environment because I had forgotten to provide credentials to connect to the aerospike cluster. We should probably look into getting a better error out of this.
Ok, interesting.
On a slight aside, you wouldn't happen to know how to graph data over time (TTL)? Or even on one of our bins (timestamp) would be ideal.
I don't understand what you're asking. Do you mean graphing data that is in Aerospike, like using Aerospike as a datasource? If so, no, it's not currently possible afaik. Aerospike is not a supported datasource for Grafana, since it's not technically a time-series database (there are ways to use it that way, but...). You would basically need to write your own program to export the data to Prometheus (or another time-series database) and use that.
Thanks for the response.
I'm trying to understand how much data we have stored over time. It would be great if there was a TTL metric for namespace/sets, but doesn't look like there is. I guess I need to look into asinfo hist-dump or histogram commands. Or perhaps a custom aql query using the timestamp bin.
I still don't understand what you're looking for sorry :-D
But yeah the histograms for TTLs don't seem to be available right now. You can try to submit an issue/feature request on the exporter supported by aerospike.
If you're interested in something like how many unique objects (vs master objects) you've had over the past year or something, you might be able to figure something out with the metrics related to evict/evicted/evictions. Those should tell you how many records have been removed because of TTLs.
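If you do go down the hist-dump route you mentioned, something like this might get you started. Fair warning: the response format assumed here ("ns:ttl=&lt;buckets&gt;,&lt;bucket-width-seconds&gt;,&lt;count&gt;,&lt;count&gt;,...") is my guess at the 3.x behavior, so check it against your server version; the sample data is made up:

```shell
# Hedged sketch: summarize "asinfo -v 'hist-dump:ns=test;hist=ttl'"
# output as per-bucket object counts. The $hist sample is fabricated;
# replace it with: hist=$(asinfo -v 'hist-dump:ns=test;hist=ttl')
hist='test:ttl=4,3600,10,0,5,1'
echo "$hist" | awk -F'[=,]' '
  { width = $3                          # seconds covered per bucket
    for (i = 4; i <= NF; i++)
      if ($i > 0)                       # skip empty buckets
        printf "ttl < %d s: %d objects\n", (i - 3) * width, $i
  }'
```

Scraping that periodically into Prometheus would give you a rough picture of how your data's remaining lifetime is distributed over time.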