Missing Metrics
Hey guys,
Wondering if someone could assist with an issue I'm having with BigGraphite [BG]. It currently receives a large number of metrics, but appears to drop a noticeable proportion at random. This was highlighted when looking at metrics from Apache Spark, which show frequent one-minute gaps every hour.
Infrastructure Setup:
- Within EKS (1.20)
- internal AWS NLB
- Traffic Flow: NLB -> Carbon Container -> {elasticsearch + cassandra}
- Carbon: Running inside an upstream Alpine container
- ps output from inside the container:

        1 root      0:00 {entrypoint} /bin/sh /entrypoint
       49 root      0:00 runsvdir -P /etc/service
       51 root      0:00 runsv bg-carbon
       52 root      0:03 runsv brubeck
       53 root      0:00 runsv carbon
       54 root      0:00 runsv carbon-aggregator
       55 root      0:03 runsv carbon-relay
       56 root      0:03 runsv collectd
       57 root      0:00 runsv cron
       58 root      0:00 runsv go-carbon
       59 root      0:00 runsv graphite
       60 root      0:00 runsv nginx
       61 root      0:03 runsv redis
       62 root      0:00 runsv statsd
       63 root      0:00 tee -a /var/log/carbon.log
       65 root      0:00 tee -a /var/log/carbon-relay.log
       68 root      0:00 tee -a /var/log/statsd.log
       69 root      0:01 {gunicorn} /opt/graphite/bin/python3 /opt/graphite/bin/gunicorn wsgi --pythonpath=/opt/graphite/webapp/graphite --preload --threads=1 --worker-class=sync --workers=4 --limit-request-line=0 --max-requests=1000 --timeout=65 --bind=0.0
       70 root      0:09 {node} statsd /opt/statsd/config/tcp.js
       71 root      0:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
       76 root      0:00 /usr/sbin/crond -f
       79 nginx     0:00 nginx: worker process
       80 nginx     0:00 nginx: worker process
       81 nginx     0:00 nginx: worker process
       82 nginx     0:00 nginx: worker process
       85 root      0:35 tee -a /var/log/bg-carbon.log
       86 root     45:27 /opt/graphite/bin/python3 /opt/graphite/bin/bg-carbon-cache start --nodaemon --debug
       88 root      0:00 tee -a /var/log/carbon-aggregator.log
      156 root      0:41 {gunicorn} /opt/graphite/bin/python3 /opt/graphite/bin/gunicorn wsgi --pythonpath=/opt/graphite/webapp/graphite --preload --threads=1 --worker-class=sync --workers=4 --limit-request-line=0 --max-requests=1000 --timeout=65 --bind=0.0
      157 root      0:49 {gunicorn} /opt/graphite/bin/python3 /opt/graphite/bin/gunicorn wsgi --pythonpath=/opt/graphite/webapp/graphite --preload --threads=1 --worker-class=sync --workers=4 --limit-request-line=0 --max-requests=1000 --timeout=65 --bind=0.0
      158 root      0:46 {gunicorn} /opt/graphite/bin/python3 /opt/graphite/bin/gunicorn wsgi --pythonpath=/opt/graphite/webapp/graphite --preload --threads=1 --worker-class=sync --workers=4 --limit-request-line=0 --max-requests=1000 --timeout=65 --bind=0.0
      159 root      0:47 {gunicorn} /opt/graphite/bin/python3 /opt/graphite/bin/gunicorn wsgi --pythonpath=/opt/graphite/webapp/graphite --preload --threads=1 --worker-class=sync --workers=4 --limit-request-line=0 --max-requests=1000 --timeout=65 --bind=0.0
I can see traffic coming into the interface (tcpdump/tcpflow), and I can see 'cache query' entries being written to bg-carbon.log, but almost no datapoint logs for the Spark metrics.
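For reference, this is roughly how I've been checking (assuming the default plaintext port 2003 and the log path from the ps output above; the test metric name is just a throwaway example):

```sh
# Watch for Spark datapoints arriving on the plaintext listener
# (2003 is the stock carbon line-protocol port -- adjust if yours differs)
tcpdump -i any -nn -A 'tcp port 2003' | grep -i spark

# Push a hand-rolled datapoint and see whether it makes it into the log
# ("test.debug.spark" is a throwaway name, not a real metric)
echo "test.debug.spark 42 $(date +%s)" | nc -w1 127.0.0.1 2003

# bg-carbon.log is where the container tees carbon-cache's output
grep 'test.debug.spark' /var/log/bg-carbon.log | tail
```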
Any assistance in troubleshooting would be greatly appreciated!
If you look on the Cassandra side:
- do you have errors?
- do you see a drop in write ops when you notice the drops?
Inside your container, does carbon restart by itself?
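Something along these lines should surface it (paths and the keyspace name are just the defaults I'd expect, so adjust for your deployment):

```sh
# On a Cassandra node: dropped MUTATION messages mean writes are being
# thrown away under load (see the "Dropped Messages" section of the output)
nodetool tpstats

# Errors/warnings around the time of the gaps (log path may differ)
grep -E 'ERROR|WARN' /var/log/cassandra/system.log | tail -50

# Per-table write counts for the BigGraphite keyspace
# ("biggraphite" is a placeholder -- use whatever keyspace you configured)
nodetool tablestats biggraphite

# Inside the carbon container: runit shows per-service uptime, so a
# suspiciously short uptime means the service has been restarting
sv status /etc/service/*
```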
Apologies for the delay in coming back to you!
I've rebuilt the cache container so that it only runs carbon-cache. Previously it was running statsd + carbon + other services, all under supervisord or similar; the container now runs carbon exclusively.
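For context, the new container's entrypoint is now essentially just this (a sketch, assuming the same bg-carbon-cache invocation that showed up in the earlier ps output):

```sh
#!/bin/sh
# Single-purpose entrypoint: run only carbon-cache (BigGraphite's
# bg-carbon-cache wrapper), in the foreground so logs go to stdout
exec /opt/graphite/bin/bg-carbon-cache start --nodaemon --debug
```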
At first, under low load, there were no metric drop-outs at all: we were shipping all of the Spark metrics and it was bulletproof. As soon as we started shipping more metrics from other services, we began to see drop-outs of 1-2 minutes across multiple metrics. Another interesting observation is that datapoints sometimes seem to disappear after the fact - I'm not sure whether they're being overwritten by null values. What I can tell you is that metrics are now fed into a dedicated carbon ingress and queried from a separate Graphite endpoint, so Whisper data isn't involved.
I've made multiple tweaks to the configs, but I'm at a bit of a loss as to how to eradicate the intermittent data loss.
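In case it points to something obvious, these are the kinds of checks I've been running; the settings and metric names below are the stock carbon ones, so treat this as a sketch rather than anything bg-carbon-cache specific:

```sh
# Cache limits in carbon.conf: if the cache fills up and flow control is
# off, carbon-cache starts dropping incoming datapoints
grep -E 'MAX_CACHE_SIZE|MAX_UPDATES_PER_SECOND|USE_FLOW_CONTROL' \
    /opt/graphite/conf/carbon.conf

# Any sign of the cache overflowing or points being dropped
grep -iE 'full|drop' /var/log/bg-carbon.log | tail -50

# Carbon's own instrumentation via the render API -- a growing cache.size
# or a gap between metricsReceived and committedPoints is a red flag
# (adjust the host to wherever graphite-web is exposed)
curl -s 'http://localhost/render?from=-3h&format=json&target=carbon.agents.*.metricsReceived&target=carbon.agents.*.committedPoints&target=carbon.agents.*.cache.size'
```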
Any help would be GREATLY appreciated!
TIA!