command duration
On a Dragonfly instance there is no command duration information in the metrics output. When troubleshooting problems, the business side pays the most attention to the command duration indicator.
We already collect stats about request execution time into request_latency_usec in main_service.cc, so it might be just a matter of exporting it.
Hi, I am thinking about solving this issue as a first issue.
Is this a wanted feature? Is it ready to be implemented? If so, @royjacobson, what do you mean by exporting it? Should it be exported via the output of a CLI command?
@dor132 @royjacobson Yes, this is the performance parameter we need most. Can we put this in the metrics output? That way it would be more convenient for us to display the command-execution latency of an instance through Prometheus and Grafana.
Adding to what @kissingtiger said:
Dragonfly exposes metrics data in the Prometheus format through the dfly:6379/metrics HTTP page. To add a new metric, you can look at the code of the PrintPrometheusMetrics function in server_family.cc.
The web page should look like this -
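To make the shape concrete, here is a sketch of how such a metric could be appended to the Prometheus text output. The metric names and the helper function are illustrative assumptions on my side, not the actual code in server_family.cc:

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Illustrative sketch only: the metric names and this helper are assumptions,
// not Dragonfly's actual PrintPrometheusMetrics implementation.
std::string AppendCommandDurationMetric(const std::string& cmd, uint64_t calls,
                                        uint64_t total_usec) {
  std::ostringstream os;
  os << "# TYPE dragonfly_cmd_duration_seconds_total counter\n"
     << "dragonfly_cmd_duration_seconds_total{cmd=\"" << cmd << "\"} "
     << total_usec / 1e6 << "\n"
     << "dragonfly_cmd_calls_total{cmd=\"" << cmd << "\"} " << calls << "\n";
  return os.str();
}
```

Exposing monotonically increasing totals (rather than precomputed averages) lets Prometheus derive rates and averages on the query side.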
@royjacobson Displaying it on this webpage would be perfect and consistent with my understanding.
Is this expected to land in the next version?
The screenshot is of the current metrics page, which doesn't contain command duration metrics yet.
@royjacobson OK, we need command duration metrics.
Ok. I see three options to display the command duration in the metrics page:
- an average for each command type
- an exact time per command executed
- an average for all commands executed up until a certain time
Should it be one of these options, or a completely different thing I missed?
@royjacobson @kissingtiger What do you think?
@dor132 very good! I'm looking forward to it. Is this feature expected to be available in the next version?
Hi @dor132, I see that Redis returns the following metrics in `info commandstats`: calls=XXX, usec=XXX, usec_per_call=XXX. Currently Dragonfly returns only calls=XXX. I think it would be good to add usec and usec_per_call to `info commandstats` and to the metrics page. usec is the total time per command, and usec_per_call is the average, i.e. usec/calls.
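The arithmetic being proposed is simple: usec_per_call is the total time divided by the call count. A minimal sketch (the function name is mine for illustration, not Redis or Dragonfly code):

```cpp
#include <cstdint>

// usec_per_call as described above: total command time divided by the
// number of invocations. Guards against division by zero for unused commands.
double UsecPerCall(uint64_t total_usec, uint64_t calls) {
  return calls == 0 ? 0.0 : static_cast<double>(total_usec) / calls;
}
```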
Hey, any chance to get this assigned?
It has also been requested by one of our customers, "so we can actually see how dragonfly is performing".
(They referred to exporting it as Prometheus metrics.)
I think it would be very neat to export the command duration/latency as a Prometheus histogram as well.
It's not a trivial task to gather histograms efficiently. Basically you need a high-resolution histogram per command per thread. That's 200 x 100 x |threads| space, and even then you cannot necessarily compute predefined percentiles (only approximations of them).
Btw, Redis latencies are global across the process lifetime, meaning they are not decaying. We can start with that.
(For comparison, our HTTP latencies are moving averages, which are more useful imho.)
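To illustrate the space tradeoff described above, here is a minimal sketch of a fixed-bucket, per-thread latency histogram. Bucket bounds and names are illustrative, not Dragonfly code; the point is that each (command, thread) pair holds its own bucket array, and aggregation is a cheap element-wise sum:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Sketch of one histogram cell in the (commands x buckets x threads) layout.
struct LatencyHistogram {
  // Illustrative upper bounds in microseconds; real code would likely use
  // many more, exponentially spaced buckets.
  static constexpr std::array<uint64_t, 4> kBoundsUsec = {100, 1000, 10000,
                                                          100000};
  std::array<uint64_t, 5> counts{};  // one extra bucket for overflow

  void Record(uint64_t usec) {
    std::size_t i = 0;
    while (i < kBoundsUsec.size() && usec > kBoundsUsec[i]) ++i;
    ++counts[i];
  }

  // Merging per-thread histograms is an element-wise sum, so aggregating
  // at scrape time stays cheap even with many threads.
  void Merge(const LatencyHistogram& other) {
    for (std::size_t i = 0; i < counts.size(); ++i) counts[i] += other.counts[i];
  }
};
```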
When evaluating Redis Cluster vs DragonflyDB this is the most immediate gap that we've noticed - we can only judge Dragonfly's performance from outside-in using our application metrics, but there is no latency information provided by Dragonfly itself. This is a noticeable regression vs our current setup. Would love to be able to see types of commands executed along with latencies.
We will introduce this feature soon
I kind of assumed that the dashboards people use just collect the data over time and compute some kind of a derivative to show the 'current' latencies. Is this not the case?
Re: histograms, I agree it's a lot more work, but I don't think it's that hard, and I think it would be useful if we could do it intelligently.
You are right, exposing count and sum counters that are monotonically increasing is enough for computing a moving average.
And I just noticed Redis 7 even exposes percentiles. In any case that will be a different task with a different priority.
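For reference, here is the count/sum arithmetic sketched in code: given two scrapes of monotonically increasing counters, the average latency over the interval is the ratio of the deltas (this is what a dashboard's rate(sum)/rate(count) query computes; names here are illustrative):

```cpp
#include <cstdint>

// Average latency between two scrapes of monotonic sum/count counters:
// delta(sum) / delta(count). Returns 0 when no commands ran in the interval.
double AvgLatencyUsec(uint64_t sum0, uint64_t count0, uint64_t sum1,
                      uint64_t count1) {
  uint64_t delta_count = count1 - count0;
  return delta_count == 0
             ? 0.0
             : static_cast<double>(sum1 - sum0) / delta_count;
}
```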
The web says t-digest is the data structure for probabilistic estimation of percentiles over online data, and Redis Labs even released one as part of their Stack API: https://redis.com/blog/t-digest-in-redis-stack/