
[EPIC] Improve accuracy and reliability of Grafana Dashboards

Open balajialg opened this issue 3 years ago • 21 comments

Summary

Grafana dashboards are a comprehensive and useful tool that serves the following purposes:

  1. Highlighting important metrics such as the number of active users
  2. Highlighting hub performance through indicators such as memory distribution, pod latency, etc.

We use Grafana extensively, which highlights the critical role the tool plays. However, the Grafana metrics can be confusing to parse in certain cases. The purpose of this enhancement request is to revamp the reporting to make it easier to interpret.

The graph below is titled "Active users over 24 hours" but reports metrics for the entire month. In addition, we are not sure whether this is an accurate count of unique users on Datahub during the past 24 hours.

[screenshot: "Active users over 24 hours" panel]

The graph below reports monthly active users as 9k. However, it is not clear whether that is a count of unique users, and if so, whether the reported numbers are accurate.

[screenshot: monthly active users panel]

Some of the data is simply not reported in the dashboard:

[screenshots: panels with missing data]

Sometimes the dashboard allows saving changes, which can be confusing:

[screenshot: dashboard save prompt]

In certain cases, graphs appear twice:

[screenshot: duplicated graphs]

The categorization of the dashboards is not intuitive. For example, what is the difference between JupyterHub, JupyterHub Dashboard, and JupyterHub Original Dashboard? Categorizing these clearly would be valuable.

[screenshot: dashboard list]

User Stories

  • As a team member, I want access to reliable and accurate metrics in the dashboard for troubleshooting and evangelizing purposes.

Acceptance criteria

  • Given an outage, I have all the information required in Grafana to debug the issue
  • Given a workshop or a meeting with leadership, I have accurate and reliable metrics that can be shared with stakeholders

Tasks to complete

  • [x] #2859
  • [ ] #2860
  • [x] #2861
  • [ ] #2943
  • [ ] https://github.com/jupyterhub/jupyterhub/pull/4214 (Implement @yuvipanda's PR improving the accuracy of Grafana metrics for calculating active users)

balajialg avatar Sep 29 '21 21:09 balajialg

Thanks a lot for opening this, @balajialg!

github.com/jupyterhub/jupyterhub-grafana deploys a set of common dashboards to a particular folder - in our case, https://grafana.datahub.berkeley.edu/dashboards/f/70E5EE84-1217-4021-A89E-1E3DE0566D93/jupyterhub-default-dashboards. I think that folder is the only one that's reliably consistent - those dashboards are version controlled, documented, and somewhat understood by the broader community. Every other dashboard is really one of us 'playing around', and I'm not sure how much I trust most of them.
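As an aside, here's a minimal sketch of what that deploy step amounts to, using Grafana's dashboard HTTP API directly. The token env var, file name, and folder UID are stand-ins (the UID is copied from the folder URL above), and older Grafana versions take `folderId` instead of `folderUid`:

```python
import json
import os

import requests

GRAFANA_URL = "https://grafana.datahub.berkeley.edu"
TOKEN = os.environ["GRAFANA_TOKEN"]  # an API token with editor rights


def deploy_dashboard(dashboard_json_path: str, folder_uid: str) -> None:
    """Push a dashboard JSON file into a specific Grafana folder."""
    with open(dashboard_json_path) as f:
        dashboard = json.load(f)

    # Grafana's dashboard API creates or updates in a single call;
    # "overwrite" replaces any existing dashboard with the same uid/title.
    payload = {
        "dashboard": dashboard,
        "folderUid": folder_uid,
        "overwrite": True,
    }
    resp = requests.post(
        f"{GRAFANA_URL}/api/dashboards/db",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json()["url"])


# Hypothetical file name; the UID is the default-dashboards folder above.
deploy_dashboard("usage-metrics.json", "70E5EE84-1217-4021-A89E-1E3DE0566D93")
```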

Here's a suggestion on how to proceed.

  • [x] Move all dashboards we are 'playing around with' to their own folder, so there is less confusion
  • [ ] Add descriptions to all dashboards and panels in https://github.com/jupyterhub/grafana-dashboards
  • [ ] Develop the 'usage metrics' dashboard some more - it can be very helpful for evangelism, but isn't in a usable state now.

How does this sound, @balajialg? If this sounds good, what kinda metrics will be useful for evangelism?

yuvipanda avatar Sep 30 '21 10:09 yuvipanda

@yuvipanda These are awesome next steps! Makes a lot of sense. Moving the dashboards we are playing around with into a separate folder and adding descriptions to all the dashboards would make them much easier to interpret.

In terms of metrics required for evangelism, I am looking at articulating our story in terms of our reach, impact, and the technical brilliance of the tool.

  1. REACH: What do our unique Daily Active Users (DAU) and Monthly Active Users (MAU) numbers look like? (Dissecting this data across hubs; see the sketch after this list)
  2. IMPACT: How much time are our users spending cumulatively across all the hubs (daily/monthly/yearly)? Articulating that in terms of years would be a powerful metric for evangelizing the extensive usage we are observing.
  3. IMPACT: How many assignments are completed cumulatively on a daily/monthly/yearly basis?
  4. TECHNICAL BRILLIANCE: Metrics around the time it takes for us to auto-scale to thousands of users. Any other metric to articulate the value prop from a technology standpoint would be amazing.
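For item 1, a minimal sketch of how unique DAU/MAU could be computed from raw hub logs, assuming a JSON-lines export where each server-start record carries a "username" and an ISO-8601 "timestamp" (the field names and file name are assumptions, not the real schema):

```python
import json

import pandas as pd

# Load one record per server start from a hypothetical JSON-lines export.
records = []
with open("hub-start-events.jsonl") as f:
    for line in f:
        event = json.loads(line)
        records.append({"user": event["username"], "ts": event["timestamp"]})

df = pd.DataFrame(records)
df["ts"] = pd.to_datetime(df["ts"])

# Unique users per calendar day / month: count each user at most once
# per period, which is what "unique DAU/MAU" should mean.
dau = df.groupby(df["ts"].dt.date)["user"].nunique()
mau = df.groupby(df["ts"].dt.to_period("M"))["user"].nunique()

print(dau.tail(7))
print(mau.tail(3))
```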

@yuvipanda These are desirable metrics. Let me know how many of these requests are actually feasible. Thanks for taking the time to look into this. Appreciate it.

balajialg avatar Sep 30 '21 16:09 balajialg

I've organized the dashboards into folders now:

[screenshot: dashboards organized into folders]

The 'production' dashboards are clearly labelled now.

yuvipanda avatar Oct 01 '21 14:10 yuvipanda

@yuvipanda Thanks for this.

When you are back from vacation on Monday (10/18), can you prioritize the following request?

For the tech strategy meeting on 10/21, Jim Colliander, Erfan, and Bill Allison (CTO) are joining us. @ericvd-ucb suggested that we use this time to share an interim update on fall usage and make the case for additional resourcing. Would you be able to fill in some of the details required for this deck? Please let me know which data points are not feasible to fetch at this juncture.

balajialg avatar Oct 15 '21 20:10 balajialg

@yuvipanda Bringing this back to your attention to get your perspectives!

balajialg avatar Oct 18 '21 18:10 balajialg

@balajialg I'm looking at slides 7, 8 and 9. I can produce data for 7 and 9, but I don't understand what 8 refers to. Can you expand a little bit?

yuvipanda avatar Oct 18 '21 19:10 yuvipanda

@balajialg you should be able to get cost information from https://console.cloud.google.com/billing/013554-935B0A-B97AA1/reports;grouping=GROUP_BY_SKU;projects=ucb-datahub-2018?project=ucb-datahub-2018

yuvipanda avatar Oct 18 '21 19:10 yuvipanda

@yuvipanda I don't have the required permission to view the billing information. Can you elevate the privileges for me?

[screenshot: billing permissions error]

balajialg avatar Oct 18 '21 20:10 balajialg

done

yuvipanda avatar Oct 18 '21 20:10 yuvipanda

@yuvipanda There seems to be a minor discrepancy between the Grafana data and the raw data shared with me. Sharing snapshots from the last 30 days:

Grafana data from the past 30 days:

[screenshot: Grafana 30-day usage panel]

Analysis based on the raw data shared for the past 30 days:

[screenshot: analysis of raw log data]

Link to R notebook where I did the above analysis.

As I mentioned in the chart, it would be amazing to calculate the total time spent by users in the hub, from an evangelizing perspective. Given the mismatch between start and stop actions, it would be great if we can figure out a way to log users' stop actions (whenever the culler shuts down inactive servers). We have a huge discrepancy between start and stop actions based on the data below (see the sketch after the table):

start   stop
158703  2099
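One way to estimate cumulative usage despite the missing stop actions: pair each start with the next stop for the same user, and cap any unmatched session at the culler's idle timeout. A minimal sketch under those assumptions - the file name, column names, and one-hour timeout are all stand-ins, not our actual log schema or culler config:

```python
from datetime import timedelta

import pandas as pd

# Cap for sessions whose stop action was never logged (the culler shuts
# idle servers down without one appearing here). Assumption -- match the
# real culler idle timeout.
CULLER_TIMEOUT = timedelta(hours=1)

# Assumed schema: one JSON object per line with user, action, timestamp.
events = pd.read_json("hub-actions.jsonl", lines=True)
events["ts"] = pd.to_datetime(events["timestamp"])
events = events.sort_values("ts")

total = timedelta(0)
open_starts = {}  # user -> timestamp of their currently open session

for row in events.itertuples():
    if row.action == "start":
        # A start while a session is already open means its stop was lost;
        # close the earlier session at the capped estimate.
        if row.user in open_starts:
            total += min(row.ts - open_starts[row.user], CULLER_TIMEOUT)
        open_starts[row.user] = row.ts
    elif row.action == "stop" and row.user in open_starts:
        total += min(row.ts - open_starts.pop(row.user), CULLER_TIMEOUT)

# Sessions still open at the end of the log also get the capped estimate.
total += CULLER_TIMEOUT * len(open_starts)
print(f"Estimated cumulative usage: {total}")
```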

balajialg avatar Oct 19 '21 19:10 balajialg

We should completely discount the Grafana usage data - it was an experiment that hasn't been validated at all. The log data is definitely more accurate.

I agree on session lengths! I'll try to think of a way to properly measure that.

yuvipanda avatar Oct 20 '21 08:10 yuvipanda

@yuvipanda That will be awesome!

balajialg avatar Oct 20 '21 17:10 balajialg

@yuvipanda Highlighting some of our Grafana woes: copy-pasting some of the Grafana results for CPU allocation today. We definitely need a way to solve this issue of Grafana not fetching data, to unblock @felder whenever he needs to analyze things.

[screenshots: CPU allocation panels failing to fetch data]

balajialg avatar Feb 18 '22 20:02 balajialg

  • [x] @felder to work with @yuvipanda to fix some of the issues with Grafana. It might involve bumping up the resource limits of Prometheus!

balajialg avatar Feb 24 '22 20:02 balajialg

@shaneknapp Just an FYI - Grafana improvements are something I wanted to discuss during our sprint planning meeting but missed adding to our monthly backlog. Some of the graphs in the dashboard often break when changing time intervals, resulting in empty responses. Here is an example from my exploration today:

[screenshot: panel returning an empty response]

Either the graph needs to be fixed, or it should provide a better error message from a user standpoint.

balajialg avatar Nov 08 '22 23:11 balajialg

@balajialg I believe the graphs are breaking because the responses are taking too long. If you click the red exclamation mark, you should be able to get an indication of why it's unhappy.

For example:

[screenshots: query timeout error details]

Basically the queries are timing out, meaning that Grafana is not getting the information as quickly as it expects to. So it's not really a matter of fixing the graph, unless the query that generates the graph is grossly inefficient.

In the past the solution was to allocate more RAM to Prometheus to speed it up. I'm not sure how sustainable that solution is: the more data that is collected, the slower it gets.

felder avatar Nov 08 '22 23:11 felder

@felder Got it. So, optimizing the queries is the way forward? How does your PR here relate to this objective?

balajialg avatar Nov 08 '22 23:11 balajialg

@balajialg optimizing might work... maybe?

you can get the same graphs/reports by running prometheus locally on your laptop and executing the queries there. perhaps it's time for me to remember to have @felder show me (and now you) how to do this. :)
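for reference, a sketch of what "executing the queries there" could look like against Prometheus's standard HTTP API once it's exposed locally. the port-forward target and the example query are assumptions, not our exact deployment - grab the real expression from the panel's Edit view in Grafana:

```python
import time

import requests

# Expose the cluster's Prometheus locally first, e.g.:
#   kubectl -n support port-forward svc/support-prometheus-server 9090:80
# (namespace/service names are assumptions -- check the actual deployment)
PROM_URL = "http://localhost:9090"

# Illustrative CPU query; substitute the panel's real expression.
QUERY = 'sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)'

end = time.time()
start = end - 24 * 3600  # last 24 hours

resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={"query": QUERY, "start": start, "end": end, "step": "5m"},
    timeout=300,  # generous client-side timeout for slow queries
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"], len(series["values"]), "samples")
```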

shaneknapp avatar Nov 09 '22 00:11 shaneknapp

@balajialg That PR was about fixing the queries because they were returning incorrect information. My PR doesn't have any relation to improving the efficiency of those queries.

I also do not know if optimizing the queries is the way forward, as I do not know whether the queries can be optimized further. An investigation (along with an education in PromQL query optimization) would be required. Here's one example result from a Google search for PromQL query optimization: https://thenewstack.io/query-optimization-in-the-prometheus-world/

Assuming the queries are suboptimal, then yes, that might be one way to address this. However, if they cannot be optimized then we'll need to do something else. I just don't know whether throwing RAM at it is the way to go...it could be. However, it might also just be a game of whack-a-mole until adding RAM is no longer feasible.

felder avatar Nov 09 '22 00:11 felder

@shaneknapp This seems like an interesting idea. Curious to see how my laptop would handle this volume of data (assuming it is a large dataset). Looking forward to learning more about this.

@felder Got it. Based on what you said, it seems like this is more of an experimental project (like Google Filestore) in terms of a) figuring out whether there are alternatives to increasing RAM and b) finding a way to optimize the queries.

balajialg avatar Nov 09 '22 03:11 balajialg

  • [ ] Bump the version of JupyterHub (define the latest JupyterHub changes properly and decide)
  • [ ] Scope this for a maintenance window during spring break?
  • [ ] Email the datahub-announce list to ask about usage during the maintenance window and to announce that there is one!

balajialg avatar Feb 03 '23 01:02 balajialg