[EPIC] Improve accuracy and reliability of Grafana Dashboards
Summary
Grafana dashboards are a comprehensive and useful tool that serves the following purposes:
- Highlighting important metrics such as the number of active users
- Highlighting hub performance through indicators such as memory distribution, pod latency, etc.
We use Grafana extensively, which highlights the critical role the tool plays. However, the Grafana metrics can be confusing to interpret in certain cases. The purpose of this enhancement request is to revamp the reporting so it is easier to interpret.
The graph below is titled "Active users over 24 hours", yet it reports metrics for the entire month. In addition, we are not sure whether this is an accurate count of unique users on DataHub during the past 24 hours.
The graph below reports monthly active users as 9k. However, it is not clear whether this is a count of unique users and, if so, whether the reported numbers are accurate.
- Some of the data is not getting reported in the dashboard.
- Sometimes the dashboard allows saving changes, which can be confusing.
- In certain cases, the graphs appear twice.
- The categorization of the dashboards is not intuitive. For example, what is the difference between "JupyterHub", "JupyterHub Dashboard", and "JupyterHub Original Dashboard"? Categorizing these clearly would be valuable.
User Stories
- As a team member, I want access to dashboard metrics that are reliable and accurate for troubleshooting and evangelizing purposes.
Acceptance criteria
- Given an outage, I have all the information required in Grafana to debug the issue
- Given a workshop or a meeting with leadership, I have accurate and reliable metrics that can be shared with stakeholders
Tasks to complete
- [x] #2859
- [ ] #2860
- [x] #2861
- [ ] #2943
- [ ] https://github.com/jupyterhub/jupyterhub/pull/4214 (Implement @yuvipanda's PR improving the accuracy of Grafana metrics for calculating active users)
Thanks a lot for opening this, @balajialg!
github.com/jupyterhub/jupyterhub-grafana deploys a set of common dashboards to a particular folder - in our case, https://grafana.datahub.berkeley.edu/dashboards/f/70E5EE84-1217-4021-A89E-1E3DE0566D93/jupyterhub-default-dashboards. I think this is the only one that's reliably consistent - the dashboards are version controlled, documented and somewhat understood by the broader community. I think every other dashboard is really one of us 'playing around', and I'm not sure how much I trust most of them.
Here's a suggestion on how to proceed.
- [x] Move all dashboards we are 'playing around with' to their own folder, so there is less confusion
- [ ] Add descriptions to all dashboards and panels in https://github.com/jupyterhub/grafana-dashboards
- [ ] Develop the 'usage metrics' dashboard some more - it can be very helpful for evangelism, but isn't in a usable state now.
How does this sound, @balajialg? If this sounds good, what kinda metrics will be useful for evangelism?
@yuvipanda These are awesome next steps! Makes a lot of sense. Moving the dashboards we are playing around with to separate folders and adding descriptions to all the dashboards would make it easier for interpretation.
In terms of metrics required for evangelism, I am looking at articulating our story in terms of our reach, impact, and the technical brilliance of the tool.
- REACH: What do our unique Daily Active Users (DAU) and Monthly Active Users (MAU) numbers look like? (Dissecting this data across hubs)
- IMPACT: How much time are our users spending cumulatively across all the hubs (daily/monthly/yearly)? Articulating that in terms of years would be a powerful metric for evangelizing the extensive usage we are observing.
- IMPACT: How many assignments are completed cumulatively on a daily/monthly/yearly basis?
- TECHNICAL BRILLIANCE: Metrics around the time it takes for us to auto-scale to 1000's of users. Any other metric to articulate the value prop from a technology standpoint would be amazing.
@yuvipanda These are desirable metrics. Let me know how many of these requests are actually feasible. Thanks for taking the time to look into this. Appreciate it.
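As a starting point for the REACH numbers above, here is a minimal sketch of how unique DAU/MAU could be pulled from Prometheus. It assumes user servers run as pods named `jupyter-<username>` and that kube-state-metrics exposes `kube_pod_info`; the endpoint URL is a placeholder and the pod-name pattern is an assumption about our setup, not a confirmed convention.

```python
# Hypothetical sketch: approximate DAU/MAU by counting distinct single-user
# pods seen over a time window. Assumes kube-state-metrics exposes
# kube_pod_info and user servers are pods named "jupyter-<username>".
import requests

PROMETHEUS = "https://prometheus.example.org"  # placeholder endpoint

def active_users(window: str) -> int:
    """Count distinct jupyter-* pods that reported at least once in `window`."""
    query = (
        'count(count by (pod) ('
        f'count_over_time(kube_pod_info{{pod=~"jupyter-.*"}}[{window}])'
        '))'
    )
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return int(float(result[0]["value"][1])) if result else 0

print("DAU:", active_users("24h"))
print("MAU:", active_users("30d"))  # wide windows like 30d can be slow to evaluate
```

Counting distinct pod names over a window only approximates distinct users, so treat it as a sanity check against the raw-log numbers rather than a source of truth.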
I've organized the dashboards into folders now:

The 'production' dashboards are clearly labelled now.
@yuvipanda Thanks for this.
When you are back from vacation on Monday (10/18), can you prioritize the following request?
For the tech strategy meeting on 10/21, Jim Colliander, Erfan, and Bill Allison (CTO) are joining us. @ericvd-ucb suggested that we use this time to share an interim update on fall usage and make the case for additional resourcing. Would you be able to fill in some of the details required for this deck? Please let me know which data points are not feasible to fetch at this juncture.
@yuvipanda Bringing this back to your attention to get your perspectives!
@balajialg I'm looking at slides 7, 8 and 9. I can produce data for 7 and 9, but I don't understand what 8 refers to. Can you expand a little bit?
@balajialg you should be able to get cost information from https://console.cloud.google.com/billing/013554-935B0A-B97AA1/reports;grouping=GROUP_BY_SKU;projects=ucb-datahub-2018?project=ucb-datahub-2018
@yuvipanda I don't have the required permission to view the billing information. Can you elevate the privileges for me?
done
@yuvipanda There seems to be a minor discrepancy between the Grafana data and the raw data shared with me. Sharing the snapshots for the last 30 days:
Grafana data from the past 30 days:
Analysis based on the raw data shared for the past 30 days:
Link to R notebook where I did the above analysis.
As I mentioned in the chart, it would be amazing to calculate the total time users spend in the hub from an evangelizing perspective. Given the mismatch between start and stop actions, it would be great if we could figure out a way to log users' stop actions (whenever the culler shuts down inactive servers). Based on the data below, we have a huge discrepancy between start and stop actions:
| action | count |
| --- | --- |
| start | 158703 |
| stop | 2099 |
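Given that gap, a rough way to still estimate cumulative time is to pair each start with the next stop for the same user and cap unmatched starts at the culler's inactivity timeout. A minimal sketch, assuming the raw export can be loaded as (user, action, timestamp) rows; the file name, column names, and the 60-minute cap are assumptions, not values from our deployment:

```python
# Hypothetical sketch: estimate per-session durations from start/stop events,
# capping sessions with a missing stop at an assumed culler timeout.
import pandas as pd

CULLER_TIMEOUT = pd.Timedelta(minutes=60)  # assumed inactivity cutoff, not our actual config

# Assumed raw-data shape: one row per event with columns user, action, timestamp.
events = pd.read_csv("hub_events.csv", parse_dates=["timestamp"]).sort_values("timestamp")

durations = []
open_starts = {}  # user -> timestamp of their most recent unmatched "start"

for row in events.itertuples():
    if row.action == "start":
        # A new start while another is still open means the stop was never logged;
        # assume the earlier session ended at the culler timeout.
        if row.user in open_starts:
            durations.append(CULLER_TIMEOUT)
        open_starts[row.user] = row.timestamp
    elif row.action == "stop" and row.user in open_starts:
        durations.append(row.timestamp - open_starts.pop(row.user))

# Sessions still open at the end of the export get the capped estimate too.
durations.extend([CULLER_TIMEOUT] * len(open_starts))

total = pd.Series(durations).sum()
print(f"Estimated cumulative usage: {total} (~{total / pd.Timedelta(days=365):.2f} years)")
```

This will understate long idle sessions and overstate very short ones, so it is an estimate rather than a measurement.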
We should completely discount the Grafana usage data - it was an experiment that hasn't been validated at all. The log data is definitely more accurate.
I agree on session lengths! I'll try to think of a way to properly measure that.
@yuvipanda That will be awesome!
@yuvipanda Highlighting some of our Grafana woes. Copy-pasting some of the Grafana results for CPU allocation today. We definitely need a way to solve this issue of Grafana not fetching data, to unblock @felder whenever he needs to analyze something.
- [x] @felder to work with @yuvipanda to fix some of the issues with Grafana. It might involve bumping up the limits of Prometheus!
@shaneknapp Just an FYI - Grafana improvements are something I wanted to discuss during our sprint planning meeting but missed adding to our monthly backlog. Some of the graphs in the dashboard often break when changing time intervals, resulting in empty responses. Here is an example from my exploration today:
Either the graph needs to be fixed, or it should provide a better error message from a user's standpoint.
@balajialg I believe the graphs are breaking because the responses are taking too long. If you click the red exclamation mark, you should be able to get an indication of why it's unhappy.
For example:


Basically the queries are timing out, meaning that Grafana is not getting the information as quickly as it expects to. So it's not really a matter of fixing the graph, unless the query that generates the graph is grossly inefficient.
In the past the solution was to allocate more RAM to Prometheus to speed it up. I'm not sure how sustainable that solution is: the more data that is collected, the slower it gets.
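To separate "Grafana gave up" from "the query itself is slow", here is a minimal diagnostic sketch that times a panel's query directly against the Prometheus HTTP API. The endpoint, the port-forward hint in the comment, and the example query are placeholders/assumptions; swap in the expression from the failing panel.

```python
# Hypothetical diagnostic: time a PromQL range query directly against the
# Prometheus HTTP API to see whether the query itself exceeds Grafana's timeout.
import time
import requests

PROMETHEUS = "http://localhost:9090"  # e.g. after port-forwarding the Prometheus service locally
QUERY = 'sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)'  # replace with the failing panel's expression

end = int(time.time())
start = end - 7 * 24 * 3600  # a week-wide window, similar to a broad dashboard range

t0 = time.monotonic()
resp = requests.get(
    f"{PROMETHEUS}/api/v1/query_range",
    params={"query": QUERY, "start": start, "end": end, "step": "5m"},
    timeout=120,
)
elapsed = time.monotonic() - t0

series = resp.json().get("data", {}).get("result", [])
print(f"HTTP {resp.status_code}, {elapsed:.1f}s, {len(series)} series returned")
```

If the query is slow even when run this way, the bottleneck is the query or Prometheus itself rather than Grafana's rendering, which helps decide between query optimization and more Prometheus resources.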
@felder Got it. So, optimizing the queries is the way forward? How does your PR here relate to this objective?
@balajialg optimizing might work... maybe?
You can get the same graphs/reports by running Prometheus locally on your laptop and executing the queries there. Perhaps it's time for me to remember to have @felder show me (and now you) how to do this. :)
@balajialg That PR was about fixing the queries because they were returning incorrect information. My PR doesn't have any relation to improving the efficiency of those queries.
I also do not know if optimizing the queries is the way forward, since I do not know whether the queries can be optimized further. An investigation (along with an education on PromQL query optimization) would be required. Here's one example result from a Google search for PromQL query optimization:
https://thenewstack.io/query-optimization-in-the-prometheus-world/
Assuming the queries are suboptimal, then yes, that might be one way to address this. However, if they cannot be optimized, we'll need to do something else. I just don't know whether throwing RAM at it is the way to go... it could be. However, it might also just be a game of whack-a-mole until adding RAM is no longer feasible.
@shaneknapp This seems like an interesting idea. Curious to see how my laptop would handle this volume of data (assuming it must be a large dataset). Looking forward to learning more about this.
@felder Got it. Based on what you said, it seems like this is more of an experimental project (like Google Filestore) in terms of a) figuring out whether there are alternatives to increasing RAM and optimizing queries, and b) finding a solution to optimize the queries.
- [ ] Bump the version of JupyterHub (define the relevant recent JupyterHub changes properly and decide)
- [ ] Scope this for the maintenance window during Spring break?
- [ ] Email the datahub-announce list to check on usage during the maintenance window and announce that there is one!