opencost Monitoring Kubecost?

What is the recommended way to monitor that the cost model is working as expected? For example, this page is highly valuable for monitoring: But it's not exposed as metrics... How would you recommend I monitor this product?

Dec 31 '19 14:12 omerlh

Hi Omer, good question. What exactly are you looking to monitor? If cost model metrics are being collected? Or that the cost model is correctly able to serve data? Or something else?

Jan 01 '20 17:01 dwbrown2

I'm looking to monitor that the cost model is working correctly, so that I can trust the data when I need it.

Jan 01 '20 18:01 omerlh

The best approach to do this continuously is likely to use the aggregated cost model API. This reports the same data as the Kubecost frontend and would go above and beyond what the Prometheus diagnostic test does.

Jan 02 '20 20:01 dwbrown2

How would you recommend using it? Getting aggregation for the last 5 minutes?

Jan 06 '20 12:01 omerlh

Depends on exactly what you are looking to confirm, but you could use a short time window, e.g. 1-5 minutes, as you suggest!

Jan 06 '20 19:01 dwbrown2

Understood. Is there any chance of implementing it as a Prometheus metric? It would be a lot easier to monitor :)

Jan 07 '20 06:01 omerlh

Yep, this can be accomplished! You would need to monitor several metrics for this to be complete, or you could do the simple test that our /metrics is UP.

Jan 07 '20 23:01 dwbrown2

Which metrics should I use? Monitoring /metrics (or just using the up metric) only ensures that the service is running, no?

Jan 08 '20 06:01 omerlh

Depends on your cluster, e.g. whether it uses GPUs. But monitoring a subset of these metrics is likely best: https://github.com/kubecost/cost-model/blob/master/PROMETHEUS.md#available-metrics

Jan 08 '20 22:01 dwbrown2

I understand, thanks!

Jan 13 '20 19:01 omerlh

I ended up using the following query:

absent(node_total_hourly_cost) == 1

What do you think about adding a Prometheus rule to the chart?
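
For anyone who wants the same check outside of Prometheus, the idea behind the absent() query above can be sketched in Python: scan the text returned by /metrics and report which expected metric names have no sample. This is only a minimal sketch. The function name missing_metrics and the SAMPLE scrape body are hypothetical, and node_ram_hourly_cost is assumed to be one of the exported metrics; check the linked PROMETHEUS.md for the names that apply to your cluster.

```python
def missing_metrics(exposition_text, expected):
    """Return the expected metric names that have no sample in a /metrics scrape."""
    seen = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like `name{labels} value` or `name value`.
        head = line.split("{", 1)[0].split()
        if head:
            seen.add(head[0])
    return sorted(set(expected) - seen)


# Hypothetical scrape body, for illustration only (not real output).
SAMPLE = """\
# TYPE node_total_hourly_cost gauge
node_total_hourly_cost{instance="node-1"} 0.19
"""
```

With this sample, missing_metrics(SAMPLE, ["node_total_hourly_cost", "node_ram_hourly_cost"]) reports only node_ram_hourly_cost as missing. A non-empty result plays the same role as absent(...) == 1. Like the query, it only shows that the metric is being exported, not that its values are correct.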

Jan 30 '20 15:01 omerlh

Yeah, we can explore adding something like this for you if it would be helpful and our product is overall a fit for you!

Jan 30 '20 18:01 dwbrown2

@omerlh are you still using the product? We’re reviewing priorities for our next sprint. Would you want to discuss soon?

Apr 07 '20 20:04 AjayTripathy

Sure, please also add @shaikatz to the discussion :)

Apr 13 '20 07:04 omerlh

Hey, the checks appearing on diagnostics.html would be great to have as metrics. I am happy to contribute this if the maintainers think the feature is worth having upstream.

Use case: Let's say I screw up my IAM policy and cost-model can no longer access the spot data feed, or the Athena table name in my config is wrong because of an incorrect value set during an update. I would like to get alerted about this. For open-source users, having all the checks in the form of metrics would let them use the tools of their choice to create these monitors and alert their preferred channels based on internal SLAs.

Apr 07 '22 12:04 smitthakkar96

@smitthakkar96 are there any other checks in diagnostics.html that would be most helpful? All of the checks below currently have underlying Prometheus measures that you can monitor.

I do agree adding one for cloud integrations would be interesting.

Apr 08 '22 03:04 dwbrown2

Maybe it would be helpful to just document what these Prometheus monitoring queries are? @kbrwn this will also pertain to monitoring the hosted solution... want to take a first pass at that documentation?

Apr 11 '22 20:04 AjayTripathy

@dwbrown2 @AjayTripathy It is not very clear which queries in the cost-analyzer-frontend repo are used to make these checks. Maybe I am just looking at the wrong file? queries.js?

There are multiple issues about this topic. I agree we can document some monitoring tips right now using the existing metrics. However, I think there are several issues on this topic because users want metrics related to the specific performance and actions of Kubecost. It seems like we just ask the same questions on these issues over and over instead of taking the initiative. A user may not be able to articulate exactly what they want to monitor. Users may not know the important components of Kubecost. Here are some examples of monitoring metrics we could provide:

histogram metric for frontend response times and requests
histogram metric for ETL build times
histogram metric for API query times
counter metric for API queries
counter metrics for 400/500 errors
gauge metric for number of concurrent requests
gauge metric for # of green ETL days
counter metrics for queries to Prometheus
histogram/summary metric for Prometheus query response times
counter metric for # of metrics produced by the Kubecost pod
counter metric for failed cloud cost report queries

Apr 11 '22 22:04 kbrwn

We are interested in the entire list @kbrwn posted, with a focus on:

histogram metric for frontend response times and requests
histogram metric for API response times and requests
histogram metric for ETL build times
histogram metric for ETL file sizes
gauge metric for # of green ETL days
gauge metric for # of NON-green ETL days

Apr 27 '22 20:04 MrColeC

This is great, thanks for sharing @MrColeC! We have this on the docket for our upcoming release (1.94). We'll start to review next week and will find owners soon after.

cc @AdamStack18

Apr 27 '22 23:04 dwbrown2

Closing in favor of https://github.com/kubecost/docs/issues/304

@AjayTripathy or @dwbrown2, could you add me to the OpenCost organization when you have time? I need permissions to close out and label issues.

Jul 21 '22 20:07 Adam-Stack-PM

This issue is being closed because it may not be relevant to the OpenCost project and appears stale. If you feel this was closed in error, please open a new OpenCost issue with updated details, or, if it is still relevant for Kubecost, please open an issue with Kubecost Support.

Apr 14 '23 07:04 mattray