jupyter_server

add metric for kernel restarts

Open minrk opened this issue 2 years ago • 16 comments

labels: kernel_name = kernel name, source = "restarter" or "user"

I don't love type as a label, but it's already used in the running-kernels metric, so it's probably more important that it match other metrics than that it use the more logical name or kernel_name found elsewhere in the API. The same goes for source, which distinguishes API ('user') restarts from automatic restarts by the KernelRestarter ('restarter'). I also considered trigger=user|crash (or 'exit' instead of 'crash', which is slightly more precise).

closes #1240
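
For illustration, a minimal sketch of what such a counter could look like with prometheus_client; the metric name follows the jupyter_kernel_restarts example discussed later in this thread, and everything else (module placement, exact call sites, label values) is an assumption rather than the PR's actual code:

```python
# Minimal sketch, assuming prometheus_client and the label scheme described above;
# the metric name and call sites are illustrative, not the PR's actual code.
from prometheus_client import Counter

KERNEL_RESTARTS = Counter(
    "jupyter_kernel_restarts",      # exposed as jupyter_kernel_restarts_total
    "Total number of kernel restarts",
    ["kernel_name", "source"],      # source is "restarter" (auto) or "user" (API)
)

# Incremented wherever a restart is handled, e.g.:
KERNEL_RESTARTS.labels(kernel_name="python3", source="restarter").inc()
KERNEL_RESTARTS.labels(kernel_name="python3", source="user").inc()
```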

minrk · Mar 22 '23

nbclassic's shimming appears to do some weird stuff, causing this same module to get imported twice under different names, with the canonical name attempting to import the deprecated name. I will have to think about the best way to resolve that, since the current 'try: import, except: define' pattern doesn't make sense if nbclassic is up-to-date and notebook is not present.
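
For context, the guard being described looks roughly like the pattern below (a reconstruction from the description above, so treat the exact import paths and metric definitions as approximate): the canonical jupyter_server module tries to reuse metrics already defined under the deprecated notebook name, and only defines its own when that import fails.

```python
# Rough reconstruction of the try-import / except-define pattern described above;
# exact module paths and metric definitions are approximate.
try:
    # Reuse metric objects if the deprecated notebook package already defined them,
    # so the same names are not registered twice.
    from notebook.prometheus.metrics import (
        HTTP_REQUEST_DURATION_SECONDS,
        KERNEL_CURRENTLY_RUNNING_TOTAL,
        TERMINAL_CURRENTLY_RUNNING_TOTAL,
    )
except ImportError:
    # Otherwise define them here, in jupyter_server.
    from prometheus_client import Gauge, Histogram

    HTTP_REQUEST_DURATION_SECONDS = Histogram(
        "http_request_duration_seconds",
        "duration in seconds for all HTTP requests",
        ["method", "handler", "code"],
    )
    KERNEL_CURRENTLY_RUNNING_TOTAL = Gauge(
        "kernel_currently_running_total",
        "counter for how many kernels are running, labeled by type",
        ["type"],
    )
    TERMINAL_CURRENTLY_RUNNING_TOTAL = Gauge(
        "terminal_currently_running_total",
        "counter for how many terminals are running",
    )
```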

minrk · Mar 22 '23

I agree that kernel_name (or whatever we actually use in kernelspec) is a better label to use. We can add that to the existing kernel metric, and use just that (and deprecate 'type' in the existing metric) - what do you think of that? That allows us to be consistent moving forward as we add more metrics. Matching terminology between kernelspec and metrics seems important.

yuvipanda · Mar 22 '23

I do like 'source' better, as that looks like something we can be more 'sure' of - we know the restarter restarted it, or the user restarted it. I feel 'trigger' implies more causality than exists.

yuvipanda · Mar 22 '23

We can add that to the existing kernel metric, and use just that (and deprecate 'type' in the existing metric) - what do you think of that?

I like it, except it's complicated by the notebook -> jupyter_server migration: the prometheus client enforces metric-name uniqueness at registration (import) time, so we can't change the notebook metric.

I think it would probably be good to put a jupyter_ or jupyter_server_ prefix on all our metrics. That may be a path forward to avoid the circular import problem, too. It's technically a backward-incompatible change, though, so I'm not sure the best way to approach it.
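
To make that constraint concrete, a small sketch of how the prometheus client behaves when the same metric name is registered twice in the default registry (the metric name here is just an example):

```python
from prometheus_client import Gauge

# The first definition registers the name in the default REGISTRY at import time.
Gauge("kernel_currently_running_total", "running kernels", ["type"])

# A second definition with the same name -- e.g. the same module imported under
# both its notebook and jupyter_server names -- fails immediately:
#   ValueError: Duplicated timeseries in CollectorRegistry: {...}
Gauge("kernel_currently_running_total", "running kernels", ["type"])
```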

minrk · Mar 22 '23

YESSS let's prefix it all. We do that for jupyterhub too.

Maybe we leave the current ones as they are, replicate them as new metrics with the prefix, and get rid of the old ones in a major release?

yuvipanda · Mar 22 '23

Codecov Report

Patch coverage: 58.33% and project coverage change: +1.22% :tada:

Comparison is base (60d7cca) 79.11% compared to head (44c5bb8) 80.33%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1241      +/-   ##
==========================================
+ Coverage   79.11%   80.33%   +1.22%     
==========================================
  Files          68       68              
  Lines        8263     8274      +11     
  Branches     1600     1602       +2     
==========================================
+ Hits         6537     6647     +110     
+ Misses       1304     1203     -101     
- Partials      422      424       +2     
Impacted Files                                      Coverage Δ
jupyter_server/prometheus/metrics.py                69.23% <50.00%> (-30.77%) :arrow_down:
jupyter_server/services/kernels/kernelmanager.py    81.38% <50.00%> (-0.44%) :arrow_down:
jupyter_server/services/kernels/handlers.py         89.18% <100.00%> (+0.30%) :arrow_up:

... and 4 files with indirect coverage changes


codecov[bot] · Mar 22 '23

Yeah, I think that makes sense. For now, I just updated the new metric to follow that pattern, and I can do a separate PR to deprecate the old metrics.

minrk · Mar 22 '23

Should it be jupyter_ or jupyter_server_?

yuvipanda · Mar 22 '23

Should it be jupyter_ or jupyter_server_?

I literally changed it 3 times while writing it. I don't know! jupyter_ makes a little more sense to me, since you get things like jupyter_kernel_restarts, but I'm certainly open to jupyter_server. It's pretty long, especially since everything also gets a subsection_ prefix like http_, kernel_, etc.

minrk · Mar 22 '23

Thanks for this PR @minrk.

I think a "standard" jupyter_ prefix is best (and consistent with "factory" env names). I also agree with @yuvipanda to use the term from the kernelspec, which would be "name" over "type"; "kernel_name" works as well.

Regarding "Should the metric include the kernel ID?", I think all metrics should include a "subject" indicator when applicable (and not considered PII) and, in this case, kernel_id is extremely applicable. I could definitely see cluster admins needing to correlate the "source = restarter" restart metric to a particular kernel (and therefore user) when they see resources being depleted on various nodes because the kernel's restart (due to OOM) is bouncing around the cluster.

kevin-bates · Mar 22 '23

@kevin-bates thanks! For metrics, prometheus recommends avoiding labels with too much churn, because each unique value for a label can be costly. I think this might be a case where monitoring (prometheus) metrics and something like a Jupyter Server Event might have different levels of detail (and I think we should indeed have both for restarts):

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

Kernel IDs, being UUIDs unique to every kernel across time, seem to fit this description (kubernetes pod names also fit, but while technically unbounded, they usually grow relatively slowly on the order of prometheus data retention).
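
As a rough back-of-the-envelope illustration of that cardinality argument (all numbers made up):

```python
# Back-of-the-envelope cardinality estimate; every number here is invented.
kernel_names = 5                 # distinct kernelspecs in a deployment
sources = 2                      # "restarter" and "user"
kernels_in_retention = 10_000    # kernels started within prometheus' retention window

series_without_id = kernel_names * sources           # at most 10 time series
# With kernel_id as a label, each kernel contributes its own series per source:
series_with_id = kernels_in_retention * sources      # ~20,000 series, and growing
print(series_without_id, series_with_id)
```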

So my inclination is to exclude the kernel id for now.

particular kernel (and therefore user)

Identifying the server is definitely useful. Metrics from each instance are still stored separately, but I think not identified by default. I believe prometheus scrape config can add the server as a label via e.g. __meta_kubernetes_pod_name (in kubernetes) or some such. I'm not sure how easy that is, or if we should somehow add self-identifying labels to all metrics in the server.

minrk · Mar 23 '23

I'm also on the fence for whether to use kernel_name vs name. I think the kernel_ prefix is useful when the namespace is ambiguous, and prometheus metric labels are a flat namespace, which can merge information from multiple sources (e.g. adding kubernetes and prometheus metadata fields), so I think the kernel_ prefix may be appropriate here. I don't feel strongly, though, and matching exactly makes sense, too!

minrk · Mar 23 '23

IMO, I've somewhat strong feelings about jupyter_server vs jupyter_. jupyter as a word is super diffuse, and prefixing the metric with the software package where that is coming from is extremely useful. If I just see a jupyter_ prefix, it could be coming from literally hundreds of packages by now...

yuvipanda · Mar 23 '23


For metrics, prometheus recommends avoiding labels with too much churn, because each unique value for a label can be costly.

I didn't realize prometheus had this "pattern/footprint" behavior and was going to ask about events as well. Given prometheus' recommendation I would agree that not including the kernel id makes sense - although much less helpful from a diagnostics perspective. The corresponding event will need to include the kernel_id.

Since the prometheus namespace is flat, using kernel_name as the label makes sense as well, although could 'kernel_' be inferred from the metric name jupyter_kernel_restarts? Can you provide the result of a metric capture?

IMO, I've somewhat strong feelings about jupyter_server vs jupyter_. jupyter as a word is super diffuse, and prefixing the metric with the software package where that is coming from is extremely useful. If I just see a jupyter_ prefix, it could be coming from literally hundreds of packages by now...

It makes sense to use the package name as the prefix, although I'm never going to propose a prefix of jupyter_enterprise_gateway_ unless we as a community adopt it as a convention. I guess I was hoping folks could infer the source of the metric based on other contextual information.

kevin-bates · Mar 23 '23
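
For readers wondering what such a capture might look like: a minimal, purely hypothetical sketch using prometheus_client's exposition output (metric name, labels, and counts are illustrative and not taken from this PR):

```python
# Hypothetical sketch of a /metrics capture for the proposed counter; names and
# numbers are illustrative only, not output produced by this PR.
from prometheus_client import CollectorRegistry, Counter, generate_latest

registry = CollectorRegistry()
restarts = Counter(
    "jupyter_kernel_restarts",
    "Total number of kernel restarts",
    ["kernel_name", "source"],
    registry=registry,
)
restarts.labels(kernel_name="python3", source="restarter").inc(3)
restarts.labels(kernel_name="python3", source="user").inc()

print(generate_latest(registry).decode())
# Abridged output (the _created series is omitted):
# # HELP jupyter_kernel_restarts_total Total number of kernel restarts
# # TYPE jupyter_kernel_restarts_total counter
# jupyter_kernel_restarts_total{kernel_name="python3",source="restarter"} 3.0
# jupyter_kernel_restarts_total{kernel_name="python3",source="user"} 1.0
```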