
question : Grafana template filtering

Open GuitarBilly opened this issue 8 months ago • 6 comments

Grafana template: NPBackup v3 20250306

Unfiltered:

All configs are set up with a Machine Identifier structure like ${HOSTNAME}__${RANDOM}[4]. The x300 machine was run with multiple config files, which is visible in the Backup_Job filter: Image

Exceptions:

  1. Some items for the x300 machine appear duplicated with the same Machine Identifier. Should that not be the other randomized x300 Machine ID? Or is ${Backup_Job} somehow wrong?

Image

  2. In the Unfiltered view you can see one additional entry with the same Machine Identifier for each machine in the "Snapshot size in restore mode" panel.

Filtered:

When I filter on pop-os in backup_job, most panels respect this setting. Image

Exception:

  3. Global Tenant panels: here all backup_jobs are still visible.

Question

Could you check on your side whether you see the same behaviour with this Grafana template?

GuitarBilly avatar Apr 17 '25 20:04 GuitarBilly

Did you by any chance restart the prometheus gateway in between runs ?

deajan avatar Apr 18 '25 16:04 deajan

@deajan, the answer to that is yes. I could not remember, so I verified the status. The last "Started" was 2025-04-10 21:24:31, which is close to where the gap begins in the "last 30 days" view of the graph above (the left half of the data ends April 10 23:29:00).

GuitarBilly avatar Apr 20 '25 17:04 GuitarBilly

Using the push gateway, metrics stay until the push gateway is restarted (which IMO is a bad design choice, but here we are). So once it is restarted, the same metric will become a "new series" in prometheus.

In any case, metrics become a "new series" when stale for more than 5 minutes in prometheus.
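For reference, a quick way to see which groups the push gateway is still holding (and how long ago they were last pushed) is to query its own push_time_seconds metric. A minimal sketch, assuming the push gateway itself is scraped by prometheus:

    # Seconds since each push gateway group last received a push;
    # large values are the stale groups that survive until the gateway restarts
    time() - push_time_seconds

    # Only show groups that have not been pushed for more than a day
    (time() - push_time_seconds) > 86400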

Does this answer your question ?

deajan avatar Apr 21 '25 07:04 deajan

@deajan I think it does for the duplicate series (item 2), but I had two other observations.

  1. x300werkstation is listed 4 times with _nFkn instead of the expected four available configs:

Image

  3. No filtering in the Global Tenant stats.

I can debug item 1 further myself; item 3 should be easy for you to check?

GuitarBilly avatar Apr 23 '25 12:04 GuitarBilly

For point 3: As its name suggests, it's a global display that uses the same metrics as shown above with filters, but tenant-wide without any other filters, so there's nothing wrong for me. Since you don't have tenants, that part of the dashboard probably isn't useful to you.

I'd still like to know why your setup doesn't "like" tenants; even in mono-tenant mode, you should have a default tenant value.

deajan avatar Apr 23 '25 12:04 deajan

@deajan, I'd like to focus on the Tenant item; the others don't seem related to npbackup. How should I go about investigating that? I'm willing to experiment with my setup (described here: https://github.com/netinvent/npbackup/issues/153#issuecomment-2779795651)

The only thing I can think of right now is that on most client machines I filled in values for Machine Group by hand, i.e.

Image

Background info

I believe my Prometheus instance configuration is default, Image

The metrics destination used for all machines looks like this: http://my_prometheus_pushgateway:9091/metrics/job/${BACKUP_JOB}
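For context, the last path element of that URL becomes the job label on everything pushed to that group, which is why the dashboard can filter on job. A minimal sketch of the equivalent manual push (the metric name is made up for illustration):

    # Pushes a dummy metric under the same grouping key the npbackup clients use,
    # so it shows up in prometheus with job="my_backup_job"
    echo 'example_metric 1' | curl --data-binary @- http://my_prometheus_pushgateway:9091/metrics/job/my_backup_job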

My clients do not all run the same npbackup version; oldest version is v3.0.0-rc15.

Another link to the other issue (https://github.com/netinvent/npbackup/issues/153#issue-2962458957, where I "reconfigured the backup_job variable to remove the tenant label filter"): my Grafana dashboard is lacking the __tenant_id__ column, while the one on the github page has it included.

mine:

Image

npbackup github:

Image

TODO

  • Replace my dashboard with the latest from github and check the columns. EDIT: done, my dashboard is/was already the latest v3 2025.03.06. Without the tenant filter edit I see no columns at all. 8-) After the edit I see what I pasted above: no tenant_id, but see below for the available metric labels that I do see.
  • Where do I find my default __tenant_id__ in the pushgateway metrics? Here are the npbackup_exec_state labels (see the sketch below): action="backup" backup_job="redacted" group="redacted" instance="redacted" job="redacted" npversion="npbackup3.0.0-rc15-gui" repo_name="default" timestamp="1745596813"
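One way to check whether any __tenant_id__ values exist at all is to ask prometheus directly for that label's values. A sketch against a plain prometheus instance (the hostname is assumed for illustration):

    # Lists every value prometheus knows for the __tenant_id__ label;
    # an empty "data" array means no stored series carries that label, matching the pushgateway output above
    curl 'http://my_prometheus:9090/api/v1/label/__tenant_id__/values'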

GuitarBilly avatar Apr 27 '25 17:04 GuitarBilly

Sorry for the late reply, I had a full-scale datacenter disaster recovery transition to plan / execute these days.

I guess you just don't have the tenant_id label since you're running mono-tenancy. Does the tenant variable in your dashboard have the "all" value set ?

deajan avatar May 11 '25 12:05 deajan

@deajan, yes it has the "all" value but nothing else:

Image

GuitarBilly avatar May 11 '25 15:05 GuitarBilly

Would you mind trying to add .* as custom all value to the tenant variable and tell me if it resolves your issue ?

deajan avatar May 11 '25 16:05 deajan

Done, and no change. But what would it do if I do not have a tenant label in my pushgateway? Ultimately, what npbackup pushes to the prometheus pushgateway should be what ends up in the prometheus database, right? As posted earlier, here are my labels:

 action="backup" 
 backup_job="redacted" 
 group="redacted" 
 instance="redacted" 
 job="redacted" 
 npversion="npbackup3.0.0-rc15-gui" 
 repo_name="default" 
 timestamp="1745596813" 

You can see there is nothing related to tenant in my setup. These gateway labels are confirmed on my Prometheus Gateway webpage and by running npbackup with the --debug switch. Do you have a tenant label?

GuitarBilly avatar May 11 '25 19:05 GuitarBilly

I do understand that you don't have multitenancy, hence no __tenant_id__ label. What I tried to achieve is to replace the "all" value with the regex .*, which should just work. I've actually tried that with a non-existing label on my setup.

I don't really get why it doesn't work on your side. I'd really need a teamviewer / whatever to your grafana to understand what happens here.

deajan avatar May 11 '25 19:05 deajan

Do you think we could arrange a remote desktop session to your grafana setup so I can have a quick look ?

deajan avatar May 13 '25 11:05 deajan

@deajan coming back to some of the original filtering questions. 8-)

  1. I see your screenshot for "Snapshot size in restore mode" also has double Legend entries for each backup_job. I played around with the panel but cannot get it fixed; the prometheus data seems fine for this query.

Image

  2. Looking at the panel legends, I see you often use {{backup_job}}, which is fine as it is available in the pushgateway and in the prometheus database. {{backup_type}}, however, is blank, as it is NOT available in any pushgateway/prometheus data. I also do not have that column in my data. Where does backup_type come from?

Image

  3. Now I found out my latest "duplicate" entries are due to the different repos I use; I can make them visible by using {{job}} :: {{npversion}} in the Legend:

Image

Unfortunately {{repo_name}} cannot be applied since it is not part of restic_total_duration_seconds but only of the npbackup_* metrics. Not sure how to get that into the Legends; I will play around with Grafana a bit more. I did manage to make a working repo_name variable for filtering.

  4. My graph for "Snapshot size in restore mode" has been empty forever (see my very first screenshots on top), which is interesting because the panel Legend manages to show the "Last" value correctly and my prometheus does have all restic_snapshot_size_bytes data to make a graph:

Image

to be continued...

GuitarBilly avatar May 31 '25 09:05 GuitarBilly

  1. I see your screenshot for "Snapshot size in restore mode" also has double Legend entries for each backup_job. I played around with the panel but cannot get it fixed; the prometheus data seems fine for this query.

In my case I updated my NPBackup version, so the metric got a change in npbackup_version which results in a new metric, hence showing old / new values for some time.

The thing here is that the pushgateway doesn't automatically remove earlier metrics unless restarted. There are multiple requests for pushgateway stale metrics removal, but the prometheus team specifically doesn't want to implement them because it would make prometheus a push system; see their readme for more explanation. So in the end, you have double metrics.
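(For completeness: the push gateway does allow deleting a group by hand, which is the only built-in way to get rid of such leftovers short of a restart. A sketch, with the job name made up:)

    # Removes every metric pushed under job="my_backup_job" from the push gateway
    curl -X DELETE http://my_prometheus_pushgateway:9091/metrics/job/my_backup_job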

  2. Looking at the panel legends, I see you often use {{backup_job}}, which is fine as it is available in the pushgateway and in the prometheus database. {{backup_type}}, however, is blank, as it is NOT available in any pushgateway/prometheus data. I also do not have that column in my data. Where does backup_type come from?

{{backup_type}} is my own way to add various labels for my backups. It's a free string I use to add to my config files, eg:

global_prometheus:
  metrics: true
  additional_labels:
    backup_type: vm
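With that in the client config, the pushed metrics should carry an extra backup_type="vm" label, so it can be used in legends or selectors like any other label, for example:

    # Only series coming from configs that set the additional label
    npbackup_exec_state{backup_type="vm"}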
  3. (o_O)

  [...] Unfortunately {{repo_name}} cannot be applied since it is not part of restic_total_duration_seconds but only of the npbackup_* metrics. Not sure how to get that into the Legends; I will play around with Grafana a bit more. I did manage to make a working repo_name variable for filtering.

Indeed, restic basically ignores what the repo name is, hence only npbackup metrics got this label. I added it to all labels in https://github.com/netinvent/npbackup/commit/eb2b2482849e11b15bfa4f110ce63a29c1d9c919
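In the meantime, one PromQL workaround is to borrow the label from an npbackup metric that already has it via a group_left join. A rough sketch, assuming a single repo_name per (job, instance) pair so the many-to-one matching doesn't error out:

    # Keep the restic value, attach repo_name from npbackup_exec_state
    restic_total_duration_seconds
      * on(job, instance) group_left(repo_name)
      group by(job, instance, repo_name) (npbackup_exec_state)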

  4. My graph for "Snapshot size in restore mode" has been empty forever (see my very first screenshots on top), which is interesting because the panel Legend manages to show the "Last" value correctly and my prometheus does have all restic_snapshot_size_bytes data to make a graph:

I honestly don't have enough data to know what happens on your side. AFAIK, you told me that your setup is a container with grafana / prometheus / pushgateway. In order to replicate: is your container up to date, and do you have any special configs running?

deajan avatar Jun 10 '25 16:06 deajan

@deajan

  1. I think I agree with your observation. Since my prior message the legend items have cleared up or are now grouped by repo_name, which is understood.

  2. I see you already have some additional label info in the wiki. I was able to add my backup_type label, which now also shows in the table columns and graph legend.

  3. I have shared my docker setup files with you via mail; hopefully they can help you create a test environment.

GuitarBilly avatar Jun 12 '25 19:06 GuitarBilly

Thanks for the followup. I'll try to deploy a docker and see what I can achieve from there.

deajan avatar Jun 13 '25 10:06 deajan

Hello @GuitarBilly

So after a long period, I could take the time to see why the NPBackup dashboard doesn't work when there is no __tenant_id__ label. The error was fairly simple although it took me some time to find it.

When filtering backup_job by tenant, I should have used the =~ comparator instead of =.

Image
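To spell out why that matters (the query shape below is illustrative, not the exact dashboard JSON): with =, the expanded "all" value is compared literally and can never match when the label doesn't exist, while =~ treats it as a regex that also matches series without the label.

    # backup_job variable query, with the tenant variable expanding to ".*" in mono-tenant setups
    label_values(npbackup_exec_state{__tenant_id__="$tenant"},  backup_job)   # broken: expects the literal value ".*"
    label_values(npbackup_exec_state{__tenant_id__=~"$tenant"}, backup_job)   # works: the regex also matches an absent label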

I've uploaded a new version of the Dashboard. Could you confirm that it works on your setup ? Sorry for the (very) long delay.

deajan avatar Aug 21 '25 11:08 deajan

Sure, I will give it a try. I will need to check my client configs, as I created some custom labels 'tenant' and 'tenant_id' in an attempt to fool the dashboard. 8-)

GuitarBilly avatar Aug 22 '25 21:08 GuitarBilly

OK, so I got impatient. I imported your latest dashboard and... things seem to work.

  1. No modification required with any label; all data shows.
  2. My "Snapshot size in restore mode" graph finally shows!
Image

GuitarBilly avatar Aug 22 '25 22:08 GuitarBilly

\o/ I'm really happy that this has finally been sorted out. Although it was a stupidly simple typo, it was quite hard for me to track down.

Thank you for the feedback.

deajan avatar Aug 22 '25 23:08 deajan

@deajan likewise, thanks for sticking with me.

GuitarBilly avatar Aug 23 '25 20:08 GuitarBilly