question : Grafana template filtering
grafana template : NPBackup v3 20250306
Unfiltered:
All configs are set up with a Machine Identifier structure like ${HOSTNAME}__${RANDOM}[4].
The x300 machine was run with multiple config files, which is visible in the Backup_Job filter:
Exceptions:
- Some items for the x300 machine appear duplicated with the same Machine Identifier. Should that not be the other randomized x300 Machine ID? Or is the ${Backup_Job} filter somehow wrong?
- In the Unfiltered view you can see one additional entry with the same Machine Identifier for each machine in the "Snapshot size in restore mode" panel.
Filtered:
When I filter on pop-os in the backup_job variable, most panels pick up this setting.
Exception:
- Global Tenant panels: here all backup_job values are still visible.
Question
Could you check at your side if you see the same behaviour with this Grafana template?
Did you by any chance restart the prometheus gateway in between runs ?
@deajan, the answer to that is yes. I could not remember, so I verified the status. The last "Started" was 2025-04-10 21:24:31, which is close to where the gap begins in the "last 30 days" graph above (the left half of the data ends April 10 23:29:00).
Using the push gateway, metrics stay until the push gateway is restarted (which IMO is a bad design choice, but here we are). So once it is restarted, the same metric will become a "new series" in prometheus.
In any case, metrics become a "new series" in prometheus when they go stale for more than 5 minutes.
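If you want to check this on your side, a query along these lines should list the jobs that currently expose more than one live series for the same metric (just an illustration, adapt the metric name as needed):

```promql
# backup_jobs with more than one live series for the same metric, e.g. after a
# pushgateway restart or after a label (version, repo, ...) changed between runs
count by (backup_job) (restic_total_duration_seconds) > 1
```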
Does this answer your question ?
@deajan I think it does for the duplicate series (item 2), but I had two other observations.
- x300werkstation is listed 4 times with _nFkn instead of the expected four available configs:
- no filtering in the Global Tenant stats.
I can debug item 1 further myself; item 3 should be easy for you to check?
For point 3: As its name suggests, it's a global display that uses the same metrics as shown above with filters, but tenant-wide without any other filters, so there's nothing wrong for me. Since you don't have tenants, that part of the dashboard is perhaps of no use to you.
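To make that concrete, the Global Tenant panels run queries of roughly this shape (illustrative only, not the dashboard's exact expressions): the same metrics, scoped per tenant, with no backup_job selector, so the backup_job filter cannot affect them.

```promql
# Hypothetical "Global Tenant" style query: same metric as the filtered panels,
# scoped by tenant only, with no backup_job selector, so that filter has no effect
sum by (__tenant_id__) (restic_snapshot_size_bytes{__tenant_id__=~"$tenant"})
```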
I'd still like to know why your setup doesn't "like" tenants; even in mono-tenant mode, you should have a default tenant value.
@deajan, I'd like to focus on the Tenant item; the others don't seem related to npbackup. How should I go about investigating that? I'm willing to experiment with my setup (described here: https://github.com/netinvent/npbackup/issues/153#issuecomment-2779795651)
The only thing I can think of right now is that on most client machines I filled in values for Machine Group by hand, i.e.
background info
I believe my Prometheus instance configuration is the default.
The metrics destination used for all machines is like this:
http://my_prometheus_pushgateway:9091/metrics/job/${BACKUP_JOB}
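For completeness, this is roughly how that destination sits in my client configs (a sketch from memory; the destination key name is taken from the default npbackup v3 config and may differ in yours):

```yaml
global_prometheus:
  metrics: true
  # "destination" key name assumed from the default npbackup v3 config,
  # verify against your own file; ${BACKUP_JOB} is expanded by npbackup at push time
  destination: http://my_prometheus_pushgateway:9091/metrics/job/${BACKUP_JOB}
```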
My clients do not all run the same npbackup version; oldest version is v3.0.0-rc15.
Another link to the other issue https://github.com/netinvent/npbackup/issues/153#issue-2962458957 (where I "reconfigured the backup_job variable to remove the tenant label filter") is that my Grafana dashboard is lacking the __tenant_id__ column, while the one on the github page has it included.
mine:
npbackup github:
TODO
- Replace my dashboard with the latest from github and check the columns. EDIT: done, my dashboard is/was the latest v3 2025.03.06. Without the tenant filter edit I see no columns at all 8-). After the edit I see what I pasted above: no tenant_id, but see below for the available metrics labels that I do see.
- Where do I find my default __tenant_id__ in the pushgateway metrics? Here are the npbackup_exec_state labels:
  action="backup" backup_job="redacted" group="redacted" instance="redacted" job="redacted" npversion="npbackup3.0.0-rc15-gui" repo_name="default" timestamp="1745596813"
Sorry for the late reply, I had a full-scale datacenter disaster recovery transition to plan and execute these days.
I guess you just don't have the tenant_id label since you're running mono-tenancy. Does the tenant variable in your dashboard have the "all" value set ?
@deajan, yes it has the "all" value but nothing else:
Would you mind trying to add .* as the custom "all" value for the tenant variable and telling me if it resolves your issue ?
Done, and no change. But what would it do if I do not have a tenant label in my pushgateway? Ultimately, what npbackup pushes to the prometheus pushgateway should be what ends up in the prometheus database, right? As posted earlier, here are my labels:
action="backup"
backup_job="redacted"
group="redacted"
instance="redacted"
job="redacted"
npversion="npbackup3.0.0-rc15-gui"
repo_name="default"
timestamp="1745596813"
You can see there is nothing related to tenant in my setup. These gateway labels are confirmed on my Prometheus Gateway webpage and by running npbackup with the --debug switch. Do you have a tenant label?
I do understand that you don't have multitenancy hence no __tenant_id__ label.
What I tried to achieve is replacing the "all" value with the regex .*, which should just work.
I've actually tried that with a non-existing label on my setup.
I don't really get why it doesn't work on your side. I'd really need a teamviewer / whatever session to your grafana to understand what happens here.
Do you think we could arrange a remote desktop session to your grafana setup so I can have a quick look ?
@deajan coming back to some of the original filtering questions. 8-)
- I see your screenshot for "Snapshot size in restore mode" also has double Legend entries for each backup_job. I played around with the panel but cannot get it fixed. The prometheus data seems fine for this query.
- Looking at the panel legends, I see you often use {{backup_job}}, which is fine as it is available in the pushgateway and in the prometheus database, and {{backup_type}}, which is blank as it is NOT available in any pushgateway/prometheus data. I also do not have that column in my data. Where does backup_type come from?
- Now I found out my latest "duplicate" entries are due to the different repos I use; I can make them visible by using {{job}} :: {{npversion}} in the Legend:
Unfortunately {{repo_name}} cannot be applied, since it is not part of restic_total_duration_seconds but rather of the npbackup_* metrics. Not sure how to get that into the Legends; I will play around with Grafana a bit more (see the sketch after this list).
I did manage to make a working repo_name variable for filtering.
- My graph "Snapshot size in restore mode" has been empty forever (see my very first screenshots at the top), which is interesting because the panel Legend manages to show the "Last" value correctly and my prometheus does have all restic_snapshot_size_bytes data to make a graph:
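(On the repo_name legend point above: the usual PromQL trick seems to be an info-style join that borrows the label from another metric, sketched below. It needs exactly one repo_name per backup_job/instance on the joined side, so with several repos per job it probably won't fly, and having repo_name on the restic_* metrics themselves looks like the cleaner fix.)

```promql
# Sketch of a label join: copy repo_name from npbackup_exec_state onto
# restic_total_duration_seconds so {{repo_name}} can be used in the legend.
# Only valid when each (backup_job, instance) pair has a single repo_name.
restic_total_duration_seconds
  * on (backup_job, instance) group_left (repo_name)
    group by (backup_job, instance, repo_name) (npbackup_exec_state)
```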
to be continued...
- I see your screenshot for "Snapshot size in restore mode" also has double Legend entries for each backup_job. I played around with the panel but cannot get it fixed. The prometheus data seems fine for this query.
In my case I updated my NPBackup version, so the metric got a change in npbackup_version which results in a new metric, hence showing old / new values for some time.
The thing here is that the pushgateway doesn't automatically remove earlier metrics unless it is restarted. There are multiple requests for pushgateway stale metric removal, but the prometheus team specifically doesn't want to implement them because it would make prometheus a push system; see their readme for more explanation. So in the end, you have double metrics.
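If you want to see which versions are still lingering on the gateway side, something along these lines should show it (illustrative only):

```promql
# live series per backup_job and npbackup version; old versions showing up here
# are metric groups the pushgateway kept around after the upgrade
count by (backup_job, npversion) (restic_total_duration_seconds)
```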
- Looking at the panel legends, I see you often use {{backup_job}} which is fine as it is available in pushgateway and in the prometheus database. {{backup_type}} which is blank as it is NOT available in any pushgateway/prometheus data. I also do not have that column in my data. Where does backup_type come from?
{{backup_type}} is my own way to add various labels for my backups.
It's a free string I add to my config files, e.g.:
global_prometheus:
  metrics: true
  additional_labels:
    backup_type: vm
- (o_O)
- [...] unfortunately {{repo_name}} cannot be applied since it is not part of restic_total_duration_seconds but rather in npbackup_* metrics. not sure how to get that in the Legends; I will play around with Grafana a bit more. I did manage to make a working repo_name variable for filtering.
Indeed, restic basically ignores what the repo name is, hence only the npbackup metrics got this label. I added it to all metrics in https://github.com/netinvent/npbackup/commit/eb2b2482849e11b15bfa4f110ce63a29c1d9c919
- My graph "Snapshot size in restore mode" has been empty forever (see my very first screenshots at the top), which is interesting because the panel Legend manages to show the "Last" value correctly and my prometheus does have all restic_snapshot_size_bytes data to make a graph:
I honestly don't have enough data to know what happens on your side. AFAIK, you told me that your setup is a container with grafana / prometheus / pushgateway. In order to replicate it: is your container up to date, and do you have any special configs running ?
@deajan
- I think I agree with your observation. Since my prior message, the legend items have cleared up or are grouped due to repo_name, which is understood.
- I see you already have some additional label info in the wiki. I was able to add my backup_type label, which now also shows in the table columns and graph legend.
- I have shared my docker setup files with you via mail; hopefully they can help you to create a test environment.
Thanks for the follow-up. I'll try to deploy a docker container and see what I can achieve from there.
Hello @GuitarBilly
So after a long period, I could finally take the time to see why the NPBackup dashboard doesn't work when there is no __tenant_id__ label.
The error was fairly simple, although it took me some time to find it.
When filtering backup_job by tenant, I should have used the =~ comparator instead of =.
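To illustrate (the exact variable query in the dashboard may differ, this is just the shape of it, using npbackup_exec_state as the source metric):

```promql
# With '=', a tenant value of ".*" is matched literally, so series without a
# __tenant_id__ label never match and the backup_job variable stays empty:
label_values(npbackup_exec_state{__tenant_id__="$tenant"}, backup_job)

# With '=~', ".*" is treated as a regex, which also matches an absent/empty
# label, so mono-tenant setups keep working:
label_values(npbackup_exec_state{__tenant_id__=~"$tenant"}, backup_job)
```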
I've uploaded a new version of the Dashboard. Could you confirm that it works on your setup ? Sorry for the (very) long delay.
Sure, I will give it a try. I will need to check my client configs, as I created some custom labels 'tenant' and 'tenant_id' in an attempt to fool the dashboard. 8-)
OK, so I got impatient. I imported your latest dashboard and.... things seem to work.
- no modification required with any label, all data shows.
- my "Snapshot size in restore mode" graph finally shows !
\o/ I'm really happy that this finally has been sorted out.
Although it was a stupidly simple typo, it was quite hard to track down for me.
Thank you for the feedback.
@deajan likewise, thanks for sticking with me.