Found unsequential head chunk files

Open akrzos opened this issue 2 years ago • 6 comments

I am running promdump in a test environment, and occasionally I cannot run the meta command because it fails with the following error message:

# kubectl promdump meta -n openshift-monitoring -p prometheus-k8s-0 -c prometheus -d /prometheus
time=2022-05-19T15:49:05Z caller=level.go:63 level=error error="found unsequential head chunk files chunks_head/000010 (index: 10) and chunks_head/000012 (index: 12)"
failed to exec command: command terminated with exit code 1

Is it possible to make promdump tolerant of unsequential head chunk files?

akrzos avatar May 19 '22 15:05 akrzos

Although we can update meta to ignore the error, the restore will likely fail anyway. I added a section on this to the FAQ (search for "Q: The promdump meta and promdump restore subcommands are failing with this error:"). The best bet is to remove chunks_head/000010 from the dump file. Some "head chunks" will be lost, but the majority of the data in the persistent blocks will still be preserved.
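
To make the reported condition concrete, here is a minimal sketch of a check that flags the same kind of gap the error message describes. It is only an approximation written for illustration, not the code promdump or Prometheus actually runs; the function name find_gaps is hypothetical, and it assumes head chunk files are named with zero-padded decimal indices, as in the listings in this thread.

```python
from typing import List, Tuple


def find_gaps(names: List[str]) -> List[Tuple[str, str]]:
    """Return adjacent pairs of chunk file names whose indices are not consecutive.

    Hypothetical helper: it mimics the condition in the error message,
    not the real Prometheus implementation.
    """
    # Assumption: head chunk files have purely numeric names such as "000010".
    ordered = sorted((n for n in names if n.isdigit()), key=int)
    # Compare each file with its successor; a step other than 1 is a gap.
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if int(b) - int(a) != 1]


if __name__ == "__main__":
    # File names taken from the error reported in this issue.
    print(find_gaps(["000010", "000012"]))
    print(find_gaps(["000010", "000011", "000012"]))
```

An empty result means the indices form one unbroken sequence, which is the state the meta and restore subcommands appear to expect.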

I normally only see this in OpenShift. Is that what you are using?

ihcsim avatar May 19 '22 18:05 ihcsim

I normally only see this in OpenShift. Is that what you are using?

Yes this is OpenShift.

Is there something that causes unsequential head chunk files in the first place?

akrzos avatar Jun 06 '22 17:06 akrzos

I never quite figured that out. I suspect it has something to do with parts of CMO that I don't understand. FWIW, I also haven't had a chance to try promdump with Thanos, Cortex, etc. To date, I haven't seen disjoint chunk files like that in prom.

ihcsim avatar Jun 06 '22 17:06 ihcsim

Hmm, perhaps we could just add a command/option to trim/remove unsequential head chunks:

# oc -n openshift-monitoring rsh prometheus-k8s-1 ls -l /prometheus/chunks_head/
total 376836
-rw-r--r--. 1 nobody nobody 123971101 Jun  8 13:00 000010
-rw-r--r--. 1 nobody nobody 113242593 Jun  8 14:30 000011
-rw-r--r--. 1 nobody nobody         8 Jun  8 14:39 000012
# oc -n openshift-monitoring rsh prometheus-k8s-0 ls -l /prometheus/chunks_head/
total 409604
-rw-rw-rw-. 1 nobody nobody         8 Jun  8 04:39 000006
-rw-r--r--. 1 nobody nobody 123533546 Jun  8 13:00 000010
-rw-r--r--. 1 nobody nobody 121632593 Jun  8 14:33 000011
# kubectl promdump meta -n openshift-monitoring -p prometheus-k8s-0 -c prometheus -d /prometheus
time=2022-06-08T14:40:03Z caller=level.go:63 level=error error="found unsequential head chunk files chunks_head/000006 (index: 6) and chunks_head/000010 (index: 10)"
failed to exec command: command terminated with exit code 1
# kubectl promdump meta -n openshift-monitoring -p prometheus-k8s-1 -c prometheus -d /prometheus
Head Block Metadata
------------------------
Minimum time (UTC): | 2022-06-08 12:00:00
Maximum time (UTC): | 2022-06-08 14:40:28
Number of series    | 623197

Persistent Blocks Metadata
----------------------------
Minimum time (UTC):     | 2022-06-07 19:58:43
Maximum time (UTC):     | 2022-06-08 12:00:00
Total number of blocks  | 4
Total number of samples | 902497366
Total number of series  | 2582714
Total size              | 1286064951

My guess is that this may be related to the fact that there are two running prometheus instances.

After trimming the file:

# oc -n openshift-monitoring rsh prometheus-k8s-0
sh-4.4$ ls -l
total 20
drwxr-sr-x. 3 nobody nobody    68 Jun  8 09:00 01G518NHTWV39BT234Q25DMFCB
drwxr-sr-x. 3 nobody nobody    68 Jun  8 11:00 01G51FH92WH3E7BW3W57JW7ZNG
drwxr-sr-x. 3 nobody nobody    68 Jun  8 11:00 01G51FJA7S7FXK0NAB9JYP3KHG
drwxr-sr-x. 3 nobody nobody    68 Jun  8 13:00 01G51PD0AV8X4PW2469Q6RG84Q
drwxr-sr-x. 2 nobody nobody    48 Jun  8 13:00 chunks_head
-rw-r--r--. 1 nobody nobody     0 Jun  7 19:58 lock
-rw-r--r--. 1 nobody nobody 20001 Jun  8 14:40 queries.active
drwxr-sr-x. 3 nobody nobody   145 Jun  8 14:27 wal
sh-4.4$ ls -l chunks_head
total 409604
-rw-rw-rw-. 1 nobody nobody         8 Jun  8 04:39 000006
-rw-r--r--. 1 nobody nobody 123533546 Jun  8 13:00 000010
-rw-r--r--. 1 nobody nobody 121632593 Jun  8 14:33 000011
sh-4.4$ rm -rf chunks_head/000006
sh-4.4$ exit
exit
# kubectl promdump meta -n openshift-monitoring -p prometheus-k8s-0 -c prometheus -d /prometheus
Head Block Metadata
------------------------
Minimum time (UTC): | 2022-06-08 12:00:00
Maximum time (UTC): | 2022-06-08 14:41:53
Number of series    | 623215

Persistent Blocks Metadata
----------------------------
Minimum time (UTC):     | 2022-06-07 19:58:43
Maximum time (UTC):     | 2022-06-08 12:00:00
Total number of blocks  | 4
Total number of samples | 902569130
Total number of series  | 2588592
Total size              | 1254129156
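
The manual trim above could be sketched as a small script. This is a rough illustration under stated assumptions, not a proposed promdump implementation: the names stale_head_chunks and trim are hypothetical, and the rule of keeping only the newest contiguous run of indices is inferred from the two cases in this thread (removing 000006 here, and 000010 in the original report). As noted earlier, the removed head chunks are lost, so it defaults to a dry run.

```python
import os
from typing import List


def stale_head_chunks(names: List[str]) -> List[str]:
    """Return the chunk files that precede the newest contiguous run of indices."""
    # Assumption: head chunk files have purely numeric names such as "000006".
    ordered = sorted((n for n in names if n.isdigit()), key=int)
    # Walk backwards from the highest index while the indices stay consecutive.
    start = len(ordered) - 1
    while start > 0 and int(ordered[start]) - int(ordered[start - 1]) == 1:
        start -= 1
    # Everything before the start of that run is a candidate for removal.
    return ordered[:max(start, 0)]


def trim(chunks_dir: str, dry_run: bool = True) -> List[str]:
    """List the stale files in chunks_dir and delete them unless dry_run is set."""
    stale = stale_head_chunks(os.listdir(chunks_dir))
    for name in stale:
        if not dry_run:
            os.remove(os.path.join(chunks_dir, name))
    return stale


if __name__ == "__main__":
    # File names taken from the prometheus-k8s-0 listing above.
    print(stale_head_chunks(["000006", "000010", "000011"]))
```

Calling trim on a directory with dry_run left at its default only reports what would be deleted, which seems safer to try first on a copy of the data.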

akrzos avatar Jun 08 '22 14:06 akrzos

Not a bad idea. I probably won't have time for it in the next few weeks. Would you be interested in putting a PR together?

ihcsim avatar Jun 11 '22 03:06 ihcsim

I would like to make a contribution here, just not sure when I can carve out the time. I will keep you posted.

akrzos avatar Jun 13 '22 14:06 akrzos