Optimise Prometheus S3 backup
We currently back up a lot more than needed. Can we optimise what we send to the S3 backup?
To explain a bit, this is the current live listing of our Prometheus storage:
BLOCK ULID MIN TIME MAX TIME DURATION NUM SAMPLES NUM CHUNKS NUM SERIES SIZE
01HV7PYRV06EKMG5EJGX95NKRD 2024-03-22 12:00:00 +0000 UTC 2024-04-11 18:00:00 +0000 UTC 485h59m59.972s 162441341145 1358039832 1891699 105GiB677MiB486KiB676B
01HWVVG2QQ7HBG3PQYHVPQ32HQ 2024-04-11 18:00:00 +0000 UTC 2024-05-02 00:00:00 +0000 UTC 485h59m59.972s 161366502626 1348402454 2231708 105GiB738MiB606KiB961B
01HYG021TYK0R68CF8KNZ4X3DG 2024-05-02 00:00:00 +0000 UTC 2024-05-22 06:00:00 +0000 UTC 485h59m59.972s 161891891023 1350709179 1648828 106GiB5MiB593KiB854B
01J044KGT663DCDA5JHEYXQMEE 2024-05-22 06:00:00 +0000 UTC 2024-06-11 12:00:00 +0000 UTC 485h59m59.972s 161659122034 1344446853 1682969 106GiB833MiB163KiB889B
01J1R94V4SVDHTBRTWT1N4HV92 2024-06-11 12:00:00 +0000 UTC 2024-07-01 18:00:00 +0000 UTC 485h59m59.971s 162373700750 1357260875 1776370 106GiB700MiB350KiB683B
01J3CDQ5H1CREVFNVNS9DB3YQR 2024-07-01 18:00:00 +0000 UTC 2024-07-22 00:00:00 +0000 UTC 485h59m59.971s 162456327858 1357452067 1688092 106GiB456MiB575KiB87B
01J50J7NE30QV0J5DGWNRPME9P 2024-07-22 00:00:00 +0000 UTC 2024-08-11 06:00:00 +0000 UTC 485h59m59.971s 161573958302 1349918951 1673259 105GiB945MiB224KiB602B
01J6MPS5XAFQ86GYBVAE7W0C4M 2024-08-11 06:00:00 +0000 UTC 2024-08-31 12:00:00 +0000 UTC 485h59m59.971s 161682386999 1351098202 1913633 106GiB365MiB814KiB693B
01J88VB2RHFJS74D6AGYZDKR8P 2024-08-31 12:00:00 +0000 UTC 2024-09-20 18:00:00 +0000 UTC 485h59m59.971s 162537219885 1358344461 1862225 108GiB14MiB662KiB601B
01J9WZX6H8ERFMCR5ZFE76PSF4 2024-09-20 18:00:00 +0000 UTC 2024-10-11 00:00:00 +0000 UTC 485h59m59.971s 166073207118 1387802075 1883579 109GiB411MiB336KiB821B
01JBH4ERRAZGQT99CHPQ8R55B2 2024-10-11 00:00:00 +0000 UTC 2024-10-31 06:00:00 +0000 UTC 485h59m59.971s 168475719637 1407670452 1756051 110GiB824MiB123KiB209B
01JD590FA7A5X9EPAXWWV5QFQ3 2024-10-31 06:00:00 +0000 UTC 2024-11-20 12:00:00 +0000 UTC 485h59m59.971s 167755433175 1401731472 2395345 113GiB43MiB366KiB198B
01JESDJ4675TBS6TPV4NG77R2T 2024-11-20 12:00:00 +0000 UTC 2024-12-10 18:00:00 +0000 UTC 485h59m59.971s 168309268213 1406269504 1664327 116GiB78MiB738KiB443B
01JGDJ3S9F4HWBZ9FNPFNG3K4G 2024-12-10 18:00:00 +0000 UTC 2024-12-31 00:00:00 +0000 UTC 485h59m59.878s 163832463513 1368474332 1726922 111GiB668MiB652KiB443B
01JJ1PMQZ0XX7NHCXR06X7GSFW 2024-12-31 00:00:00 +0000 UTC 2025-01-20 06:00:00 +0000 UTC 485h59m59.878s 165987486674 1387538098 1697840 110GiB928MiB720KiB412B
01JKNV6D56PX0VJNKEH458A6QK 2025-01-20 06:00:00 +0000 UTC 2025-02-09 12:00:00 +0000 UTC 485h59m59.795s 162794120196 1360614092 1618585 111GiB1002MiB487KiB674B
01JN9ZQDNY1A4T4EEBNBDGMSEH 2025-02-09 12:00:00 +0000 UTC 2025-03-01 18:00:00 +0000 UTC 485h59m59.795s 162224716013 1354949845 1609941 110GiB247MiB751KiB95B
01JPY496R0FPBQTSH0QDSZ6KQ6 2025-03-01 18:00:00 +0000 UTC 2025-03-22 00:00:00 +0000 UTC 485h59m59.795s 162031809809 1353898292 1698229 109GiB511MiB629KiB359B
01JRJ8TZHYNDH1GZDK8KYFGDKB 2025-03-22 00:00:00 +0000 UTC 2025-04-11 06:00:00 +0000 UTC 485h59m59.795s 163281129208 1364549752 1723569 108GiB1011MiB496KiB564B
01JT6DCDRTR14AKP57ZZEZ5EK9 2025-04-11 06:00:00 +0000 UTC 2025-05-01 12:00:00 +0000 UTC 485h59m59.795s 163273695534 1363235863 1654564 107GiB677MiB715KiB110B
01JTQS84X74XF99GXHB58GR365 2025-05-01 12:00:00 +0000 UTC 2025-05-08 06:00:00 +0000 UTC 161h59m59.795s 51403124251 431166410 1618222 33GiB544MiB696KiB406B
01JV95E56QP89BF9P7N7YTGWW5 2025-05-08 06:00:00 +0000 UTC 2025-05-15 00:00:00 +0000 UTC 161h59m59.795s 54280780359 454908328 1588896 35GiB69MiB702KiB1018B
01JV9STBARBXCPASFFQD3MQ1AP 2025-05-15 00:00:00 +0000 UTC 2025-05-15 06:00:00 +0000 UTC 5h59m59.795s 1987303520 16614635 1532920 1GiB382MiB747KiB639B
01JVAEEW3QDTRC5GF668Q9DGEB 2025-05-15 06:00:00 +0000 UTC 2025-05-15 12:00:00 +0000 UTC 5h59m59.795s 1997349892 16697642 1538651 1GiB432MiB870KiB118B
01JVA7G5HZ9RS354T76CXM9C6T 2025-05-15 12:00:00 +0000 UTC 2025-05-15 14:00:00 +0000 UTC 1h59m59.795s 666811052 5426419 1539125 579MiB734KiB757B
01JVAEDBQH49PR6DAGH0V52V2P 2025-05-15 14:00:00 +0000 UTC 2025-05-15 16:00:00 +0000 UTC 1h59m59.795s 667223513 5578419 1538947 563MiB497KiB959B
01JVAN7P0E80PQ3DQ96PH3HBT6 2025-05-15 16:00:00 +0000 UTC 2025-05-15 18:00:00 +0000 UTC 1h59m59.795s 668484551 5588999 1541650 568MiB301KiB486B
Each day we take a snapshot of that and sync it to S3 as backup.
The problem is that while the 486h blocks are complete, the smaller blocks from the last few weeks are intermediates: they get compacted into progressively larger blocks and are then removed, until eventually there is a new 486h block. This means that S3 contains duplicate data, and it is also hard to work out what to restore.
So the idea is to split the snapshot into full blocks and non-full blocks and send them to separate buckets. The second bucket could then expire anything more than a month old, as we only need the last few weeks of non-full blocks to cover the period since the last full block.
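A rough sketch of how the split could be done, assuming the snapshot directory contains one sub-directory per block, each with the usual `meta.json` holding `minTime` and `maxTime` in milliseconds. The 486h figure is taken from the listing above; the one-minute tolerance and the name `classify_blocks` are made up for illustration and would need adjusting to our setup.

```python
import json
from pathlib import Path

# Assumed values: full blocks span about 486h in the listing above.
FULL_BLOCK_MS = 486 * 60 * 60 * 1000
# Hypothetical tolerance, to absorb the sub-second shortfall seen in the listing.
TOLERANCE_MS = 60 * 1000


def classify_blocks(snapshot_dir, full_ms=FULL_BLOCK_MS, tolerance_ms=TOLERANCE_MS):
    """Return (full, partial) lists of block directory names under snapshot_dir."""
    full, partial = [], []
    for meta_path in sorted(Path(snapshot_dir).glob("*/meta.json")):
        meta = json.loads(meta_path.read_text())
        # Block time range in milliseconds, as recorded in meta.json.
        duration_ms = meta["maxTime"] - meta["minTime"]
        target = full if duration_ms >= full_ms - tolerance_ms else partial
        target.append(meta_path.parent.name)
    return full, partial
```

The two lists could then drive two separate sync runs, one per bucket. Full blocks never change once written, so that sync only ever needs to add new directories, while the bucket for partial blocks can rely on an expiry rule to clean up.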
Here is a document which describes how to make a TSDB Snapshot for backup: https://gist.github.com/ksingh7/d5e4414d92241e0802e59fa4c585b98b
What exactly do you think we're doing? Exactly that!