thanos
thanos copied to clipboard
Improve compatibility testing against Prometheus and TSDB.
Let's start with acceptance criteria for our compatibity tests:
- I would like to know with which versions of Prometheus Thanos supports on each PR.
But what we mean by supports? We have essentially 2 points of contact:
- TSDB format (including very low level index and metadata scheme)
- HTTP API (
api/v1/flags,api/v1/config,api/v1/label/,api/v1/read, api/v1/snapshot` etc)
Storage format can change but is versioned (index and metadata separatedly e.g index version changed somewhere between 2.0 and 2.2.1), HTTP API should not for v1 but things get added (e.g api/v1/flags was added in v2.2.1, snaphot endpoint was extended etc)
Goal: Support all minor Prometheus versions (e.g 2.0, 2.2, ... 2.7.. etc) There are expections. For example broken Prometheus releases like 2.1.x. This means that we would like to test and support to tip of minor version (e.g 2.4.3 for 2.4).
How we test this now?
Now (Before https://github.com/improbable-eng/thanos/pull/704 PR or https://github.com/improbable-eng/thanos/pull/730 lands, depending which will land first), our current method for testing compatibility is to perform on CI:
SUPPORTED_PROM_VERSIONS ?=v2.2.1 v2.3.2 v2.4.3 v2.5.0
@for ver in $(SUPPORTED_PROM_VERSIONS); do \
THANOS_TEST_PROMETHEUS_PATH="prometheus-$$ver" THANOS_TEST_ALERTMANAGER_PATH="alertmanager-$(ALERTMANAGER_VERSION)" go test $(shell go list ./... | grep -v /vendor/ | grep -v /benchmark/); \
done
This runs ALL our tests with different THANOS_TEST_PROMETHEUS_PATH var which controlled which Prometheus binary is used for our e2e tests (we have quite few of them. All tests that ends up with e2e suffix in name). This tests were fine to check if we support our common points as mentioned above.
The problems we see:
- Upgrade of TSDB Golang dependencies (like here) blocks our ability to do any advance testing methods like injecting blocks here or here. This is because, obviously as new Promethus versions are backward compatible with old TSDB format versions, the old Prometheus versions are not forward compatible with new format.
- We run ALL tests against different Prometheus versions using external for loop. This means:
- We might hit golang using cache all the time, because changing some environment variable is not seen by caching logic and it can assume code being not changed, thus cache being used.
- Something is wrong with signal handling, as some tests are green, but actually should fail: https://circleci.com/gh/improbable-eng/thanos/1935?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
- What if config/flags change in some Prometheus version?
Note that upgrading TSDB and Prometheus dependencies is essential to stay up to date with fixes and recent optimizations. We reuse lots of packages.
Extra: As a nice-to-have we would like to make sure anyone can grab TSDB block from object storage to Prometheus and use it there. This means that we need test if compactor produced block is compatible with Prometheus and if yes, with what version (aiming for just latest is fine). How to test that?
Proposed solution:
- As we clearly have problem mostly in our tests and only around tooling (injecting block), I would propose adding another TSDB dependency pinned to version used by Prometheus 2.0 to be sure that those blocks will be understandable by all Prometheus versions (Prometheus will rewrite those anyway), so we can run test smoothly. We potentially can still differentiate to
CreateBlock2_0vsCreateBlockdone by to different TSDBs. - Instead of for looping in makefile, for loop in code like here: https://github.com/improbable-eng/thanos/blob/bbd0b52490f0352d7e0fd7dabd5648dfc7717623/pkg/testutil/prometheus.go#L88 We might store supported version in code as well and add option to limit to certain version for some tests to make this explicit. That should remove all the concerns around caching, signal handling and different tests for different range of Prometheus versions. TBD: How to install Prometheus versions if list of those are in Golang code?
Any thoughts?
A quick idea to solve those problems:
We might hit golang using cache all the time, because changing some environment variable is not seen by caching logic and it can assume code being not changed, thus cache being used.
Could we execute go clean -testcache first for each loop?
Something is wrong with signal handling, as some tests are green, but actually should fail
Add or operator to check testing result:
THANOS_TEST_PROMETHEUS_PATH="prometheus-$$ver" THANOS_TEST_ALERTMANAGER_PATH="alertmanager-$(ALERTMANAGER_VERSION)" go test $(shell go list ./... | grep -v /vendor/ | grep -v /benchmark/) || exit 1;
Agree, but controlling with what version of Prometheus to run ALL tests might be not enough as some tests requires some versions by design.. But we could skip tests that has versions below of something (for code that does not need to be backward compatible) (: That would work.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
We still need that ^^ (:
This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.
Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.
I have some ideas that are slowly unblocked with the work on https://github.com/bwplotka/gobin (:
Otherwise no movement, but would be nice to have.
Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.
Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.
Closing for now as promised, let us know if you need this to be reopened! 🤗
Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.
Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.
Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.
@GiedriusS is this still needed? If needed should I give it a try?
@GiedriusS @bwplotka I would like to work on this issue can you assign it to me?