
Unshipped blocks when out of order writes are enabled

Open AmerSelimovic opened this issue 1 year ago • 13 comments

Describe the bug

Unshipped blocks are shown in the cortex_ingester_oldest_unshipped_block_timestamp_seconds metric and are also visible in the ingester storage when out of order writes are enabled with the configuration introduced in #4964. Blocks are accumulating on the ingester as long as the config is set.

To Reproduce

  1. Start Cortex 1.15.2
  2. Allow out of order writes using the configuration parameters introduced in #4964
  3. Perform Write operations

Expected behavior

Expecting to see no unshipped blocks on the ingester and have the metric cortex_ingester_oldest_unshipped_block_timestamp_seconds at value 0.
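
The expected state above can be checked from an ingester's metrics output. The following is a minimal sketch, assuming the output is in the Prometheus text exposition format; the function names and the sample values are hypothetical and not part of Cortex.

```python
METRIC = "cortex_ingester_oldest_unshipped_block_timestamp_seconds"


def unshipped_block_timestamps(metrics_text, name=METRIC):
    """Return {label_string: value} for each sample of `name` in the text.

    Simplified parser: skips comment lines and assumes one sample per line.
    """
    samples = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or not line.startswith(name):
            continue
        rest = line[len(name):]
        # Reject longer metric names that merely share this prefix.
        if not rest or rest[0] not in "{ ":
            continue
        if rest.startswith("{"):
            end = rest.rindex("}")
            labels, value_part = rest[1:end], rest[end + 1:]
        else:
            labels, value_part = "", rest
        # First field after the name/labels is the value; a timestamp may follow.
        samples[labels] = float(value_part.split()[0])
    return samples


def has_unshipped_blocks(metrics_text):
    """True if any sample reports a non-zero oldest-unshipped timestamp."""
    return any(v > 0 for v in unshipped_block_timestamps(metrics_text).values())


# Made-up sample output, for illustration only.
sample = """\
# TYPE cortex_ingester_oldest_unshipped_block_timestamp_seconds gauge
cortex_ingester_oldest_unshipped_block_timestamp_seconds 1.6866432e+09
cortex_ingester_shipper_uploads_total 12
"""
bug_present = has_unshipped_blocks(sample)
```

A value of 0 for every sample corresponds to the expected behavior; any non-zero value means at least one block is still waiting to be shipped.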

Environment

  • Infrastructure: Kubernetes

  • Deployment tool: Helm

  • Cortex version 1.15.2

  • Chart version 2.1.0

Additional Context

Tested with following two combinations of configurations and they produced the same result.

out_of_order_time_window: 30m
out_of_order_cap_max: 32

and

out_of_order_time_window: 30m
out_of_order_cap_max: 32
skip_blocks_with_out_of_order_chunks_enabled: true

Metrics:

  • cortex_ingester_shipper_uploads_total shows that block uploads are being done

  • cortex_ingester_shipper_upload_failures_total does not show any failures

  • cortex_compactor_runs_completed_total shows that compactions are being done

  • cortex_compactor_runs_failed_total shows no failed compactions

Logs:

There are no errors in the Cortex component logs. The only logs that might be relevant are ingester logs about overlapping blocks, for example:

caller=compact.go:698 org_id=fake msg="Found overlapping blocks during compaction" ulid=01H2SSZF807FMM2FD52HFGA3N7

Alerts:

CortexIngesterHasUnshippedBlocks alert from cortex-jsonnet is triggered as there are unshipped blocks available.

AmerSelimovic avatar Jun 13 '23 08:06 AmerSelimovic

I was just about to raise this same bug

disambiguationuk avatar Jun 15 '23 09:06 disambiguationuk

We have tested various values for out_of_order_time_window including very short ones like 10m and longer ones like 2w and the problem persists.

danieljosephchambers avatar Jun 15 '23 09:06 danieljosephchambers

I will try to take a look at this this week!

cc @yeya24

alanprot avatar Jun 15 '23 21:06 alanprot

Is it because ingester shipper didn't upload compacted blocks? https://github.com/cortexproject/cortex/blob/master/pkg/ingester/ingester.go#L2031

yeya24 avatar Jun 19 '23 20:06 yeya24

Also raised https://github.com/thanos-io/thanos/issues/6462 on the Thanos side. I think making the shipper upload compacted blocks would work, but it might cause other issues (since we cannot tell whether a compacted block was generated by OOO or by something else).

yeya24 avatar Jun 21 '23 17:06 yeya24

Is there anything outstanding that's blocking the merge still?

disambiguationuk avatar Jul 11 '23 11:07 disambiguationuk

@disambiguationuk I think we can merge this now https://github.com/cortexproject/cortex/pull/5416, I just need to rebase and resolve conflicts.

With this change https://github.com/cortexproject/cortex/pull/5495/files#diff-e1032332627c413a3010c66b54b22b6e9835cf152fa339e40cf0b11204f7241fR2043 we should be able to upload dynamically

yeya24 avatar Aug 03 '23 17:08 yeya24

Any updates on this fix, is it still being worked on?

AmerSelimovic avatar Oct 20 '23 14:10 AmerSelimovic

Hi @AmerSelimovic, sorry for the delay. The fix should be ready but I want to see if I can verify it first in our testing environment. I should get it done this week.

And if you are willing to test some prebuilt image, it would be very helpful

yeya24 avatar Oct 24 '23 05:10 yeya24

@AmerSelimovic Actually I believe the bug is already fixed. If the tenant has OOO time window > 0 enabled, shipper should upload compacted blocks.

What we are trying to add in https://github.com/cortexproject/cortex/pull/5416 is to turn on/off shipper uploading compacted blocks dynamically in case OOO feature is enabled/disabled during runtime. If OOO is enabled when ingester starts, all blocks can be uploaded successfully.
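
To illustrate the behavior discussed here, the following is a rough sketch of a shipper-style filter. It assumes that a block with compaction level greater than 1 counts as "compacted" and that such blocks are skipped unless uploading compacted blocks is switched on; the names BlockMeta and blocks_to_ship and the block IDs are hypothetical, and this is not the actual Cortex or Thanos code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass
class BlockMeta:
    ulid: str              # block identifier (made-up values below)
    compaction_level: int  # 1 = cut straight from the head, >1 = compacted


def blocks_to_ship(
    blocks: Iterable[BlockMeta],
    already_shipped: Set[str],
    upload_compacted: Callable[[], bool],
) -> List[str]:
    """Return the ULIDs that this sync pass would upload.

    `upload_compacted` is a callable so the decision can change at runtime,
    for example when the OOO time window is switched on or off.
    """
    to_ship = []
    for block in blocks:
        if block.ulid in already_shipped:
            continue
        # Compacted blocks are skipped unless uploading them is enabled.
        # A skipped block stays on disk and keeps counting as unshipped.
        if block.compaction_level > 1 and not upload_compacted():
            continue
        to_ship.append(block.ulid)
    return to_ship


# Hypothetical tenant setting: OOO enabled when the window is > 0 seconds.
ooo_window_seconds = 30 * 60
blocks = [BlockMeta("A", 1), BlockMeta("B", 2), BlockMeta("C", 1)]
shipped = blocks_to_ship(blocks, {"C"}, lambda: ooo_window_seconds > 0)
```

With the callable returning False, block "B" would never be selected, which matches the accumulation described in the report.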

yeya24 avatar Oct 29 '23 05:10 yeya24

Hi @yeya24.

I'm not sure what you are saying fixed the reported bug, because the issue was also happening with out_of_order_time_window: 30m.

Do you think it is okay with this change? https://github.com/cortexproject/cortex/pull/5495/files#diff-e1032332627c413a3010c66b54b22b6e9835cf152fa339e40cf0b11204f7241fR2043

AmerSelimovic avatar Nov 09 '23 14:11 AmerSelimovic

The fix is to always upload compacted blocks in ingester so OOO compacted blocks can be uploaded to object store

yeya24 avatar Nov 09 '23 16:11 yeya24

Btw https://github.com/cortexproject/cortex/releases/tag/v1.16.0-rc.0 is out. Feel free to try it out and see if it fixes this issue

yeya24 avatar Nov 09 '23 16:11 yeya24

https://github.com/cortexproject/cortex/releases/tag/v1.17.0-rc.0 is out. It should address this issue completely as overlapped blocks will not be compacted by Prometheus anymore. Compactor will handle that.

yeya24 avatar Apr 25 '24 18:04 yeya24