
Datasource Autocompaction stuck 'Awaiting First Run'

gitmstoute opened this issue 3 years ago

Affected Version

0.22.1

Description

I'm opening a ticket after having no luck with my Druid Forum post: https://www.druidforum.org/t/datasource-autocompaction-awaiting-first-run/7087.

I’m trying to use autocompaction to change a datasource's segmentGranularity from ‘HOUR’ to ‘WEEK’. I’m testing with a small datasource (15 MB / ~400,000 rows) that has many segments (5,000+).

I submit a compaction config like:

{
  "dataSource": "my_datasource",
  "taskPriority": 25,
  "inputSegmentSizeBytes": 419430400,
  "skipOffsetFromLatest": "P1M",
  "tuningConfig": {
    "partitionsSpec": {
      "type": "dynamic",
      "maxRowsPerSegment": 5000000
    },
    "type": "index_parallel",
    "forceGuaranteedRollup": false
  },
  "granularitySpec": {
    "segmentGranularity": "WEEK"
  }
}
  • Both the web console and the API (e.g. a curl GET against /druid/coordinator/v1/config/compaction/) show that my compaction config has been set/accepted.
  • There are segments going back to 2021-05, so there are plenty of candidate segments to compact based on skipOffsetFromLatest.
  • There are 3 worker slots available, and the relevant compaction task slot settings are "compactionTaskSlotRatio": 0.7 and "maxCompactionTaskSlots": 2147483647.

I’m expecting autocompaction to start compacting based on this config.

Instead, I find that in the unified console, the datasource's Compaction column shows Awaiting first run and has been stuck like this for 24+ hours. Reviewing coordinator logs, I don't see any indication that compaction is even being attempted, and there are no errors. If I grep for "compact" in the coordinator logs, nothing is returned except a single line at startup indicating the coordinator is scheduling compaction tasks.
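
For anyone debugging the same thing, a few read-only checks can confirm what the coordinator has actually stored and decided (a sketch only; localhost:8081 is assumed to be the coordinator, my_datasource is the placeholder name from the config above, and the /compaction/status endpoint may not be present on every release):

# coordinator dynamic config, including compactionTaskSlotRatio / maxCompactionTaskSlots
curl -s http://localhost:8081/druid/coordinator/v1/config

# the stored per-datasource compaction config
curl -s http://localhost:8081/druid/coordinator/v1/config/compaction/my_datasource

# the coordinator's auto-compaction snapshot for the datasource, if the API is available
curl -s "http://localhost:8081/druid/coordinator/v1/compaction/status?dataSource=my_datasource"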

I made a branch in my fork with a docker-compose, ingestion spec, and compaction spec which seems to reproduce the issue. Steps:

  1. run it: docker-compose up
  2. submit the ingestion spec: curl -H 'Content-Type:application/json' -d "@ingestion-spec.json" -X POST http://localhost:8081/druid/indexer/v1/task
  3. Let the datasource become fully available
  4. submit the compaction spec: curl -H 'Content-Type:application/json' -d "@compaction-spec.json" -X POST http://localhost:8081/druid/coordinator/v1/config/compaction
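
After step 4, one way to tell whether the coordinator ever issues an auto-compaction task is to watch the task list (a sketch; my_datasource is a placeholder, the task-type filter is assumed to behave as documented, and 8081 is the same endpoint used in the steps above):

# auto-compaction tasks show up with type "compact"
curl -s "http://localhost:8081/druid/indexer/v1/tasks?type=compact"

# completed task history for the datasource, in case tasks ran and failed quickly
curl -s "http://localhost:8081/druid/indexer/v1/tasks?state=complete&datasource=my_datasource"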

I have tried many tweaks to the compaction config (e.g. a segmentGranularity of DAY instead of WEEK) but haven't been able to get compaction to start. There are no logs indicating that autocompaction tasks are being created or attempted. We have also tried restarting our cluster, and restarting the docker-compose does not resolve the issue either.

If there are any more details I can provide, please let me know!

gitmstoute avatar Feb 28 '22 16:02 gitmstoute

hey @gitmstoute, I had exactly the same problem with the same version. To unblock the situation, I deleted an old database entry in the druid_pendingsegments table that should not have remained:

druid_db=# select * from druid_pendingsegments;
-[ RECORD 1 ]--------------+--------------------------------------------------------------------------------------
id                         | xxx_2022-07-26T00:00:00.000Z_2022-07-26T01:00:00.000Z_2022-08-03T04:44:16.814Z_2427
datasource                 | xxx
created_date               | 2022-09-14T09:27:55.647Z
start                      | 2022-07-26T00:00:00.000Z
end                        | 2022-07-26T01:00:00.000Z
sequence_name              | index_kafka_xxx_44c72010954e750_0
sequence_prev_id           |
sequence_name_prev_id_sha1 | E6Exxx
payload                    | \xxxx

druid_db=# delete from druid_pendingsegments where id='xxx_2022-07-26T00:00:00.000Z_2022-07-26T01:00:00.000Z_2022-08-03T04:44:16.814Z_2427';
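
If editing the metadata store directly feels risky, the same cleanup can in principle be done through the Overlord's pending-segments API instead; a sketch only, with a placeholder datasource and interval, and the exact query-parameter handling should be checked against the Overlord API docs for your release:

# delete pending-segment records for a datasource within an interval (slash in the interval is URL-encoded)
curl -X DELETE "http://localhost:8081/druid/indexer/v1/pendingSegments/xxx?interval=2022-07-01T00:00:00.000Z%2F2022-08-01T00:00:00.000Z"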

baaaaast avatar Sep 23 '22 09:09 baaaaast

Thank you for the update @baaaaast! Sounds promising. I will investigate this and update!

gitmstoute avatar Sep 23 '22 13:09 gitmstoute

Exactly the same here, and also on the same version of Druid: 0.22.1.

linehrr avatar Nov 10 '22 14:11 linehrr

Thanks @linehrr, I was starting to think I was going crazy. @baaaaast, even with the docker start scripts (which should have an empty druid_pendingsegments table) I'm still unable to get autocompaction to kick off. Time is tight for investigating much further on my end.

gitmstoute avatar Nov 10 '22 16:11 gitmstoute

[screenshot: compaction status stuck at "Awaiting first run" in the console]

It basically looks like this, and it's just stuck there forever. Coordinator logs look fine. To me this looks like the segment compaction calculation/planning thread is stuck.
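
One way to check that theory is a thread dump of the coordinator JVM (a sketch; assumes a JDK with jps/jstack on the PATH and that the coordinator runs as its own Java process):

# find the coordinator process, then dump the duty-runner threads
jps -lm | grep -i coordinator
jstack <coordinator-pid> | grep -A 20 "Coordinator-Exec"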

linehrr avatar Nov 10 '22 18:11 linehrr

2022-11-10T18:51:32,702 ERROR [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Caught exception, ignoring so that schedule keeps going.: {class=org.apache.druid.server.coordinator.DruidCoordinator, exceptionType=class java.lang.RuntimeException, exceptionMessage=org.apache.druid.java.util.common.ISE: Error while killing pendingSegments of dataSource[Forecasting_geo] created until [2022-11-09T18:10:12.359Z]}
java.lang.RuntimeException: org.apache.druid.java.util.common.ISE: Error while killing pendingSegments of dataSource[Forecasting_geo] created until [2022-11-09T18:10:12.359Z]
    at org.apache.druid.client.indexing.HttpIndexingServiceClient.killPendingSegments(HttpIndexingServiceClient.java:415) ~[druid-server-0.22.1.jar:0.22.1]
    at org.apache.druid.server.coordinator.KillStalePendingSegments.run(KillStalePendingSegments.java:79) ~[druid-server-0.22.1.jar:0.22.1]
    at org.apache.druid.server.coordinator.DruidCoordinator$DutiesRunnable.run(DruidCoordinator.java:910) [druid-server-0.22.1.jar:0.22.1]
    at org.apache.druid.server.coordinator.DruidCoordinator$2.call(DruidCoordinator.java:720) [druid-server-0.22.1.jar:0.22.1]
    at org.apache.druid.server.coordinator.DruidCoordinator$2.call(DruidCoordinator.java:713) [druid-server-0.22.1.jar:0.22.1]
    at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$4.run(ScheduledExecutors.java:163) [druid-core-0.22.1.jar:0.22.1]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_212]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_212]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_212]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
Caused by: org.apache.druid.java.util.common.ISE: Error while killing pendingSegments of dataSource[Forecasting_geo] created until [2022-11-09T18:10:12.359Z]
    at org.apache.druid.client.indexing.HttpIndexingServiceClient.killPendingSegments(HttpIndexingServiceClient.java:403) ~[druid-server-0.22.1.jar:0.22.1]

This is something useful we just found: we saw this error log on the coordinator about failing to delete pendingSegments for a datasource. We then suspended the ingestion tasks and cleaned up the pendingSegments table, after which the coordinator was unblocked and started to issue the compaction tasks. I think this is very related to the issue; hopefully it provides a good clue for the investigation.

linehrr avatar Nov 10 '22 20:11 linehrr

OK, after we cleaned up the pending segments by removing the broken records from the table, the coordinator is now functioning and scheduling the auto-compaction. So I do think there is a bug somewhere: first, some pending segments can't be killed for some reason, and second, when that happens it blocks the coordinator from scheduling compaction.
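
For anyone hitting the same thing, the unblocking sequence was roughly: suspend the streaming ingestion, clean up druid_pendingsegments, then resume. A sketch of the API side only, with a placeholder supervisor id and assuming the Overlord API is reachable at localhost:8081:

# 1. suspend the supervisor so it stops allocating new pending segments
curl -X POST http://localhost:8081/druid/indexer/v1/supervisor/my_supervisor/suspend

# 2. remove the broken rows from druid_pendingsegments (SQL as above, or the Overlord
#    pendingSegments DELETE API), and confirm the table is clean for the datasource

# 3. resume the supervisor; the coordinator should begin scheduling compaction tasks again
curl -X POST http://localhost:8081/druid/indexer/v1/supervisor/my_supervisor/resume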

linehrr avatar Nov 11 '22 19:11 linehrr

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Dec 10 '23 00:12 github-actions[bot]

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

github-actions[bot] avatar Jan 08 '24 00:01 github-actions[bot]