
Datasource Autocompaction stuck 'Awaiting First Run'

gitmstoute opened this issue 3 years ago

Affected Version

0.22.1

Description

I'm opening a ticket after having no luck with my Druid Forum post: https://www.druidforum.org/t/datasource-autocompaction-awaiting-first-run/7087.

I’m trying to use autocompaction to change a datasource's segmentGranularity from ‘HOUR’ to ‘WEEK’. I’m testing with a small datasource (15 MB / ~400,000 rows) that has many segments (5,000+).

I submit a compaction config like:

{
  "dataSource": "my_datasource",
  "taskPriority": 25,
  "inputSegmentSizeBytes": 419430400,
  "skipOffsetFromLatest": "P1M",
  "tuningConfig": {
    "partitionsSpec": {
      "type": "dynamic",
      "maxRowsPerSegment": 5000000
    },
    "type": "index_parallel",
    "forceGuaranteedRollup": false
  },
  "granularitySpec": {
    "segmentGranularity": "WEEK"
  }
}
  • Both the web console and the API (e.g. a curl GET against /druid/coordinator/v1/config/compaction/) show that my compaction config has been set/accepted.
  • There are segments going back to 2021-05, so there are plenty of candidate segments to compact based on skipOffsetFromLatest.
  • There are 3 worker slots available, and the relevant compaction task slot settings are "compactionTaskSlotRatio": 0.7 and "maxCompactionTaskSlots": 2147483647.

I’m expecting autocompaction to start compacting based on this config.

Instead, I find that in the unified console, the datasource's Compaction column shows Awaiting first run and has been stuck like this for 24+ hours. Reviewing coordinator logs, I don't see any indication that compaction is even being attempted, and there are no errors. If I grep for "compact" in the coordinator logs, nothing is returned except a single line at startup indicating the coordinator is scheduling compaction tasks.
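
For anyone debugging the same thing, a few read-only checks can confirm what the coordinator has actually stored and decided (a sketch only; localhost:8081 is assumed to be the coordinator, my_datasource is the placeholder name from the config above, and the /compaction/status endpoint may not be present on every release):

# coordinator dynamic config, including compactionTaskSlotRatio / maxCompactionTaskSlots
curl -s http://localhost:8081/druid/coordinator/v1/config

# the stored per-datasource compaction config
curl -s http://localhost:8081/druid/coordinator/v1/config/compaction/my_datasource

# the coordinator's auto-compaction snapshot for the datasource, if the API is available
curl -s "http://localhost:8081/druid/coordinator/v1/compaction/status?dataSource=my_datasource"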

I made a branch in my fork with a docker-compose, ingestion spec, and compaction spec which seems to reproduce the issue. Steps:

  1. run it: docker-compose up
  2. submit the ingestion spec: curl -H 'Content-Type:application/json' -d "@ingestion-spec.json" -X POST http://localhost:8081/druid/indexer/v1/task
  3. Let the datasource become fully available
  4. submit the compaction spec: curl -H 'Content-Type:application/json' -d "@compaction-spec.json" -X POST http://localhost:8081/druid/coordinator/v1/config/compaction
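
After step 4, one way to tell whether the coordinator ever issues an auto-compaction task is to watch the task list (a sketch; my_datasource is a placeholder, the task-type filter is assumed to behave as documented, and 8081 is the same endpoint used in the steps above):

# auto-compaction tasks show up with type "compact"
curl -s "http://localhost:8081/druid/indexer/v1/tasks?type=compact"

# completed task history for the datasource, in case tasks ran and failed quickly
curl -s "http://localhost:8081/druid/indexer/v1/tasks?state=complete&datasource=my_datasource"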

I have tried many tweaks to the compaction config (e.g. a segmentGranularity of DAY instead of WEEK) but haven't been able to get compaction to start. There are no logs indicating that autocompaction tasks are being created or attempted. We have also tried restarting our cluster, and restarting the docker-compose does not resolve the issue either.

If there are any more details I can provide, please let me know!

gitmstoute avatar Feb 28 '22 16:02 gitmstoute

hey @gitmstoute, I had exactly the same problem with the same version. To unblock the situation, I deleted an old database entry in the druid_pendingsegments table that should not have remained:

druid_db=# select * from druid_pendingsegments;
-[ RECORD 1 ]--------------+--------------------------------------------------------------------------------------
id                         | xxx_2022-07-26T00:00:00.000Z_2022-07-26T01:00:00.000Z_2022-08-03T04:44:16.814Z_2427
datasource                 | xxx
created_date               | 2022-09-14T09:27:55.647Z
start                      | 2022-07-26T00:00:00.000Z
end                        | 2022-07-26T01:00:00.000Z
sequence_name              | index_kafka_xxx_44c72010954e750_0
sequence_prev_id           |
sequence_name_prev_id_sha1 | E6Exxx
payload                    | \xxxx

druid_db=# delete from druid_pendingsegments where id='xxx_2022-07-26T00:00:00.000Z_2022-07-26T01:00:00.000Z_2022-08-03T04:44:16.814Z_2427';
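
If editing the metadata store directly feels risky, the same cleanup can in principle be done through the Overlord's pending-segments API instead; a sketch only, with a placeholder datasource and interval, and the exact query-parameter handling should be checked against the Overlord API docs for your release:

# delete pending-segment records for a datasource within an interval (slash in the interval is URL-encoded)
curl -X DELETE "http://localhost:8081/druid/indexer/v1/pendingSegments/xxx?interval=2022-07-01T00:00:00.000Z%2F2022-08-01T00:00:00.000Z"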

baaaaast avatar Sep 23 '22 09:09 baaaaast

Thank you for the update @baaaaast! Sounds promising. I will investigate this and update!

gitmstoute avatar Sep 23 '22 13:09 gitmstoute

Exactly the same here, and also on the same version of Druid: 0.22.1.

linehrr avatar Nov 10 '22 14:11 linehrr

Thanks @linehrr, I was starting to think I was going crazy. @baaaaast, even with the docker start scripts (which should have an empty druid_pendingsegments table) I'm still unable to get autocompaction to kick off. Time is tight for investigating much further on my end.

gitmstoute avatar Nov 10 '22 16:11 gitmstoute

[screenshot: compaction status stuck at "Awaiting first run" in the console]

It basically looks like this, and it's just stuck there forever. Coordinator logs look fine. To me this looks like the segment compaction calculation/planning thread is stuck.
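
One way to check that theory is a thread dump of the coordinator JVM (a sketch; assumes a JDK with jps/jstack on the PATH and that the coordinator runs as its own Java process):

# find the coordinator process, then dump the duty-runner threads
jps -lm | grep -i coordinator
jstack <coordinator-pid> | grep -A 20 "Coordinator-Exec"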

linehrr avatar Nov 10 '22 18:11 linehrr

2022-11-10T18:51:32,702 ERROR [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Caught exception, ignoring so that schedule keeps going.: {class=org.apache.druid.server.coordinator.DruidCoordinator, exceptionType=class java.lang.RuntimeException, exceptionMessage=org.apache.druid.java.util.common.ISE: Error while killing pendingSegments of dataSource[Forecasting_geo] created until [2022-11-09T18:10:12.359Z]}
java.lang.RuntimeException: org.apache.druid.java.util.common.ISE: Error while killing pendingSegments of dataSource[Forecasting_geo] created until [2022-11-09T18:10:12.359Z]
    at org.apache.druid.client.indexing.HttpIndexingServiceClient.killPendingSegments(HttpIndexingServiceClient.java:415) ~[druid-server-0.22.1.jar:0.22.1]
    at org.apache.druid.server.coordinator.KillStalePendingSegments.run(KillStalePendingSegments.java:79) ~[druid-server-0.22.1.jar:0.22.1]
    at org.apache.druid.server.coordinator.DruidCoordinator$DutiesRunnable.run(DruidCoordinator.java:910) [druid-server-0.22.1.jar:0.22.1]
    at org.apache.druid.server.coordinator.DruidCoordinator$2.call(DruidCoordinator.java:720) [druid-server-0.22.1.jar:0.22.1]
    at org.apache.druid.server.coordinator.DruidCoordinator$2.call(DruidCoordinator.java:713) [druid-server-0.22.1.jar:0.22.1]
    at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$4.run(ScheduledExecutors.java:163) [druid-core-0.22.1.jar:0.22.1]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_212]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_212]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_212]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
Caused by: org.apache.druid.java.util.common.ISE: Error while killing pendingSegments of dataSource[Forecasting_geo] created until [2022-11-09T18:10:12.359Z]
    at org.apache.druid.client.indexing.HttpIndexingServiceClient.killPendingSegments(HttpIndexingServiceClient.java:403) ~[druid-server-0.22.1.jar:0.22.1]

This is something useful we just found: we saw this error log on the coordinator about failing to delete pendingSegments for a datasource. We then suspended the ingestion tasks and cleaned up the pendingSegments table, after which the coordinator was unblocked and started to issue the compaction tasks. I think this is very related to the issue; hopefully it provides a good clue for the investigation.

linehrr avatar Nov 10 '22 20:11 linehrr

OK, after we cleaned up the pending segments by removing the broken records from the table, the coordinator is now functioning and scheduling the auto-compaction. So I do think there is a bug somewhere: first, some pending segments can't be killed for some reason, and second, when that happens it blocks the coordinator from scheduling compaction.
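
For anyone hitting the same thing, the unblocking sequence was roughly: suspend the streaming ingestion, clean up druid_pendingsegments, then resume. A sketch of the API side only, with a placeholder supervisor id and assuming the Overlord API is reachable at localhost:8081:

# 1. suspend the supervisor so it stops allocating new pending segments
curl -X POST http://localhost:8081/druid/indexer/v1/supervisor/my_supervisor/suspend

# 2. remove the broken rows from druid_pendingsegments (SQL as above, or the Overlord
#    pendingSegments DELETE API), and confirm the table is clean for the datasource

# 3. resume the supervisor; the coordinator should begin scheduling compaction tasks again
curl -X POST http://localhost:8081/druid/indexer/v1/supervisor/my_supervisor/resume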

linehrr avatar Nov 11 '22 19:11 linehrr

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Dec 10 '23 00:12 github-actions[bot]

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

github-actions[bot] avatar Jan 08 '24 00:01 github-actions[bot]