
CompactSegments errors after upgrade to 24.0

Open gianm opened this issue 3 years ago • 3 comments

Reported in Slack here: https://apachedruidworkspace.slack.com/archives/C0309C9L90D/p1663633728125109. Logs reproduced below.

I think what is happening here is that the changes in #12404 cause the task type returned by the Overlord tasks API to be null until the metadata store update finishes in the background. (See SQLMetadataStorageActionHandler#populateTaskTypeAndGroupId.) The update happens automatically when the Overlord starts up, logging messages like:

Populate fields task and group_id of task entry table [%s] from payload

And:

Task migration for table [druid_tasks] successful

During this time, the Coordinator is unable to properly filter out non-compaction tasks from the task list. It tries to deserialize non-compaction tasks as ClientTaskQuery, which it can't do. (That class only supports a subset of task types.)

I believe this situation would resolve on its own once the metadata store update is complete.
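To illustrate the failure mode, here is a minimal sketch (not actual Druid or Jackson code) of how polymorphic type-id resolution rejects a task type that isn't registered as a subtype. `ClientTaskQuery` only knows the `compact` and `kill` type ids, so a payload tagged `index_kafka` cannot be resolved, which is exactly the `InvalidTypeIdException` in the logs below:

```java
import java.util.Set;

// Simplified stand-in for Jackson's subtype registry on ClientTaskQuery.
// The real resolution happens via @JsonTypeInfo/@JsonSubTypes annotations;
// this sketch only mirrors the known-type-ids check from the error message.
public class TypeIdResolution {
    static final Set<String> KNOWN_CLIENT_TASK_TYPES = Set.of("compact", "kill");

    // Returns true when the type id maps to a registered ClientTaskQuery subtype.
    static boolean canDeserializeAsClientTaskQuery(String typeId) {
        return KNOWN_CLIENT_TASK_TYPES.contains(typeId);
    }

    // Mimics the deserializer: unknown type ids fail hard instead of being skipped.
    static void resolveOrThrow(String typeId) {
        if (!canDeserializeAsClientTaskQuery(typeId)) {
            throw new IllegalArgumentException(
                "Could not resolve type id '" + typeId
                + "' as a subtype of ClientTaskQuery, known type ids = "
                + KNOWN_CLIENT_TASK_TYPES);
        }
    }

    public static void main(String[] args) {
        resolveOrThrow("compact"); // resolves fine
        try {
            resolveOrThrow("index_kafka"); // mirrors the Coordinator's failure
        } catch (IllegalArgumentException e) {
            System.out.println("Failed: " + e.getMessage());
        }
    }
}
```

This is why the filtering step matters: if the Coordinator cannot tell which tasks are compaction tasks (because the type field is null mid-migration), it ends up feeding non-compaction payloads into this narrow deserializer.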

Logs:

2022-09-19T23:08:35,342 ERROR [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Caught exception, ignoring so that schedule keeps going.: {class=org.apache.druid.server.coordinator.DruidCoordinator, exceptionType=class java.lang.RuntimeException, exceptionMessage=com.fasterxml.jackson.databind.exc.InvalidTypeIdException: Please make sure to load all the necessary extensions and jars with type 'index_kafka' on 'druid/coordinator' service. Could not resolve type id 'index_kafka' as a subtype of `org.apache.druid.client.indexing.ClientTaskQuery` known type ids = [compact, kill] (for POJO property 'payload')
 at [Source: (String)"{"task":"index_kafka_druid_v2_30a330c471e9073_afjlnglj","payload":{"type":"index_kafka","id":"index_kafka_druid_v2_30a330c471e9073_afjlnglj","resource":{"availabilityGroup":"index_kafka_druid_v2_30a330c471e9073","requiredCapacity":1},"dataSchema":{"dataSource":"druid_v2","timestampSpec":{"column":"timestamp","format":"millis","missingValue":null},"dimensionsSpec":{"dimensions":[{"type":"string","name":"companyID","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},{"type":"string","nam"[truncated 5767 chars]; line: 1, column: 75] (through reference chain: org.apache.druid.client.indexing.TaskPayloadResponse["payload"])}
java.lang.RuntimeException: com.fasterxml.jackson.databind.exc.InvalidTypeIdException: Please make sure to load all the necessary extensions and jars with type 'index_kafka' on 'druid/coordinator' service. Could not resolve type id 'index_kafka' as a subtype of `org.apache.druid.client.indexing.ClientTaskQuery` known type ids = [compact, kill] (for POJO property 'payload')
 at [Source: (String)"{"task":"index_kafka_druid_v2_30a330c471e9073_afjlnglj","payload":{"type":"index_kafka","id":"index_kafka_druid_v2_30a330c471e9073_afjlnglj","resource":{"availabilityGroup":"index_kafka_druid_v2_30a330c471e9073","requiredCapacity":1},"dataSchema":{"dataSource":"druid_v2","timestampSpec":{"column":"timestamp","format":"millis","missingValue":null},"dimensionsSpec":{"dimensions":[{"type":"string","name":"companyID","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},{"type":"string","nam"[truncated 5767 chars]; line: 1, column: 75] (through reference chain: org.apache.druid.client.indexing.TaskPayloadResponse["payload"])
	at org.apache.druid.client.indexing.HttpIndexingServiceClient.getTaskPayload(HttpIndexingServiceClient.java:351) ~[druid-server-24.0.0.jar:24.0.0]
	at org.apache.druid.server.coordinator.duty.CompactSegments.run(CompactSegments.java:143) ~[druid-server-24.0.0.jar:24.0.0]
	at org.apache.druid.server.coordinator.DruidCoordinator$DutiesRunnable.run(DruidCoordinator.java:948) ~[druid-server-24.0.0.jar:24.0.0]
	at org.apache.druid.server.coordinator.DruidCoordinator$2.call(DruidCoordinator.java:721) ~[druid-server-24.0.0.jar:24.0.0]
	at org.apache.druid.server.coordinator.DruidCoordinator$2.call(DruidCoordinator.java:714) ~[druid-server-24.0.0.jar:24.0.0]
	at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$4.run(ScheduledExecutors.java:163) ~[druid-core-24.0.0.jar:24.0.0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_275]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_275]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_275]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_275]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_275]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_275]
	at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_275]

gianm avatar Sep 21 '22 00:09 gianm

@AmatyaAvadhanula there is a comment in TaskIdentifierMapper:

        // If field is absent (older task version), use blank string to avoid a loop of migration of such tasks.

I wonder if there is some other way to solve the problem mentioned in this comment, so we can include the task type even before the migration is done? That'd fix the Coordinator problem. What do you think?
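One possible shape of such a fix, sketched hypothetically (this is not actual Druid code, and `TaskStatusPlus` here is a stand-in record, not the real class): filter on the task type string returned by the Overlord before ever fetching and deserializing a payload, skipping entries whose type is still null because migration hasn't reached them:

```java
import java.util.List;

// Hypothetical sketch of Coordinator-side defensive filtering: only task
// entries whose type is known to be "compact" are deserialized as
// ClientTaskQuery; null types (pre-migration rows) and other task types
// are skipped rather than causing an InvalidTypeIdException.
public class CompactionTaskFilter {
    // Stand-in for the task summary the Overlord tasks API returns.
    record TaskStatusPlus(String id, String type) {}

    static List<TaskStatusPlus> compactTasksOnly(List<TaskStatusPlus> tasks) {
        return tasks.stream()
            // null-safe comparison: a null type (migration in progress) is skipped
            .filter(t -> "compact".equals(t.type()))
            .toList();
    }

    public static void main(String[] args) {
        List<TaskStatusPlus> tasks = List.of(
            new TaskStatusPlus("compact_abc", "compact"),
            new TaskStatusPlus("index_kafka_xyz", "index_kafka"),
            new TaskStatusPlus("old_task", null)); // type not yet migrated
        System.out.println(compactTasksOnly(tasks));
    }
}
```

The trade-off is that compaction tasks whose type hasn't been populated yet would be temporarily invisible to the Coordinator, which is still better than the duty aborting with an exception.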

gianm avatar Sep 21 '22 00:09 gianm

@gianm

// If field is absent (older task version), use blank string to avoid a loop of migration of such tasks.

This is applicable only to the migration of the tasks table.

SQLMetadataStorageActionHandler#getTaskStatusList processes task queries separately before and after migration, and is meant to work during the update as well.

AmatyaAvadhanula avatar Sep 21 '22 03:09 AmatyaAvadhanula

I see; in that case, my suspected explanation may not be what's actually happening.

gianm avatar Sep 21 '22 05:09 gianm