dubbo icon indicating copy to clipboard operation
dubbo copied to clipboard

[Bug] Metadata retry thread may cause no provider available

Open benko233 opened this issue 1 year ago • 6 comments

Pre-check

  • [X] I am sure that all the content I provide is in English.

Search before asking

  • [X] I had searched in the issues and found no similar issues.

Apache Dubbo Component

Java SDK (apache/dubbo)

Dubbo Version

OpenJDK1.8, Dubbo 3.2.14

Steps to reproduce this issue

When consumer failed to get meatadata from provider, consumer would use MetadataRetryExecutor to retry, which may cause all invocations of consumer failed.

Debug Info

  1. ServiceInstancesChangedListener uses MetadataRetryExecutor to refresh metadata when consumer lacks metadata of some instances. (ServiceInstancesChangedListener#doOnEvent) image

  2. ServiceDiscoveryRegistryDirectory refreshes routers' invokers synchronously while refreshes its invokers asynchronously. (ServiceDiscoveryRegistryDirectory #refreshInvoker) image image

  3. If metadata retry thread is interrupted, ServiceDiscoveryRegistryDirectory will fail to fresh its invokers. Meanwhile routers has refreshed invokers. image

  4. When consumer lists valid invokers for invocation, SingleRouterChain will throw an IllegalStateException because of the inconsistence of invokers. image

What you expected to happen

ServiceDiscoveryRegistryDirectory refreshes its invokers and router's consistently

Anything else

2024-07-24 20:34:08,273 [Dubbo-framework-metadata-retry-thread-1] ERROR ServiceInstancesChangedListener: -  [DUBBO] 1/2 revisions failed to get metadata from remote: 714595dd1fb0c9e0dd071fc9175f2386 837b5a0d0b7b8c414da38895d030453d , dubbo version: 3.2.14, current host: 7**********0, error code: 99-0. This may be caused by unknown error in registry module, go to https:/*************************0 to find instructions. 
2024-07-24 20:34:08,273 [Dubbo-framework-metadata-retry-thread-1] WARN  ServiceInstancesChangedListener: -  [DUBBO] Address refresh try task submitted, dubbo version: 3.2.14, current host: 7**********0, error code: 99-0. This may be caused by unknown error in registry module, go to https:/*************************0 to find instructions. 
2024-07-24 20:34:08,276 [Dubbo-framework-metadata-retry-thread-1] WARN  LockUtils: -  [DUBBO] Try to lock failed, dubbo version: 3.2.14, current host: 7**********0, error code: 99-0. This may be caused by , go to https:/*************************0 to find instructions. 
java.lang.InterruptedException: null
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1247) ~[?:1.8.0_252]
    at java.util.concurrent.locks.ReentrantLock.tryLock(ReentrantLock.java:442) ~[?:1.8.0_252]
    at org.apache.dubbo.common.utils.LockUtils.safeLock(LockUtils.java:34) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.directory.AbstractDirectory.setInvokers(AbstractDirectory.java:529) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.registry.client.ServiceDiscoveryRegistryDirectory.lambda$refreshInvoker$2(ServiceDiscoveryRegistryDirectory.java:373) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.RouterChain.setInvokers(RouterChain.java:170) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.directory.AbstractDirectory.refreshRouter(AbstractDirectory.java:477) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.registry.client.ServiceDiscoveryRegistryDirectory.refreshInvoker(ServiceDiscoveryRegistryDirectory.java:373) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.registry.client.ServiceDiscoveryRegistryDirectory.refreshOverrideAndInvoker(ServiceDiscoveryRegistryDirectory.java:219) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.registry.client.ServiceDiscoveryRegistryDirectory.notify(ServiceDiscoveryRegistryDirectory.java:211) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener.lambda$null$10(ServiceInstancesChangedListener.java:467) ~[dubbo-3.2.14.jar:3.2.14]
    at java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1597) ~[?:1.8.0_252]
    at org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener.lambda$notifyAddressChanged$11(ServiceInstancesChangedListener.java:458) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.metrics.event.MetricsEventBus.post(MetricsEventBus.java:76) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.metrics.event.MetricsEventBus.post(MetricsEventBus.java:59) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener.notifyAddressChanged(ServiceInstancesChangedListener.java:455) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener.doOnEvent(ServiceInstancesChangedListener.java:243) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener.onEvent(ServiceInstancesChangedListener.java:125) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener$AddressRefreshRetryTask.run(ServiceInstancesChangedListener.java:545) ~[dubbo-3.2.14.jar:3.2.14]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_252]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_252]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_252]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_252]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_252]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_252]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_252]
2024-07-24 20:34:08,277 [Dubbo-framework-metadata-retry-thread-1] WARN  LockUtils: -  [DUBBO] Try to lock failed, dubbo version: 3.2.14, current host: 7**********0, error code: 99-0. This may be caused by , go to https:/*************************0 to find instructions. 
java.lang.InterruptedException: null
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1247) ~[?:1.8.0_252]
    at java.util.concurrent.locks.ReentrantLock.tryLock(ReentrantLock.java:442) ~[?:1.8.0_252]
    at org.apache.dubbo.common.utils.LockUtils.safeLock(LockUtils.java:34) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.directory.AbstractDirectory.refreshInvoker(AbstractDirectory.java:414) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.registry.integration.DynamicDirectory.invokersChanged(DynamicDirectory.java:388) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.registry.client.ServiceDiscoveryRegistryDirectory.refreshInvoker(ServiceDiscoveryRegistryDirectory.java:386) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.registry.client.ServiceDiscoveryRegistryDirectory.refreshOverrideAndInvoker(ServiceDiscoveryRegistryDirectory.java:219) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.registry.client.ServiceDiscoveryRegistryDirectory.notify(ServiceDiscoveryRegistryDirectory.java:211) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener.lambda$null$10(ServiceInstancesChangedListener.java:467) ~[dubbo-3.2.14.jar:3.2.14]
    at java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1597) ~[?:1.8.0_252]
    at org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener.lambda$notifyAddressChanged$11(ServiceInstancesChangedListener.java:458) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.metrics.event.MetricsEventBus.post(MetricsEventBus.java:76) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.metrics.event.MetricsEventBus.post(MetricsEventBus.java:59) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener.notifyAddressChanged(ServiceInstancesChangedListener.java:455) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener.doOnEvent(ServiceInstancesChangedListener.java:243) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener.onEvent(ServiceInstancesChangedListener.java:125) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener$AddressRefreshRetryTask.run(ServiceInstancesChangedListener.java:545) ~[dubbo-3.2.14.jar:3.2.14]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_252]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_252]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_252]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_252]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_252]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_252]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_252]
2024-07-24 20:34:18,273 [Dubbo-framework-metadata-retry-thread-1] WARN  ServiceInstancesChangedListener: -  [DUBBO] Received address refresh retry event, 1721824448273, dubbo version: 3.2.14, current host: 7**********0, error code: 99-0. This may be caused by unknown error in registry module, go to https:/*************************0 to find instructions. 
2024-07-24 20:34:18,273 [Dubbo-framework-metadata-retry-thread-1] WARN  ServiceInstancesChangedListener: -  [DUBBO] Retrying address notification..., dubbo version: 3.2.14, current host: 7**********0, error code: 99-0. This may be caused by unknown error in registry module, go to https:/*************************0 to find instructions. 
2024-07-24 20:34:18,273 [Dubbo-framework-metadata-retry-thread-1] WARN  ServiceInstancesChangedListener: -  [DUBBO] Received address refresh retry event, 1721824448273, dubbo version: 3.2.14, current host: 7**********0, error code: 99-0. This may be caused by unknown error in registry module, go to https:/*************************0 to find instructions. 
2024-07-24 20:34:18,273 [Dubbo-framework-metadata-retry-thread-1] WARN  ServiceInstancesChangedListener: -  [DUBBO] Retrying address notification..., dubbo version: 3.2.14, current host: 7**********0, error code: 99-0. This may be caused by unknown error in registry module, go to https:/*************************0 to find instructions. 


2024-07-24 15:03:19,103 [http-nio-8080-exec-60] ERROR SingleRouterChain: -  [DUBBO] Reject to route, because the invokers has changed., dubbo version: 3.2.14, current host: 7**********0, error code: 99-0. This may be caused by , go to https:/*************************0 to find instructions. Router's invoker size: 2 Invocation's invoker size: 2
2024-07-24 15:03:19,103 [http-nio-8080-exec-60] ERROR DynamicDirectory: -  [DUBBO] Failed to execute router: zookeeper://********/org.apache.dubbo.registry.RegistryService?REGISTRY_CLUSTER=default&application=provider&backup=********&dubbo=2.0.2&executor-management-mode=isolation&file=d***************************************************************************************************************************************************************5, cause: reject to route, because the invokers has changed., dubbo version: 3.2.14, current host: 7**********0, error code: 2-1. This may be caused by , go to https:/************************1 to find instructions. 
java.lang.IllegalStateException: reject to route, because the invokers has changed.
    at org.apache.dubbo.rpc.cluster.SingleRouterChain.route(SingleRouterChain.java:147) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.registry.integration.DynamicDirectory.doList(DynamicDirectory.java:214) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.directory.AbstractDirectory.list(AbstractDirectory.java:232) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.support.AbstractClusterInvoker.list(AbstractClusterInvoker.java:452) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.support.AbstractClusterInvoker.invoke(AbstractClusterInvoker.java:355) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.router.RouterSnapshotFilter.invoke(RouterSnapshotFilter.java:46) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.filter.FilterChainBuilder$CopyOfFilterChainNode.invoke(FilterChainBuilder.java:349) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.monitor.support.MonitorFilter.invoke$original$5DvNMQFQ(MonitorFilter.java:108) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.monitor.support.MonitorFilter.invoke$original$5DvNMQFQ$accessor$oNbp53Mx(MonitorFilter.java) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.monitor.support.MonitorFilter$auxiliary$9bFbKVXI.call(Unknown Source) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.InstMethodsInter.intercept(InstMethodsInter.java:86) ~[skywalking-agent.jar:8.8.0]
    at org.apache.dubbo.monitor.support.MonitorFilter.invoke(MonitorFilter.java) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.filter.FilterChainBuilder$CopyOfFilterChainNode.invoke(FilterChainBuilder.java:349) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.filter.support.MetricsClusterFilter.invoke(MetricsClusterFilter.java:57) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.filter.FilterChainBuilder$CopyOfFilterChainNode.invoke(FilterChainBuilder.java:349) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.protocol.dubbo.filter.FutureFilter.invoke(FutureFilter.java:52) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.filter.FilterChainBuilder$CopyOfFilterChainNode.invoke(FilterChainBuilder.java:349) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.spring.security.filter.ContextHolderParametersSelectedTransferFilter.invoke(ContextHolderParametersSelectedTransferFilter.java:40) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.filter.FilterChainBuilder$CopyOfFilterChainNode.invoke(FilterChainBuilder.java:349) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.spring.security.filter.ContextHolderAuthenticationPrepareFilter.invoke(ContextHolderAuthenticationPrepareFilter.java:68) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.filter.FilterChainBuilder$CopyOfFilterChainNode.invoke(FilterChainBuilder.java:349) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.metrics.filter.MetricsFilter.invoke(MetricsFilter.java:86) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.filter.support.MetricsConsumerFilter.invoke(MetricsConsumerFilter.java:38) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.filter.FilterChainBuilder$CopyOfFilterChainNode.invoke(FilterChainBuilder.java:349) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.filter.support.ConsumerClassLoaderFilter.invoke(ConsumerClassLoaderFilter.java:40) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.filter.FilterChainBuilder$CopyOfFilterChainNode.invoke(FilterChainBuilder.java:349) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.filter.support.ConsumerContextFilter.invoke(ConsumerContextFilter.java:119) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.filter.FilterChainBuilder$CopyOfFilterChainNode.invoke(FilterChainBuilder.java:349) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.filter.FilterChainBuilder$CallbackRegistrationInvoker.invoke(FilterChainBuilder.java:197) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.support.wrapper.AbstractCluster$ClusterFilterInvoker.invoke(AbstractCluster.java:101) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.support.wrapper.MockClusterInvoker.invoke(MockClusterInvoker.java:106) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.cluster.support.wrapper.ScopeClusterInvoker.invoke(ScopeClusterInvoker.java:171) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.registry.client.migration.MigrationInvoker.invoke(MigrationInvoker.java:294) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.proxy.InvocationUtil.invoke(InvocationUtil.java:64) ~[dubbo-3.2.14.jar:3.2.14]
    at org.apache.dubbo.rpc.proxy.InvokerInvocationHandler.invoke(InvokerInvocationHandler.java:81) ~[dubbo-3.2.14.jar:3.2.14]

Are you willing to submit a pull request to fix on your own?

  • [ ] Yes I am willing to submit a pull request on my own!

Code of Conduct

benko233 avatar Jul 25 '24 07:07 benko233

Will the next invocation success after registry refresh finished?

AlbumenJ avatar Jul 29 '24 02:07 AlbumenJ

image

My question is why the process is interrupted?

AlbumenJ avatar Jul 29 '24 02:07 AlbumenJ

image

My question is why the process is interrupted?

I'm not sure whether the process is interrupted by next retry. image

My connection timeout is 3000ms. If it always fails to connect, each retry will take 3000+1000+3000+1000+3000+1000=12000ms. However, the retry period is 10000ms. image

benko233 avatar Jul 30 '24 06:07 benko233

image My question is why the process is interrupted?

I'm not sure whether the process is interrupted by next retry. image

My connection timeout is 3000ms. If it always fails to connect, each retry will take 3000+1000+3000+1000+3000+1000=12000ms. However, the retry period is 10000ms. image

Actually, this is a bug. The metadata retry thread calls retryFuture.cancel(true), which will interrupt itself and then throw InterruptedException when trying lock.

benko233 avatar Jul 31 '24 03:07 benko233

image My question is why the process is interrupted?

I'm not sure whether the process is interrupted by next retry. image My connection timeout is 3000ms. If it always fails to connect, each retry will take 3000+1000+3000+1000+3000+1000=12000ms. However, the retry period is 10000ms. image

Actually, this is a bug. The metadata retry thread calls retryFuture.cancel(true), which will interrupt itself and then throw InterruptedException when trying lock.

Do you have some thoughts on how we can fix this issue?

chickenlj avatar Aug 08 '24 01:08 chickenlj

it seems no multi thread makes flow clean and clear.

    private synchronized void doOnEvent(ServiceInstancesChangedEvent event) {
        getRemoteMetadata();
        
        int emptyNum = parseMetadata();
        if (emptyNum == allInstances.size()) {
            return;
        }

        notify();

        if (emptyNum > 0) {
            doOnEvent(event);
        }
    }

laywin avatar Aug 30 '24 07:08 laywin

image My question is why the process is interrupted?

I'm not sure whether the process is interrupted by next retry. image My connection timeout is 3000ms. If it always fails to connect, each retry will take 3000+1000+3000+1000+3000+1000=12000ms. However, the retry period is 10000ms. image

Actually, this is a bug. The metadata retry thread calls retryFuture.cancel(true), which will interrupt itself and then throw InterruptedException when trying lock.

Do you have some thoughts on how we can fix this issue?

Since the process of obtaining the meta information of the service instance does not support interruption, the interruption always occurs at this place, which may cause inconsistency between the router and the invoker in the directory, and ultimately cause the router to refuse routing.

Image

Image

So, can we support interruption when obtaining the meta information of the remote service instance to avoid the inconsistent invoker?

Image

marmot-z avatar Mar 03 '25 03:03 marmot-z

image My question is why the process is interrupted?

I'm not sure whether the process is interrupted by next retry. image My connection timeout is 3000ms. If it always fails to connect, each retry will take 3000+1000+3000+1000+3000+1000=12000ms. However, the retry period is 10000ms. image

Actually, this is a bug. The metadata retry thread calls retryFuture.cancel(true), which will interrupt itself and then throw InterruptedException when trying lock.

Do you have some thoughts on how we can fix this issue?

Since the process of obtaining the meta information of the service instance does not support interruption, the interruption always occurs at this place, which may cause inconsistency between the router and the invoker in the directory, and ultimately cause the router to refuse routing.

Image

Image

So, can we support interruption when obtaining the meta information of the remote service instance to avoid the inconsistent invoker?

Image

I think this issue has been fixed in #14730

AlbumenJ avatar Mar 04 '25 01:03 AlbumenJ

I understand. Thank you for your selfless dedication.

marmot-z avatar Mar 04 '25 02:03 marmot-z