
Slow UI/Graphql endpoints after upgrade to v1.1.0

Open moenesbs opened this issue 5 months ago • 10 comments

Hello, we are running DataHub on EKS on AWS, with OpenSearch and AWS MSK.

After upgrading from v0.15.0 to v1.1.0 (I went to v1.0.0 first), I noticed that the UI takes ages to load. Looking at the network calls, I can see that it is mainly the GraphQL calls that are slow.

I suspected that this could be linked to OpenSearch. The first thing I noticed is that the number of searchable documents in OpenSearch tripled after running the systemUpdate pod. Is this normal? Could this increase cause the slowness?

Thanks

moenesbs avatar Jul 07 '25 21:07 moenesbs

hey @moenesbs! thanks for raising this concern. my first question for you is are you running the new UI or are you still on the old UI after your upgrade? also, do you know anything more about the number of documents in opensearch that increased? any particular entity index for example?
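One possible way to narrow down which index grew is to compare two snapshots of OpenSearch's `GET _cat/indices?format=json` output, one taken before and one after the systemUpdate run. The sketch below is a minimal illustration, not a DataHub tool: the helper name `doc_count_changes` is hypothetical, and the index names and counts in the sample data are invented for the example rather than taken from this issue.

```python
import json


def doc_count_changes(before, after):
    """Return (index, old_count, new_count, delta) tuples, largest growth first.

    Both arguments are parsed `_cat/indices?format=json` responses: lists of
    dicts in which "docs.count" is a string (or missing/None for closed indices).
    """
    def counts(rows):
        return {row["index"]: int(row.get("docs.count") or 0) for row in rows}

    old, new = counts(before), counts(after)
    deltas = [
        (name, old.get(name, 0), new[name], new[name] - old.get(name, 0))
        for name in new
    ]
    return sorted(deltas, key=lambda item: item[3], reverse=True)


# Illustrative snapshots only (made-up numbers, assumed index names).
before = json.loads(
    '[{"index": "datasetindex_v2", "docs.count": "1000"},'
    ' {"index": "dashboardindex_v2", "docs.count": "200"}]'
)
after = json.loads(
    '[{"index": "datasetindex_v2", "docs.count": "3000"},'
    ' {"index": "dashboardindex_v2", "docs.count": "200"},'
    ' {"index": "some_new_index", "docs.count": "50"}]'
)

for name, old_count, new_count, delta in doc_count_changes(before, after):
    print(f"{name}: {old_count} -> {new_count} ({delta:+d})")
```

Sorting by absolute growth puts the index responsible for most of the increase at the top, which is the detail asked about here.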

chriscollins3456 avatar Jul 09 '25 21:07 chriscollins3456

@chriscollins3456 I can confirm it. After upgrade to 1.1.0 and switching to the new UI rendering of the home page takes almost infinite time. Upgrading to 1.2.0 has not resolved the problem. I see a lot of 502 from graphql. We deploy DataHub on GKE with Postgres Cloud SQL as the backend DB. I've checked the resource utilization metrics for the frontend, GMS, and Cloud SQL: everything is at about 25-30% (CPU and RAM).


Linux-oiD avatar Aug 07 '25 11:08 Linux-oiD

I see a lot of 502 from graphql.

@Linux-oiD interesting - so are those 502 errors timeout errors for you or some other sort of server errors? if they're timeouts that would seem to match the idea of this github issue here. if they're other server related issues then that might just be a problem with your upgrade and would require checking out the logs of GMS. let me know once you know!

After upgrade to 1.1.0 and switching to the new UI rendering of the home page

if you turn off the new UI do you still see this issue? or is this always an issue after your upgrade?

chriscollins3456 avatar Aug 07 '25 14:08 chriscollins3456

@chriscollins3456 yes. It's a timeout. There are no additional errors in the GMS log. Switching back to the old UI helps.

Linux-oiD avatar Aug 07 '25 15:08 Linux-oiD

@Linux-oiD Would you mind attaching the gms logs? 502 should leave a trace somewhere. Thanks!

benjiaming avatar Aug 13 '25 16:08 benjiaming

I'm also facing the same issue. We've deployed DataHub on our cluster, and after upgrading to the latest version, GMS crashes after a while when browsing the UI:

2025-08-29 13:18:03,528 [qtp393476856-184] ERROR i.d.o.c.GlobalControllerExceptionHandler:148 - Unhandled exception occurred for request: /api/graphql
org.springframework.web.context.request.async.AsyncRequestTimeoutException: null
        at org.springframework.web.context.request.async.TimeoutDeferredResultProcessingInterceptor.handleTimeout(TimeoutDeferredResultProcessingInterceptor.java:42)
        at org.springframework.web.context.request.async.DeferredResultInterceptorChain.triggerAfterTimeout(DeferredResultInterceptorChain.java:81)
        at org.springframework.web.context.request.async.WebAsyncManager.lambda$startDeferredResultProcessing$5(WebAsyncManager.java:434)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
        at org.springframework.web.context.request.async.StandardServletAsyncWebRequest.onTimeout(StandardServletAsyncWebRequest.java:186)
        at org.eclipse.jetty.ee10.servlet.ServletChannelState$2.run(ServletChannelState.java:761)
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.run(ContextHandler.java:1518)
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.run(ContextHandler.java:1511)
        at org.eclipse.jetty.ee10.servlet.ServletChannelState.runInContext(ServletChannelState.java:1308)
        at org.eclipse.jetty.ee10.servlet.ServletChannelState.onTimeout(ServletChannelState.java:780)
        at org.eclipse.jetty.ee10.servlet.ServletChannel.handle(ServletChannel.java:448)
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.run(ContextHandler.java:1524)
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.lambda$execute$0(ContextHandler.java:1541)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:981)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1211)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1166)
        at java.base/java.lang.Thread.run(Thread.java:840)
2025-08-29 13:18:03,527 [qtp393476856-193] ERROR i.d.o.c.GlobalControllerExceptionHandler:148 - Unhandled exception occurred for request: /api/graphql
org.springframework.web.context.request.async.AsyncRequestTimeoutException: null
        at org.springframework.web.context.request.async.TimeoutDeferredResultProcessingInterceptor.handleTimeout(TimeoutDeferredResultProcessingInterceptor.java:42)
        at org.springframework.web.context.request.async.DeferredResultInterceptorChain.triggerAfterTimeout(DeferredResultInterceptorChain.java:81)
        at org.springframework.web.context.request.async.WebAsyncManager.lambda$startDeferredResultProcessing$5(WebAsyncManager.java:434)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
        at org.springframework.web.context.request.async.StandardServletAsyncWebRequest.onTimeout(StandardServletAsyncWebRequest.java:186)
        at org.eclipse.jetty.ee10.servlet.ServletChannelState$2.run(ServletChannelState.java:761)
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.run(ContextHandler.java:1518)
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.run(ContextHandler.java:1511)
        at org.eclipse.jetty.ee10.servlet.ServletChannelState.runInContext(ServletChannelState.java:1308)
        at org.eclipse.jetty.ee10.servlet.ServletChannelState.onTimeout(ServletChannelState.java:780)
        at org.eclipse.jetty.ee10.servlet.ServletChannel.handle(ServletChannel.java:448)
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.run(ContextHandler.java:1524)
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.lambda$execute$0(ContextHandler.java:1541)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:981)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1211)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1166)
        at java.base/java.lang.Thread.run(Thread.java:840)
2025-08-29 13:18:03,536 [qtp393476856-196] ERROR i.d.o.c.GlobalControllerExceptionHandler:148 - Unhandled exception occurred for request: /api/graphql
org.springframework.web.context.request.async.AsyncRequestTimeoutException: null
        at org.springframework.web.context.request.async.TimeoutDeferredResultProcessingInterceptor.handleTimeout(TimeoutDeferredResultProcessingInterceptor.java:42)
        at org.springframework.web.context.request.async.DeferredResultInterceptorChain.triggerAfterTimeout(DeferredResultInterceptorChain.java:81)
        at org.springframework.web.context.request.async.WebAsyncManager.lambda$startDeferredResultProcessing$5(WebAsyncManager.java:434)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
        at org.springframework.web.context.request.async.StandardServletAsyncWebRequest.onTimeout(StandardServletAsyncWebRequest.java:186)
        at org.eclipse.jetty.ee10.servlet.ServletChannelState$2.run(ServletChannelState.java:761)
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.run(ContextHandler.java:1518)
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.run(ContextHandler.java:1511)
        at org.eclipse.jetty.ee10.servlet.ServletChannelState.runInContext(ServletChannelState.java:1308)
        at org.eclipse.jetty.ee10.servlet.ServletChannelState.onTimeout(ServletChannelState.java:780)
        at org.eclipse.jetty.ee10.servlet.ServletChannel.handle(ServletChannel.java:448)
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.run(ContextHandler.java:1524)
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.lambda$execute$0(ContextHandler.java:1541)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:981)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1211)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1166)
        at java.base/java.lang.Thread.run(Thread.java:840)
2025-08-29 13:18:03,531 [qtp393476856-188] ERROR i.d.o.c.GlobalControllerExceptionHandler:148 - Unhandled exception occurred for request: /api/graphql
org.springframework.web.context.request.async.AsyncRequestTimeoutException: null
        at org.springframework.web.context.request.async.TimeoutDeferredResultProcessingInterceptor.handleTimeout(TimeoutDeferredResultProcessingInterceptor.java:42)
        at org.springframework.web.context.request.async.DeferredResultInterceptorChain.triggerAfterTimeout(DeferredResultInterceptorChain.java:81)
        at org.springframework.web.context.request.async.WebAsyncManager.lambda$startDeferredResultProcessing$5(WebAsyncManager.java:434)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
        at org.springframework.web.context.request.async.StandardServletAsyncWebRequest.onTimeout(StandardServletAsyncWebRequest.java:186)
        at org.eclipse.jetty.ee10.servlet.ServletChannelState$2.run(ServletChannelState.java:761)
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.run(ContextHandler.java:1518)
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.run(ContextHandler.java:1511)
        at org.eclipse.jetty.ee10.servlet.ServletChannelState.runInContext(ServletChannelState.java:1308)
        at org.eclipse.jetty.ee10.servlet.ServletChannelState.onTimeout(ServletChannelState.java:780)
        at org.eclipse.jetty.ee10.servlet.ServletChannel.handle(ServletChannel.java:448)
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.run(ContextHandler.java:1524)
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.lambda$execute$0(ContextHandler.java:1541)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:981)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1211)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1166)
        at java.base/java.lang.Thread.run(Thread.java:840)
2025-08-29 13:18:03,538 [qtp393476856-178] ERROR i.d.o.c.GlobalControllerExceptionHandler:148 - Unhandled exception occurred for request: /api/graphql
org.springframework.web.context.request.async.AsyncRequestTimeoutException: null
        at org.springframework.web.context.request.async.TimeoutDeferredResultProcessingInterceptor.handleTimeout(TimeoutDeferredResultProcessingInterceptor.java:42)
        at org.springframework.web.context.request.async.DeferredResultInterceptorChain.triggerAfterTimeout(DeferredResultInterceptorChain.java:81)
        at org.springframework.web.context.request.async.WebAsyncManager.lambda$startDeferredResultProcessing$5(WebAsyncManager.java:434)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
        at org.springframework.web.context.request.async.StandardServletAsyncWebRequest.onTimeout(StandardServletAsyncWebRequest.java:186)
        at org.eclipse.jetty.ee10.servlet.ServletChannelState$2.run(ServletChannelState.java:761)
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.run(ContextHandler.java:1518)
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.run(ContextHandler.java:1511)
        at org.eclipse.jetty.ee10.servlet.ServletChannelState.runInContext(ServletChannelState.java:1308)
        at org.eclipse.jetty.ee10.servlet.ServletChannelState.onTimeout(ServletChannelState.java:780)
        at org.eclipse.jetty.ee10.servlet.ServletChannel.handle(ServletChannel.java:448)
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.run(ContextHandler.java:1524)
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.lambda$execute$0(ContextHandler.java:1541)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:981)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1211)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1166)
        at java.base/java.lang.Thread.run(Thread.java:840)
...
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "BatchSpanProcessor_WorkerThread-1"
2025/08/29 13:21:11 Received signal: terminated
2025/08/29 13:21:21 Killing command due to timeout.

No other relevant logs show up (I've checked the Elasticsearch pod as well).

It works for a few minutes, then it freezes (I guess waiting on some async calls to complete) and eventually restarts. CPU and RAM usage go up as well (~4300m, 1.6G). I've raised the limits of the pods, but that didn't help.

When I switch to the old UI, the problem does not appear.

petros94 avatar Aug 29 '25 13:08 petros94

Hey team! Any progress with this issue?

Linux-oiD avatar Sep 30 '25 09:09 Linux-oiD

1.3.0 - still same performance issue.

Linux-oiD avatar Oct 20 '25 15:10 Linux-oiD

I've been encountering the same issue over the past few months. The slow performance has pretty much made datahub unusable.

trau-sca avatar Nov 06 '25 20:11 trau-sca

We've faced a performance issue on the Glossary page. DataHub's glossaryV2 React components use the getRootGlossaryTerms/getRootGlossaryNodes queries with the rootGlossaryNodeWithFourLayers fragment. On our setup this query took a long time and sometimes timed out.

We rewrote the GraphQL query to fetch only one layer and dropped the description from the returned fields (the descriptions added no value for end users). The nested layers weren't used for rendering either, because expanding a node triggers a new request for that node's content anyway.

Another issue relates to the Domains page. There is no paging there like on the Glossary page, and we have plenty of domains at a single level of the tree. This request also timed out from time to time. The query uses the parentDomainsFields fragment for all domains, but many of the resulting fields are not used for rendering the page. So we rewrote the query to use a newly created parentDomainsFieldsForList fragment with the following definition:


fragment parentDomainsFieldsForList on ParentDomainsResult {
    count
    domains {
        urn
        type
        ... on Domain {
            displayProperties {
                ...displayPropertiesFields
            }
            properties {
                name
            }
        }
    }
}
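To check whether a trimmed query like this helps on a given deployment, one could time the original and the rewritten query against the GraphQL endpoint and compare medians. This is a rough sketch under assumptions: `http_sender` and `median_latency` are hypothetical helper names, the URL and token are placeholders, and bearer-token authentication is assumed and may not match your setup.

```python
import json
import statistics
import time
import urllib.request


def http_sender(url, token):
    """Build a send(payload) callable that POSTs a JSON body to `url`."""
    def send(payload):
        request = urllib.request.Request(
            url,
            data=json.dumps(payload).encode("utf-8"),  # a body makes this a POST
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {token}",  # assumed auth scheme
            },
        )
        with urllib.request.urlopen(request, timeout=60) as response:
            return json.load(response)
    return send


def median_latency(send, payload, runs=5):
    """Call send(payload) `runs` times and return the median wall time in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        send(payload)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Example usage (placeholders, not run here):
# send = http_sender("https://<your-host>/api/graphql", "<token>")
# slow = median_latency(send, {"query": ORIGINAL_QUERY})
# fast = median_latency(send, {"query": TRIMMED_QUERY})
# print(f"original: {slow:.2f}s, trimmed: {fast:.2f}s")
```

Passing the sender in as a callable keeps the timing logic independent of the transport, so the same function can be exercised with a stub before pointing it at a real instance.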

gallyamb avatar Nov 27 '25 19:11 gallyamb