🐛 Bug Report: v1.30.4 IdentityAuthInjectorFetchMiddleware fetch request timeout when 14k user entities are loaded into the system

Open CiscoRob opened this issue 1 year ago • 36 comments

📜 Description

After updating Backstage from 1.29.2 to 1.30.4 and running with a limited set of plugins, the /api/catalog queries and auth /refresh calls (using the guest module) initially succeed, but once the catalog is called again every fetch request times out.

I began experimenting by commenting out the plugins imported in packages/backend and removing plugins from the packages/app routing and from the entity details pages and tabs.

I removed the user catalog entry point so that only ~9k entities were loaded into the system, running with the local default configuration via yarn dev.

The site then responded without issue: no problems with any requests to the catalog API, the IdentityAuthInjectorFetchMiddleware, or the PermissionClient.

It seems that once the number of user entities reaches a certain threshold, Backstage fails on every request and is functionally frozen. Even ctrl-c in the terminal does not stop the server.

I've been spelunking the Backstage source in v1.30.4 and most of the code around where it's failing hasn't been touched in some time. I'm still working through it, but prior to removing backend-common and other libraries for the 1.30.4 migration this problem did not exist.

This behavior exists both in in-memory local testing scenarios and when deployed with Postgres / Redis resources. It's starting to smell like a bad lookup, a blocked promise, or maybe a throttling library.

👍 Expected behavior

The Backstage site should not time out on every request after loading 21k total entities, 14k of which are user entities. Running locally with the in-memory database should not perform this poorly.

We're seeing the same behavior in our deployed beta environment, which is backed by a Postgres database and Redis.

👎 Actual Behavior with Screenshots

Screenshot 2024-09-12 at 7 19 44 PM

Screenshot 2024-09-12 at 7 35 35 PM

👟 Reproduction steps

Point Backstage at a catalog of 500+ teams, 14k users, a few thousand resources, a few thousand services and systems, and a dozen or so domains.

Performance degrades to the point where the site is not usable and is functionally blocked until request timeouts unblock it.

📃 Provide the context for the Bug.

Unable to use Backstage. The site is non-responsive when loading a catalog of static entity definitions. It functioned without issue in 1.29.2, but after migrating to 1.30.4 the site is crippled loading the same catalog. Examining performance monitors, there doesn't seem to be a big change in JS heap or any obvious blocking.

🖥️ Your Environment

OS: Darwin 23.6.0 - darwin/arm64
node: v18.20.4
yarn: 4.4.0
cli: 0.27.0 (installed)
backstage: 1.30.4

Dependencies: @backstage/app-defaults 1.5.10 @backstage/backend-app-api 0.8.0, 0.9.3 @backstage/backend-common 0.23.3, 0.24.1 @backstage/backend-defaults 0.4.1, 0.4.4 @backstage/backend-dev-utils 0.1.4, 0.1.5 @backstage/backend-openapi-utils 0.1.17 @backstage/backend-plugin-api 0.7.0, 0.8.1 @backstage/backend-tasks 0.5.27, 0.6.1 @backstage/backend-test-utils 0.4.4, 0.5.1 @backstage/catalog-client 1.6.5, 1.6.6 @backstage/catalog-model 1.5.0, 1.6.0 @backstage/cli-common 0.1.14 @backstage/cli-node 0.2.7 @backstage/cli 0.27.0 @backstage/config-loader 1.8.1, 1.9.0 @backstage/config 1.2.0 @backstage/core-app-api 1.14.1, 1.14.2 @backstage/core-compat-api 0.2.8 @backstage/core-components 0.14.10, 0.14.9 @backstage/core-plugin-api 1.9.3 @backstage/dev-utils 1.0.37 @backstage/e2e-test-utils 0.1.1 @backstage/errors 1.2.4 @backstage/eslint-plugin 0.1.8 @backstage/frontend-plugin-api 0.6.7, 0.7.0 @backstage/integration-aws-node 0.1.12 @backstage/integration-react 1.1.29, 1.1.30 @backstage/integration 1.13.0, 1.14.0 @backstage/plugin-api-docs 0.11.8 @backstage/plugin-app-backend 0.3.73 @backstage/plugin-app-node 0.1.24 @backstage/plugin-auth-backend-module-atlassian-provider 0.2.5 @backstage/plugin-auth-backend-module-aws-alb-provider 0.1.17 @backstage/plugin-auth-backend-module-azure-easyauth-provider 0.1.7 @backstage/plugin-auth-backend-module-bitbucket-provider 0.1.7 @backstage/plugin-auth-backend-module-cloudflare-access-provider 0.2.1 @backstage/plugin-auth-backend-module-gcp-iap-provider 0.2.19 @backstage/plugin-auth-backend-module-github-provider 0.1.21 @backstage/plugin-auth-backend-module-gitlab-provider 0.1.21 @backstage/plugin-auth-backend-module-google-provider 0.1.21 @backstage/plugin-auth-backend-module-guest-provider 0.1.10 @backstage/plugin-auth-backend-module-microsoft-provider 0.1.19 @backstage/plugin-auth-backend-module-oauth2-provider 0.2.5 @backstage/plugin-auth-backend-module-oauth2-proxy-provider 0.1.17 @backstage/plugin-auth-backend-module-oidc-provider 0.2.6 @backstage/plugin-auth-backend-module-okta-provider 0.0.17 @backstage/plugin-auth-backend-module-onelogin-provider 0.1.5 @backstage/plugin-auth-backend 0.22.12 @backstage/plugin-auth-node 0.4.17, 0.5.1 @backstage/plugin-auth-react 0.1.5 @backstage/plugin-bitbucket-cloud-common 0.2.22 @backstage/plugin-catalog-backend-module-github 0.7.2 @backstage/plugin-catalog-backend-module-scaffolder-entity-model 0.1.22 @backstage/plugin-catalog-backend 1.25.2 @backstage/plugin-catalog-common 1.0.25, 1.0.26 @backstage/plugin-catalog-graph 0.4.8 @backstage/plugin-catalog-node 1.12.4, 1.12.6 @backstage/plugin-catalog-react 1.12.2, 1.12.3 @backstage/plugin-catalog 1.22.0 @backstage/plugin-events-node 0.3.10, 0.3.8 @backstage/plugin-home-react 0.1.15, 0.1.16 @backstage/plugin-home 0.7.9 @backstage/plugin-kubernetes-backend 0.18.5 @backstage/plugin-kubernetes-common 0.8.2 @backstage/plugin-kubernetes-node 0.1.18 @backstage/plugin-kubernetes-react 0.4.2 @backstage/plugin-kubernetes 0.11.13 @backstage/plugin-org 0.6.28 @backstage/plugin-permission-backend-module-allow-all-policy 0.1.21 @backstage/plugin-permission-backend 0.5.48 @backstage/plugin-permission-common 0.7.14, 0.8.0, 0.8.1 @backstage/plugin-permission-node 0.8.0, 0.8.2 @backstage/plugin-permission-react 0.4.24, 0.4.25 @backstage/plugin-proxy-backend 0.5.5 @backstage/plugin-scaffolder-backend-module-azure 0.1.16 @backstage/plugin-scaffolder-backend-module-bitbucket-cloud 0.1.14 @backstage/plugin-scaffolder-backend-module-bitbucket-server 0.1.14 
@backstage/plugin-scaffolder-backend-module-bitbucket 0.2.14 @backstage/plugin-scaffolder-backend-module-gerrit 0.1.16 @backstage/plugin-scaffolder-backend-module-gitea 0.1.14 @backstage/plugin-scaffolder-backend-module-github 0.4.2 @backstage/plugin-scaffolder-backend-module-gitlab 0.4.6 @backstage/plugin-scaffolder-backend 1.24.1 @backstage/plugin-scaffolder-common 1.5.4, 1.5.5 @backstage/plugin-scaffolder-node 0.4.10, 0.4.8 @backstage/plugin-scaffolder-react 1.11.0 @backstage/plugin-scaffolder 1.24.0 @backstage/plugin-search-backend-module-catalog 0.2.1 @backstage/plugin-search-backend-module-techdocs 0.2.1 @backstage/plugin-search-backend-node 1.3.1 @backstage/plugin-search-backend 1.5.16 @backstage/plugin-search-common 1.2.13, 1.2.14 @backstage/plugin-search-react 1.7.13, 1.7.14 @backstage/plugin-search 1.4.15 @backstage/plugin-signals-react 0.0.4 @backstage/plugin-techdocs-backend 1.10.12 @backstage/plugin-techdocs-common 0.1.0 @backstage/plugin-techdocs-module-addons-contrib 1.1.13 @backstage/plugin-techdocs-node 1.12.10 @backstage/plugin-techdocs-react 1.2.7 @backstage/plugin-techdocs 1.10.8 @backstage/plugin-user-settings-common 0.0.1 @backstage/plugin-user-settings 0.8.11 @backstage/release-manifests 0.0.11 @backstage/test-utils 1.5.10 @backstage/theme 0.5.6 @backstage/types 1.1.1 @backstage/version-bridge 1.0.8

👀 Have you spent some time to check if this bug has been raised before?

  • [X] I checked and didn't find similar issue

🏢 Have you read the Code of Conduct?

Are you willing to submit PR?

No, but I'm happy to collaborate on a PR with someone else

CiscoRob avatar Sep 13 '24 00:09 CiscoRob

Running with debug logging picking up:

[1] 2024-09-13T00:32:47.283Z rootHttpRouter error Request failed with status 500 Unable to generate legacy token for communication with the 'catalog' plugin. You will typically encounter this error when attempting to call a plugin that does not exist, or is deployed with an old version of Backstage; caused by Error: Unable to generate legacy token, no legacy keys are configured in 'backend.auth.keys' or 'backend.auth.externalAccess' type=errorHandler cause=Error: Unable to generate legacy token, no legacy keys are configured in 'backend.auth.keys' or 'backend.auth.externalAccess' name=Error stack=Error: Unable to generate legacy token for communication with the 'catalog' plugin. You will typically encounter this error when attempting to call a plugin that does not exist, or is deployed with an old version of Backstage; caused by Error: Unable to generate legacy token, no legacy keys are configured in 'backend.auth.keys' or 'backend.auth.externalAccess' [1] at <anonymous> (/Users/coderrob/go/src/backstage-collab-portal/node_modules/@backstage/backend-defaults/src/entrypoints/auth/DefaultAuthService.ts:175:17)

This seems like the techdocs bug mentioned in the https://github.com/backstage/backstage/releases/tag/v1.30.1 release notes.

Still trying to sort what / where / why this is failing.

CiscoRob avatar Sep 13 '24 00:09 CiscoRob

Screenshot 2024-09-12 at 8 35 28 PM

Bumped to

{
  "version": "1.31.0-next.2"
}

Still happening.

To see whether this exists in the current Backstage create-app, I created a fresh Backstage application, made sure version 1.30.4 was used, and then added an app-config.local.yaml to point the catalog at the static catalog data on my system.

It has the same behavior. It bombs trying to fetch any data: it works the first time and then all subsequent requests bomb or time out.

Again, this behavior is not something that has existed in the versions prior to 1.30.x.


New app screens

Screenshot 2024-09-12 at 8 47 39 PM

Screenshot 2024-09-12 at 8 48 05 PM

Screenshot 2024-09-12 at 8 46 24 PM

CiscoRob avatar Sep 13 '24 01:09 CiscoRob

Screenshot 2024-09-13 at 12 51 24 PM

Screenshot 2024-09-13 at 12 51 43 PM

CiscoRob avatar Sep 13 '24 17:09 CiscoRob

Gave up on the 1.30.x versions and downgraded to 1.29.2. I used package.json resolutions to try to find working versions of the 1.30.x packages, but the freezing behavior continued without additional warnings. I then used backstage-cli bump to downgrade to 1.29.2 and things are back to functioning without freezing or failing.

CiscoRob avatar Sep 13 '24 21:09 CiscoRob

We've walked through the changes in 1.30 and what you've reported here, and so far nothing stands out 🤔. The main thing in the changelog is really the node-fetch bump; is it enough to downgrade node-fetch to get this working?

Would you be able to provide a repro repo or similar? In that case we can get a bit more hands on in helping out to get this fixed.

I'm also wondering to what extent you've pinned down when exactly this lockup happens, as in is it always after a specific request or scheduled job? For example, if you curl the backend separately, is there a specific request that the frontend makes that triggers this?

Rugvip avatar Sep 16 '24 10:09 Rugvip

@Rugvip I took enough notes on the state of things that I should be able to take the codebase, strip it down entirely, and try to reproduce it.

The way I was able to reproduce the behavior was:

yarn dev:debug (LOG_LEVEL=debug yarn dev)
.... wait till all the catalog data loads and then I see the timestamp shown for the next refresh ....

Once the catalog was loaded I'd click to login with the "guest" provider.

The site would load our placeholder "overview" page (not using the customizable homepage yet).

I'd then click on the "Catalog" from the left-side nav bar.

The catalog would load fine with the ~2k catalog items, and it'd do so quickly (<500ms)

Then I'd click on the "Teams" left-side nav bar

(for reference) Screenshot 2024-09-16 at 12 01 18 PM

(teams) Screenshot 2024-09-16 at 12 02 40 PM

(users) Screenshot 2024-09-16 at 12 04 41 PM

I'd watch the network tab during the clicks to see what was called and where it stalled. I'd see the 404 on the cookies check, and it behaved well clicking catalog or clicking teams. It was after the first load of data that it seemed to trigger an auth /refresh, and that fetch would block the site.

Other times it would be just clicking the catalog or teams link: I'd see the by-query call being made, it would block, and the entire site became unresponsive. Even trying to ctrl-c the running service would not respond.

After you ctrl-c you'll see some unhandled rejections within a second or two and a number of "Shutdown hook failed, Error: IPC request 'DevDataStore.save' with ID..." messages.

In the cases of the frozen fetch it's completely non-responsive.

The behavior made it feel like it was working for a request or two and then just locked.

Grabbed some screenshots during the freeze and the endpoints that were usually impacted were:

  • /by-query (IdentityAuthInjectorFetchMiddleware)
  • /refresh (ProxiedSignInIdentify.esm.js)
  • /entities?filter=kind%3Dcomponent (IdentityAuthInjectorFetchMiddleware)

I set Chrome to stop at unhandled and handled exceptions to see if I could work backwards to what was causing it, and it seemed to be within @backstage/core-components.

Screenshot 2024-09-12 at 6 32 12 PM

The one thing I was starting to poke around on was that "backend-common" and "backend-app-api" have dependencies on "plugin-auth-node" with two different minor revisions, 0.5.x vs 0.4.x.

backend-common

backend-app-api

I noticed that backend-test-utils depends on the 0.4.x version and is commonly used in backend plugins.

backend-test-utils

I needed to get our beta environment out of a bad state, so at that point I used the Backstage CLI to bump down to version 1.29.2 and was able to get it back to operating normally.

The "backend-test-utils" is still "sus" as my kids say.

We definitely have a vested interest in discovering the cause / solution since it looks like the upgrade would be blocking. :)

I have some priority to-dos but I'll find some time to create a sharable reproduction.

From what I tried...

  • auditing each of the Backstage dependencies taking "info" lists before each / after each action
  • stripping backend index.ts to only import core backstage plugins for catalog, auth, search, permissions, and techdocs
  • assigning fixed versions using package.json resolutions in packages/app and packages/backend
    • from the info list tried older versions / newer versions
  • used guest auth, used with OIDC and no guest
  • stripped down loading of our catalog data to subset of groups, users, domains, systems, and other services/resources
  • removed yarn.lock
  • used bump with "main"
  • used bump with "next"
  • removed the "node-fetch" packages from plugins
  • running backstage-cli migrate commands (just in case...)
  • running backstage-cli repo fix / clean / list-deprecations (to see if anything new popped up)
  • using git clean -xdf to ensure no cruft was leftover in the repo to cause oddities before any changes / testing

I didn't try explicitly calling the Backend api to see if I could manually trigger the problem. I have a CLI tool that was built to do data-integrity checks and some self-healing for our Backstage data.

I have a stripped-down version of Backstage there running 1.29.2 that I can bump to 1.30.4 and point at the catalog. I saw similar behaviors there, and the number of users loaded into the catalog seemed to be what triggered the condition.

CiscoRob avatar Sep 16 '24 17:09 CiscoRob

Would you mind sharing a reproduction repo with us 🙏🏻 ? You could create a fake provider that inserts tons of entities to simulate the error in there.

camilaibs avatar Sep 19 '24 10:09 camilaibs

@camilaibs I've created some sample 50k entities on my branch https://github.com/deejay1/backstage/tree/perf which should help debug it

deejay1 avatar Sep 20 '24 09:09 deejay1

Sorry for the delay. We're dealing with a large layoff that impacted our already tiny team.

We stabilized on 1.29.2 for everything but search. Search goes from working, then not working, then working, and then not working. Flipping back and forth between HTTP 500 MissingIndexError and working without problems. It's completely erratic.

I set today aside in our sprint to create the reproduction for y'all and to do some more deep diving.

PS: Thanks for the help and attention all.

One second... Screenshot 2024-09-20 at 8 44 10 AM

The next... Screenshot 2024-09-20 at 8 44 15 AM

Screenshot 2024-09-20 at 8 44 23 AM

And working... Screenshot 2024-09-20 at 8 47 48 AM

And not working... Screenshot 2024-09-20 at 8 48 29 AM

Screenshot 2024-09-20 at 8 48 58 AM

CiscoRob avatar Sep 20 '24 13:09 CiscoRob

I see that 1.31.1 was released. Going to try bumping from 1.29.2 to 1.31.1 first to see if that resolves the problems seen with 1.30.x.

I cloned the repo with another name and will bump it to 1.30.4 and compare with 1.31.1.

Screenshot 2024-09-20 at 8 58 02 AM

🤞

CiscoRob avatar Sep 20 '24 13:09 CiscoRob

After a git clean -xdf, yarn, and a bump to 1.31.1, I cleared the cache, did a hard empty/reload, and tried with incognito.

Did the yarn dev with debug enabled. Refreshed the page... annnnnnd stuck.

[1] 2024-09-20T15:24:39.081Z rootHttpRouter info [20/Sep/2024:15:24:39 +0000] "POST /api/permission/authorize HTTP/1.1" 200 74 "-" "node-fetch/1.0 (+https://github.com/bitinn/node-fetch)" type=incomingRequest
[1] 2024-09-20T15:24:39.155Z rootHttpRouter info [20/Sep/2024:15:24:39 +0000] "GET /api/auth/v1/userinfo HTTP/1.1" 200 93 "-" "node-fetch/1.0 (+https://github.com/bitinn/node-fetch)" type=incomingRequest
[1] 2024-09-20T15:24:39.158Z rootHttpRouter info [20/Sep/2024:15:24:39 +0000] "POST /api/permission/authorize HTTP/1.1" 200 74 "-" "node-fetch/1.0 (+https://github.com/bitinn/node-fetch)" type=incomingRequest
[1] 2024-09-20T15:24:39.230Z rootHttpRouter info [20/Sep/2024:15:24:39 +0000] "GET /api/catalog/entities?filter=kind%3DComponent HTTP/1.1" 304 - "http://localhost:3000/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36" type=incomingRequest
[1] 2024-09-20T15:24:39.254Z rootHttpRouter info [20/Sep/2024:15:24:39 +0000] "GET /api/auth/v1/userinfo HTTP/1.1" 200 93 "-" "node-fetch/1.0 (+https://github.com/bitinn/node-fetch)" type=incomingRequest
[1] 2024-09-20T15:24:39.255Z rootHttpRouter info [20/Sep/2024:15:24:39 +0000] "POST /api/permission/authorize HTTP/1.1" 200 74 "-" "node-fetch/1.0 (+https://github.com/bitinn/node-fetch)" type=incomingRequest

Inspected page, clicked on network, hit refresh, and it gets stuck (guest enabled / development branch).

Screenshot 2024-09-20 at 10 28 43 AM

Sat like this for a few minutes... did the inspector with the network tab ^^

Screenshot 2024-09-20 at 10 26 53 AM

... after a minute or so I clicked refresh again. This time it loaded the page but then immediately blocked again.

Screenshot 2024-09-20 at 10 29 53 AM

Screenshot 2024-09-20 at 10 29 33 AM

Going to give it a few minutes, but seems like the behaviors still exist.

I'll go about stripping it down to a shareable version bit by bit, ensuring the bad behaviors persist after each change.

CiscoRob avatar Sep 20 '24 15:09 CiscoRob

Screenshot 2024-09-20 at 10 34 37 AM

Ok, the requests resolved after 4.2 minutes.

I noticed that right at the point when things resolved, a whole bunch of @backstage/plugin-search-backend-module-techdocs failures started flashing into the server output:

Screenshot 2024-09-20 at 10 34 27 AM

After that it started listing a bunch of paginated requests to the catalog entities.

Screenshot 2024-09-20 at 10 37 25 AM

CiscoRob avatar Sep 20 '24 15:09 CiscoRob

The catalog was fully loaded. I clicked on an entry and saw the details page. Everything was quick.

Clicked on "APIs" .... and back to blocked

Screenshot 2024-09-20 at 10 38 26 AM

CiscoRob avatar Sep 20 '24 15:09 CiscoRob

OK, interestingly it took another ~4 minutes for the request to finish, and then there was another flood of techdocs failures.

Screenshot 2024-09-20 at 10 48 53 AM

Screenshot 2024-09-20 at 10 50 03 AM

Screenshot 2024-09-20 at 10 50 47 AM

CiscoRob avatar Sep 20 '24 15:09 CiscoRob

Thank you for your quick response everybody. Could you try enabling pagination on the catalog index page and see if that helps? Here's how to do it: https://backstage.io/docs/next/features/software-catalog/catalog-customization/#pagination
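
For reference, in a generated app that typically means passing the pagination prop to CatalogIndexPage in packages/app/src/App.tsx. A minimal sketch, trimmed to the relevant route (prop name per the docs linked above; the rest of the generated routes are assumed unchanged):

import React from 'react';
import { Route } from 'react-router-dom';
import { FlatRoutes } from '@backstage/core-app-api';
import { CatalogIndexPage } from '@backstage/plugin-catalog';

const routes = (
  <FlatRoutes>
    {/* Pass the pagination prop to the catalog index page */}
    <Route path="/catalog" element={<CatalogIndexPage pagination />} />
    {/* ...the rest of the generated routes stay unchanged... */}
  </FlatRoutes>
);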

camilaibs avatar Sep 23 '24 14:09 camilaibs

Hmm, I think it's enabled by default already https://github.com/backstage/backstage/blob/v1.31.1/packages/app/src/App.tsx#L122

deejay1 avatar Sep 23 '24 14:09 deejay1

Hi @deejay1, that's just the example app for this repo and not what gets used when generating a Backstage instance using npx @backstage/create-app@latest - that would be what's here: https://github.com/backstage/backstage/blob/56becbfb19b85361c43707a6553615fa1e16a360/packages/create-app/templates/default-app/packages/app/src/App.tsx#L67

awanlin avatar Sep 23 '24 15:09 awanlin

Hey all, I was able to use create-app on version 1.31.1 and it seems like I can reproduce it with our data, running on a completely stripped-down version.

Going to see about making some data scrubbers so I can keep the behaviors in data I can share for external testing.

Screenshot 2024-09-24 at 2 04 10 PM

Screenshot 2024-09-24 at 2 04 42 PM

Screenshot 2024-09-24 at 2 09 12 PM

CiscoRob avatar Sep 24 '24 19:09 CiscoRob

@freben could use your thoughts... I had some time to debug the setup and found something that matches the behaviors I was seeing.

Running the site locally, I stepped through and found a dependency that looks like it's potentially related.

https://github.com/backstage/backstage/blob/2e406568d20749de27af7b2854b99c254dd824c2/plugins/catalog/src/apis/EntityPresentationApi/DefaultEntityPresentationApi.ts#L370

This DefaultEntityPresentationApi depends on dataloader version ^2.0.0. In the yarn.lock the resolution is set to ^2.2.2.

Backstage libraries that depend on dataloader ^2.0.0:

  • @backstage/plugin-catalog
  • @backstage/plugin-permission-backend
  • @backstage/plugin-search-backend ....

https://github.com/graphql/dataloader/releases shows that some updates in 2.2.2 changed behaviors related to how it batches and how it fails. It looks like prior to 2.2.2 the call to .load() could throw an error, which is fine since the code handles it, but it could also launch a task unintentionally.

Wanted to call that out as something I saw while checking yarn.lock differences against the versions defined in package.json. The blocking batches and the "cache" state implementation in the DefaultEntityPresentationApi seemed suspect as potential "block" points.
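
To make that suspicion concrete, here's a minimal, illustrative sketch of the dataloader knobs involved. The maxBatchSize and batchScheduleFn option names come from the dataloader library itself; the 100 / 50ms values and the batch function are assumptions standing in for what DefaultEntityPresentationApi appears to do, not confirmed values:

import DataLoader from 'dataloader';

// Hypothetical batch function: each dispatched batch stands in for one
// catalog API round trip. With ~14k keys and maxBatchSize 100 that is
// roughly 145 batches, each also paying the 50ms schedule delay.
const loader = new DataLoader<string, { ref: string }>(
  async entityRefs => entityRefs.map(ref => ({ ref })),
  {
    maxBatchSize: 100, // assumed to mirror the presentation API defaults
    batchScheduleFn: callback => setTimeout(callback, 50), // 50ms batch delay
  },
);

// Usage: thousands of individual .load() calls get coalesced into batches.
Promise.all(
  Array.from({ length: 14500 }, (_, i) => loader.load(`user:default/u${i}`)),
).then(results => console.log(`loaded ${results.length} refs`));

If anything causes those batches to run one after another rather than in parallel, the total time scales linearly with the number of entity refs, which matches the shape of the delays described below.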

What I see happening is that the first call when the "catalog" page loads goes to:

  1. plugin-catalog-react/hooks/useEntityListProvider CatalogClient => DefaultApi.client.ts => getEntities(...

. . . once this loads several queries fire off to request the catalog client for the same data with filters applied for the catalog page filters (should they be using the entity list instead of making catalog api calls for the same data w/ filters?) . . .

  2. plugin-catalog-react/hooks/useOwnedEntitiesCount CatalogClient => DefaultApi.client.ts => getEntitiesByQuery(...
  3. UserListPicker: plugins/catalog-react/src/components/UserListPicker/useAllEntitiesCount.ts CatalogClient => DefaultApi.client.ts => getEntitiesByQuery(...
  4. UserListPicker: plugins/catalog-react/src/components/UserListPicker/useStarredEntitiesCount.ts CatalogClient => DefaultApi.client.ts => getEntitiesByQuery(...
  5. EntityAutocompletePicker: plugin-catalog-react/components/EntityAutocompletePicker.tsx CatalogClient => DefaultApi.client.ts => getEntityFacets(...

The first page request of data seems to be fine. The data is present and interactable. There are still several "pending" promises that return loading true from useAsync for the filter data requests.

If you change the "kind" filter from "component" to "user", the page loads with the table in a "loading" state. At this point it just freezes: it stops responding, the backend server cannot be terminated by ctrl-c (SIGINT), and it's all locked.

The thing I noticed was that requests 2 through 5 consistently sit in a pending state and all get "unblocked" after a consistent delay. The duration of the pending delay seems relative to the amount of data it's trying to load.

It gets much worse going from "components" (~2k, around 4-7 seconds locked) to "users" (~14k), where the delay has been pretty consistent at around 1.4 minutes (with no debugger attached).

Screenshot 2024-09-25 at 4 38 54 PM

I did some checks against the behavior and the timings I was seeing when loading 14,500 users with a latency of around 200-500ms for round-trip calls, given the default "limit" of 20 items, the 100 max requests, and the 50ms setTimeout batch delay.

If you average it out, it could take about ~5 minutes, which is what we're seeing in our cloud-hosted integration environment. Locally it's averaging around ~1.4 minutes, give or take a few seconds, because of the faster local round-trip time.
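
A back-of-envelope on those numbers lines up with what I'm seeing, assuming the batches effectively run one after another (the batch size and delay are the assumed defaults from above, and the cloud round-trip figure is a guess):

// Rough estimate only; every constant here is an assumption or an observed
// ballpark, not a measured Backstage configuration value.
const users = 14500;
const maxBatchSize = 100; // assumed max keys per batch
const batchDelayMs = 50; // assumed setTimeout batch delay
const localRoundTripMs = 500; // observed local latency upper bound
const cloudRoundTripMs = 2000; // rough guess for the hosted environment

const batches = Math.ceil(users / maxBatchSize); // 145

const estimateMinutes = (roundTripMs: number) =>
  ((batchDelayMs + roundTripMs) * batches) / 60_000;

console.log(estimateMinutes(localRoundTripMs).toFixed(1)); // ~1.3 minutes
console.log(estimateMinutes(cloudRoundTripMs).toFixed(1)); // ~5.0 minutes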

I could be way off, but I saw recent-ish (< 2 months) work in that area around creating a "state" of entities, along with an update to only return a limited set of fields for entities. That seems to match up with updates that went into the 1.30.x versions and exist there currently.

I know that assumes the cursor batching behavior when I'm not using cursor paging. I may have been chasing a red herring, but it seems like something is blocking on batched requests to the backend API. The timings seem to match really closely to the default timings I saw in DefaultEntityPresentationApiOptions, which made the dataloader seem a little suspicious.

viewer.zip

Attaching the stripped down viewer I've been using to test behaviors with.

I'm sorry it's taking so long, but I'm still working on getting a sanitized, sharable reproduction of our data. I'll see if I can spare some personal time this weekend to get something together.

CiscoRob avatar Sep 27 '24 18:09 CiscoRob

. . once this loads several queries fire off to request the catalog client for the same data with filters applied for the catalog page filters (should they be using the entity list instead of making catalog api calls for the same data w/ filters?) . . .

This is correct, as those API calls are used to resolve the number of "owned/total/starred" entities. I feel like there is something else keeping your backend busy. Probably the request that keeps the backend busy isn't even initiated from the frontend.

vinzscam avatar Oct 01 '24 08:10 vinzscam

Well it COULD be that some of those filter calls turn out to be so expensive that they hold up backend execution.

freben avatar Oct 01 '24 13:10 freben

Well it COULD be that some of those filter calls turn out to be so expensive that they hold up backend execution.

But I guess the load would be at the database level, not in Node. I don't think we do anything crazy that holds up resources for seconds.

vinzscam avatar Oct 07 '24 11:10 vinzscam

I ran into this issue in our environment. I'm not entirely sure why, but through testing I identified this commit as the one that caused the performance issues for us: https://github.com/backstage/backstage/commit/01c673b9ddd1b8ac0d226d08464c883ba521eed2

I performed the update to our instance and rolled back only the code changes in that commit within the catalog-backend plugin, and performance returned to normal.

adamj29 avatar Oct 14 '24 13:10 adamj29

@adamj29 whaaaaaat O_o. Are you absolutely sure? 😅. We're staring at that piece of code and can't conceive any way in which that change could lead to a performance degradation.

Rugvip avatar Oct 16 '24 14:10 Rugvip

Hi,

Have you tried profiling the backend according to: https://backstage.io/docs/tooling/local-dev/profiling ?

We also have quite a big catalog with 25k+ entities and found some issues both in backstage and in our own code.

I think this happens because something is eating all the CPU and the event loop gets stuck. Do you have any info about resource usage from both the backend and the database?
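
One cheap way to confirm the stuck-event-loop theory is to sample event loop delay in the backend process while reproducing. A minimal sketch using Node's built-in perf_hooks (a generic Node diagnostic, not something Backstage-specific):

import { monitorEventLoopDelay } from 'perf_hooks';

// Sample event loop delay; if synchronous work is blocking the loop, the
// percentile/max values (reported in nanoseconds) will spike into seconds
// once the loop unblocks and the interval fires again.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
  const p99Ms = histogram.percentile(99) / 1e6;
  const maxMs = histogram.max / 1e6;
  console.log(`event loop delay p99=${p99Ms.toFixed(1)}ms max=${maxMs.toFixed(1)}ms`);
  histogram.reset();
}, 5000);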

drodil avatar Oct 16 '24 14:10 drodil

@adamj29 it's weird, because the new code should be faster. I tried benchmarking it:

node --es-module-specifier-resolution=node ./dist/index.js
┌─────────┬───────────┬─────────────┬────────────────────┬──────────┬─────────┐
│ (index) │ Task Name │ ops/sec     │ Average Time (ns)  │ Margin   │ Samples │
├─────────┼───────────┼─────────────┼────────────────────┼──────────┼─────────┤
│ 0       │ 'slower'  │ '3,354,487' │ 298.10813566299095 │ '±0.82%' │ 335449  │
│ 1       │ 'faster'  │ '2,196,327' │ 455.3054732213372  │ '±0.71%' │ 219633  │
└─────────┴───────────┴─────────────┴────────────────────┴──────────┴─────────┘

Code is here:

https://github.com/deejay1/benchme

tsc && node --es-module-specifier-resolution=node ./dist/index.js

deejay1 avatar Oct 16 '24 14:10 deejay1

@adamj29 whaaaaaat O_o. Are you absolutely sure? 😅. We're staring at that piece of code and can't conceive any way in which that change could lead to a performance degradation.

ran the same test again today and the performance is nearly identical, must have been something else 🙃

adamj29 avatar Oct 16 '24 14:10 adamj29

I wonder if https://github.com/backstage/backstage/pull/27167 made this better.

freben avatar Oct 23 '24 14:10 freben

@CiscoRob is it possible to check how much data those by-refs calls are returning to the frontend? I have the feeling the issue might be caused by https://github.com/backstage/backstage/pull/25366

vinzscam avatar Oct 24 '24 09:10 vinzscam

After upgrading to version 1.31.1, I encountered timeout errors like @CiscoRob mentioned in his screenshots while loading the home page and hitting endpoints, causing Backstage to load entities slowly. Initially I considered reverting to an older Backstage version, but instead I rolled back the specific dependency @backstage/core-app-api, which contains the IdentityAuthInjectorFetchMiddleware code. I also downgraded related dependencies like @backstage/app-defaults, @backstage/plugin-user-settings, and others, as well as @backstage/plugin-catalog-backend, which manages the API endpoints. Additionally, I reverted versions of dependencies that interact with plugin-catalog-backend.

After these changes, Backstage now loads faster and no longer encounters timeout issues.

pshenbagadelip avatar Nov 12 '24 07:11 pshenbagadelip