
[BUG] - BucketRefreshScheduler is single node only

Open veenarm opened this issue 1 month ago • 4 comments

The default scope of a scheduled task is global, as per https://github.com/backstage/backstage/blob/ebeaec3cdf7b18d241426854b47913d608e397b4/packages/backend-plugin-api/src/services/definitions/SchedulerService.ts#L146, which means the bucket refresh only runs once per interval on a single (effectively random) host. Buckets can therefore be out of sync for end users if they hit a different host that wasn't refreshed.

This isn't an issue if you only have one pod/node, but we're now seeing bucket refreshes work on the host I'm routed to while other users are tied to a different pod where it isn't working.

There is a scope: local value which can be set in https://github.com/spreadshirt/backstage-plugin-s3/blob/18955ff72b540ea784a30f9d444462f4bb4b6d02/plugins/s3-viewer-backend/src/service/S3BucketsProvider.ts#L57

This will cause the task to run on every node.
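
For reference, here is a minimal sketch of scheduling a refresh task with local scope via the new backend system's SchedulerService; the plugin id, task id, interval, and refresh body are placeholders for illustration, not the plugin's actual wiring:

```ts
import { coreServices, createBackendPlugin } from '@backstage/backend-plugin-api';

// Sketch only: a task scheduled with scope 'local' runs on every node,
// whereas the default scope 'global' runs once per interval across the cluster.
export const s3RefreshExamplePlugin = createBackendPlugin({
  pluginId: 'example-s3-refresh',
  register(env) {
    env.registerInit({
      deps: { scheduler: coreServices.scheduler, logger: coreServices.logger },
      async init({ scheduler, logger }) {
        await scheduler.scheduleTask({
          id: 'refresh-s3-buckets',
          frequency: { minutes: 30 },
          timeout: { minutes: 5 },
          scope: 'local', // default is 'global'
          fn: async () => {
            logger.info('Refreshing S3 buckets on this node');
            // the actual bucket refresh would go here
          },
        });
      },
    });
  },
});
```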

I see this as a generic solution; however, the ability to let the end user choose would be ideal.

Why?

Some may choose to override the S3BucketsProvider so that the buckets aren't stored locally and are instead saved to and looked up from the CacheService (Redis/Memcached/Infinispan etc...). This is another solution to the aforementioned issue.
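
As a sketch of that idea only (the BucketDetails shape, method names, cache key, and TTL below are assumptions, not the plugin's real S3BucketsProvider interface), lookups would go through the CacheService instead of an in-memory copy:

```ts
import { CacheService } from '@backstage/backend-plugin-api';

// Hypothetical, simplified bucket shape for illustration; the real types in
// s3-viewer-backend differ.
type BucketDetails = { endpoint: string; bucket: string };

export class CacheBackedBucketsProvider {
  constructor(private readonly cache: CacheService) {}

  // Called by the scheduled refresh task: overwriting the full bucket list
  // means stale entries disappear on the next refresh.
  async storeBuckets(buckets: BucketDetails[]): Promise<void> {
    await this.cache.set('s3-buckets', JSON.stringify(buckets), {
      ttl: 60 * 60 * 1000, // 1 hour in ms; assumes the refresh runs well within this
    });
  }

  // Called on request paths: every node reads the same shared cache entry.
  async getAllBuckets(): Promise<BucketDetails[]> {
    const raw = await this.cache.get<string>('s3-buckets');
    return raw ? (JSON.parse(raw) as BucketDetails[]) : [];
  }
}
```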

I need to do some more tests; maybe the scheduler is running on all nodes all the time, just staggered so they don't run at the same time... It might be more related to the API refresh only happening on a certain endpoint.

veenarm avatar Nov 12 '25 04:11 veenarm

Nah, I'm pretty sure my original statement was correct :) As the task is global, the refresh happens on only one node/host and the others aren't getting that info.

veenarm avatar Nov 12 '25 04:11 veenarm

Apologies, I just realised that scope is actually available already!

The issue still exists, however, with how I use it with the API. The API only hits one node, so in our case only 1 in 6 replicas is up to date.

My workaround at the moment is to store the buckets in the CacheService and then have another task/schedule, running with local scope on every node, that pulls from the cache every minute and updates the local instance.
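
Roughly what that workaround looks like, as a sketch; refreshFromS3 and applyBuckets are hypothetical placeholders for listing buckets from S3 and updating the node-local provider state:

```ts
import { CacheService, SchedulerService } from '@backstage/backend-plugin-api';

export async function scheduleBucketSync(options: {
  scheduler: SchedulerService;
  cache: CacheService;
  refreshFromS3: () => Promise<unknown[]>;
  applyBuckets: (buckets: unknown[]) => void;
}) {
  const { scheduler, cache, refreshFromS3, applyBuckets } = options;

  // Global task: exactly one node talks to S3 and writes the shared cache.
  await scheduler.scheduleTask({
    id: 's3-buckets-refresh-global',
    scope: 'global',
    frequency: { minutes: 30 },
    timeout: { minutes: 5 },
    fn: async () => {
      const buckets = await refreshFromS3();
      await cache.set('s3-buckets', JSON.stringify(buckets));
    },
  });

  // Local task: every node pulls from the cache each minute and updates its
  // in-memory copy, so all replicas converge quickly.
  await scheduler.scheduleTask({
    id: 's3-buckets-sync-local',
    scope: 'local',
    frequency: { minutes: 1 },
    timeout: { seconds: 30 },
    fn: async () => {
      const raw = await cache.get<string>('s3-buckets');
      if (raw) {
        applyBuckets(JSON.parse(raw));
      }
    },
  });
}
```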

Ideally the interface could be async, so I could read from and store to the cache service directly rather than keeping a local instance copy.
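
Something along these lines, i.e. a hypothetical async variant of the provider interface (the method names and the BucketDetails shape are assumptions, not the current sync interface):

```ts
// Hypothetical shape for illustration only.
type BucketDetails = { endpoint: string; bucket: string };

export interface AsyncS3BucketsProvider {
  // Returning Promises lets an implementation read straight from the
  // CacheService instead of a node-local instance copy.
  getAllBuckets(): Promise<BucketDetails[]>;
  getBucketInfo(endpoint: string, bucket: string): Promise<BucketDetails | undefined>;
}
```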

Interested?

veenarm avatar Nov 12 '25 07:11 veenarm

I understand the point. In our case we are only using 1 replica, so we haven't experienced such issues.

Does the local option not ensure all hosts are updated properly? In any case, this could lead to load issues, or to inconsistencies if one or more pods have failing requests for some unknown reason.

Using the database might be an idea, but I'm not sure it's really worth storing the bucket information there. Otherwise, yes, the cache might be the best approach, but always taking care to remove old data that is no longer present, or to update it. Also, I think we should make sure that extension points can use this data too, or store it.

Sounds like a nice idea. Would you be willing to provide your solution with Redis?

ivangonzalezacuna avatar Nov 20 '25 08:11 ivangonzalezacuna

Yup, I'll do a write-up.

We extended the core classes and replaced the local instance variables with getCache implementations, using the CacheService.

We also added crypto so the content is encrypted in the cache.
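
Until the write-up, this is roughly the shape such a crypto layer can take, as a sketch using node:crypto with AES-256-GCM; it is not our exact implementation, and key management plus the CacheService wiring are left out:

```ts
import { randomBytes, createCipheriv, createDecipheriv } from 'node:crypto';

const ALGO = 'aes-256-gcm';

// Encrypt a value before writing it to the cache. The 32-byte key would come
// from configuration/secrets in a real setup.
export function encrypt(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // 96-bit IV as recommended for GCM
  const cipher = createCipheriv(ALGO, key, iv);
  const data = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag();
  // Store iv + auth tag + ciphertext together as one base64 string.
  return Buffer.concat([iv, tag, data]).toString('base64');
}

// Decrypt a value read back from the cache.
export function decrypt(payload: string, key: Buffer): string {
  const buf = Buffer.from(payload, 'base64');
  const iv = buf.subarray(0, 12);
  const tag = buf.subarray(12, 28);
  const data = buf.subarray(28);
  const decipher = createDecipheriv(ALGO, key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(data), decipher.final()]).toString('utf8');
}
```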

veenarm avatar Nov 20 '25 10:11 veenarm