
🐛 [storage-resize-images] executions & exponential costs in backfill

Open TomDUVAL-MAHE opened this issue 4 months ago • 48 comments

We are experiencing a critical incident related to the official Firebase storage-resize-images extension (Google). We have a function that has been running in backfill mode since July 16 to process our images. We have not made any changes to our extension configuration since July 16. However, starting on July 25, the function's number of calls per minute began to increase in several steps, with a critical increase on August 8 between 01:00 and 03:00. Screens 1–2 show the number of executions per second of the Firebase extension functions (red = image generation, blue = backfill), with a peak at 300 + 160 requests/s sustained for one hour over approximately four days.

Two functions are implicated and operate together:

  • ext-storage-resize-images-backfillResizedImages
  • ext-storage-resize-images-generateResizedImage

🔧 Environment & configuration

  • Region: europe-west1
  • Memory: FUNCTION_MEMORY=2048
  • Backfill: DO_BACKFILL=true
  • Types & sizes: IMAGE_TYPE=avif IMG_SIZES=640x480,1200x800,1920x1280 IS_ANIMATED=true OUTPUT_OPTIONS={"avif":{"quality":100}} SHARP_OPTIONS={"fit":"cover"}
  • Paths: RESIZED_IMAGES_PATH=thumbnails EXCLUDE_PATH_LIST=/maps/*/thumbnails,/pois/*/thumbnails
  • Other: DELETE_ORIGINAL_FILE=false MAKE_PUBLIC=true CONTENT_FILTER_LEVEL=OFF REGENERATE_TOKEN=false
  • Impacted instances: ext-storage-resize-images-backfillResizedImages ext-storage-resize-images-generateResizedImage
  • Dataset: ~600k images (JPG/PNG) → multi-size AVIF generation (~200 GB)

Important: We did not modify the extension's code, only its parameters via the .env file. We have not yet migrated to the recently published version, exposing additional concurrency/batch parameters.

⏱️ Timeline & metrics

IC = Instance Count, I/s = Instances per second

  • Jul 20 → 24: IC 0,015k → 0,017k ; I/s 1.5 → 2
  • Jul 25 → 29: IC 0,038k → 0,073k ; I/s 3.45 → 5
  • Jul 30 → Aug 7: IC 0,160k → 0,737k ; I/s 9 → 60
  • Aug 8 (00:00) → Aug 14 (shutdown): IC 0,748k → 1,700/1,800k ; I/s 59 → 150

➡️ Manual shutdown by our team on Aug 14 ~17:06 (local time)

🔍 Technical findings

  • Logs (extracts available) indicate the extension scans the entire bucket, including thumbnails despite EXCLUDE_PATH_LIST.
  • Thumbnails appear to be excluded from processing, but not from enumeration (list/scan).
  • This multiplies I/O operations as new thumbnails are created.
  • Hypothesis: combinatorial explosion of listing operations across successive thumbnail generations → exponential execution growth, cost drift, and runaway behavior (night-time spikes, without any deployment or user traffic correlation).
  • No explicit execution errors are visible in the logs (on a necessarily limited sample; several billions of lines in total).
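The difference between "excluded from processing" and "excluded from enumeration" described in the findings above can be sketched as follows. This is not the extension's code: `list_then_filter`, the sample object names, and the glob matching via Python's `fnmatch` module are assumptions for illustration only.

```python
from fnmatch import fnmatchcase


def list_then_filter(object_names, exclude_patterns):
    """Walk every object, then skip those under an excluded path.

    Hypothetical model: exclusion is applied only after listing, so
    excluded thumbnails still cost one enumeration step each.
    """
    listed = 0
    to_process = []
    for name in object_names:
        listed += 1  # counted whether or not the object is excluded
        if any(fnmatchcase(name, pattern + "/*") for pattern in exclude_patterns):
            continue  # excluded from processing only
        to_process.append(name)
    return listed, to_process


# Made-up object names; the patterns are the EXCLUDE_PATH_LIST values from the report.
objects = [
    "/maps/a/photo.jpg",
    "/maps/a/thumbnails/photo_640x480.avif",
    "/maps/a/thumbnails/photo_1200x800.avif",
    "/pois/b/pic.png",
    "/pois/b/thumbnails/pic_640x480.avif",
]
patterns = ["/maps/*/thumbnails", "/pois/*/thumbnails"]

listed, to_process = list_then_filter(objects, patterns)
print(listed, to_process)
```

In this model only two objects are processed, but all five are listed, and the listed count keeps growing as thumbnails are written.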

✅ Actions already taken (client side)

  • Immediate shutdown of both functions on Aug 14
  • Freeze on extension-related deployments until the root cause is clarified

💸 Financial impact

  • Our usage costs have spiked dramatically.

🚨 Status

This incident is blocking us as a client (costs, trust, service continuity).

Image Image

TomDUVAL-MAHE avatar Aug 18 '25 10:08 TomDUVAL-MAHE

This issue does not seem to follow the issue template. Make sure you provide all the required information.

google-oss-bot avatar Aug 18 '25 10:08 google-oss-bot

Hi, thanks for opening, we're investigating and will provide updates as soon as we have them

cabljac avatar Aug 19 '25 10:08 cabljac

Can you confirm which version of the extension you're on? And which parameters you've changed?

I think I can see the bug, I'm fairly sure there's an issue with how the extension enumerates things in the backfill.

Important: We did not modify the extension's code, only its parameters via the .env file. We have not yet migrated to the recently published version, exposing additional concurrency/batch parameters.

Can you let me know which parameters you changed? In particular if RESIZED_IMAGES_PATH and EXCLUDE_PATH_LIST changed at all?

cabljac avatar Aug 19 '25 10:08 cabljac

Hello @cabljac, To follow up on the issue on behalf of @TomDUVAL-MAHE (same company and team) and give you any extra details you might need.

The extension was installed through firebase.json with the following version and env config (see below).

To elaborate a little bit more on the severity of the issue - we're not talking about a spike from 10 to 50€, we're talking about the spike eating through our entire 2-year Google Cloud Startup Programme credit budget, exceeding it and leaving several thousand euros pending charge, triggering project suspension warnings. We're in the middle of a Firebase Support investigation to hopefully get the situation reversed and refunded, but with no resolution yet. Case number is 10373851 in case you have the ability to get in touch with them.

Thank you for looking into the issue.

As for "changing the env variables" we didn't do anything non-standard, we just followed the available parametrizations from the extension's documentation, specifying both the output folder name, and adding it to the exclusion path afterwards (we hoped the extension would be "smart" enough to ignore its own outputs, but we found out in logs that it was still scanning through the thumbnails, so we explicitly added the wildcard exclusion paths).

The extension worked well for weeks (albeit slowly, hence the performance ticket we created), without raising any suspicion until the spike occurred. We haven't yet had a chance to use the new backfillBatchSize parameter exposed in that intervention.

You can find our entire unmodified config below:

"extensions": {
    "storage-resize-images": "firebase/[email protected]"
 }


CONTENT_FILTER_LEVEL=OFF
DELETE_ORIGINAL_FILE=false
DO_BACKFILL=false
firebaseextensions.v1beta.function/location=europe-west1
FUNCTION_MEMORY=2048
IMAGE_TYPE=avif
IMG_SIZES=640x480,1200x800,1920x1280
IS_ANIMATED=true
MAKE_PUBLIC=true
OUTPUT_OPTIONS={"avif":{"quality":100}}
REGENERATE_TOKEN=false
RESIZED_IMAGES_PATH=thumbnails
SHARP_OPTIONS={"fit": "cover"}
EXCLUDE_PATH_LIST=/maps/*/thumbnails,/pois/*/thumbnails
IMG_BUCKET=XXXXXXX (redacted)

maciejgoscinski-latrace avatar Sep 04 '25 12:09 maciejgoscinski-latrace

Hi @maciejgoscinski-latrace , thanks for the follow up.

We have reproduced this issue and have a PR in the works here to resolve it.

We appreciate your patience so far, and will try to get this fix fully tested and released as soon as possible.

cabljac avatar Sep 04 '25 13:09 cabljac

Hi @maciejgoscinski-latrace @TomDUVAL-MAHE

I'm taking a closer look at this issue now - I don't think we actually reproduced it previously.

Taking a deeper look, I don't yet understand how the exponential growth would be happening, as images shouldn't be processed recursively due to metadata filtering.

I'll continue to investigate and try to reproduce. Thanks for your patience.

cabljac avatar Sep 08 '25 09:09 cabljac

This feels like it could be some kind of cascade effect with retries of functions at this scale, rather than exponential enumeration. Continuing to investigate.

cabljac avatar Sep 08 '25 10:09 cabljac

Great! Let us know if we can provide any more valuable information to help you.

maciejgoscinski-latrace avatar Sep 08 '25 10:09 maciejgoscinski-latrace

@maciejgoscinski-latrace thanks!

I'm wondering if it's something to do with parallel backfills happening, because of reconfiguring during a backfill? Did you reconfigure during the backfill? I think maybe the extension currently isn't smart enough to cancel a backfill in flight, this might have compounded things

I do think that enumeration of ALL objects in storage is something we need to fix as well, and is probably part of the problem here.

cabljac avatar Sep 08 '25 10:09 cabljac

This is a very likely scenario - I think we did change (and surely intended to change further) the parameters during the backfill, at least the one to add our thumbnail folders wildcard patterns to the exclude list. We were also going to change the config for allowed parallel threads to combat performance issues, but never got to try the newer version yet. This was because at the time, the backfill would've taken half a year to complete - something that seemed unusual for less than 300 GB of images (total size of the bucket, including thumbnails now).

maciejgoscinski-latrace avatar Sep 08 '25 11:09 maciejgoscinski-latrace

Yeah, the enumeration is definitely an issue, and the fact that the extension doesn't cancel an in-flight backfill would compound it I think -

Currently with one backfill process:

N images, with 3 output formats/sizes -> N + 3N enumerations + 3N generateResizedImage invocations = 7N

but with two concurrent backfills, both are making each other enumerate, so potentially:

2N + 12N enumerations + 6N generateResizedImage = 14N

in general i think it's potentially O(b^2) invocations, where b is the number of concurrent backfills

So there are at least two issues at play here, the enumeration and the backfill being unaware of the state of backfills.

EDIT: because they will be writing to the same object names it shouldn't be O(b^2N) invocations but just O(b N)
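A rough sketch of the counting model above, under the linear assumption from the EDIT (each concurrent backfill repeats the same enumeration and generation work). The function name `backfill_operations` and its parameters are hypothetical. It reproduces the quoted totals of 7N and 14N, but does not model the per-term breakdown given for two concurrent backfills.

```python
def backfill_operations(n_images, n_sizes=3, n_backfills=1):
    """Estimate total operations for a backfill, assuming O(b * N) scaling.

    Per backfill:
      - n_images enumerations of originals
      - n_sizes * n_images enumerations of generated thumbnails
      - n_sizes * n_images generateResizedImage invocations
    """
    enumerations = n_images + n_sizes * n_images
    invocations = n_sizes * n_images
    return n_backfills * (enumerations + invocations)


n = 600_000  # approximate dataset size from the original report
single = backfill_operations(n)                 # 7N
double = backfill_operations(n, n_backfills=2)  # 14N
print(single, double)
```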

cabljac avatar Sep 08 '25 11:09 cabljac

@maciejgoscinski-latrace

This is a very likely scenario - I think we did change (and surely intended to change further) the parameters during the backfill, at least the one to add our thumbnail folders wildcard patterns to the exclude list.

Though this would have happened around July 16 rather than later (July 25) when there was a big jump?

cabljac avatar Sep 08 '25 11:09 cabljac

You can see the invocations graph here Image

maciejgoscinski-latrace avatar Sep 08 '25 15:09 maciejgoscinski-latrace

@maciejgoscinski-latrace Thanks, do you have average execution time as well?

It is strange that we have so many successful invocations over such a period, that's ~80M invocations! something weird is happening for sure.

I am continuing to try and reproduce something like this in test projects.

cabljac avatar Sep 08 '25 15:09 cabljac

at least the one to add our thumbnail folders wildcard patterns to the exclude list.

ah just for context, did those thumbnail folders have lots of images in to start with? or were they empty?

cabljac avatar Sep 08 '25 15:09 cabljac

Could you see if you can provide logs that show when/how many times the extensions were reconfigured?

You may have to modify it, but a query like this in Log Explorer may get them:

(
  (resource.type="cloud_function" 
   resource.labels.function_name="ext-storage-resize-images-backfillResizedImages" 
   resource.labels.region="europe-west1"
   logName="projects/*/logs/cloudaudit.googleapis.com%2Factivity"
   protoPayload.methodName="google.cloud.functions.v1.CloudFunctionsService.UpdateFunction")
  OR
  (resource.type="cloud_function" 
   resource.labels.function_name="ext-storage-resize-images-generateResizedImage" 
   resource.labels.region="europe-west1"
   logName="projects/*/logs/cloudaudit.googleapis.com%2Factivity"
   protoPayload.methodName="google.cloud.functions.v1.CloudFunctionsService.UpdateFunction")
)
severity="NOTICE"
timestamp>="2025-07-16T00:00:00Z"
timestamp<="2025-08-14T18:00:00Z"

cabljac avatar Sep 09 '25 08:09 cabljac

Hello! Here are the following 4 metrics:

  • costs approaching 3k € a day
  • active instances approaching 60 and increasing until we shut the function down
  • 99th percentile executions around 1 minute
  • executions /s
Image Image Image Image

maciejgoscinski-latrace avatar Sep 10 '25 14:09 maciejgoscinski-latrace

Hi!

Thanks for the extra metrics.

Could you see if you can provide logs that show when/how many times the extensions were reconfigured?

Are you able to provide the logs I mentioned, using the query provided (you may need to edit it slightly)? It would help me trying to reproduce.

As an update i'm currently trying to reproduce with 600k images, not managed to with small images (and the same config) so I'll attempt next with larger images.

Image

cabljac avatar Sep 10 '25 14:09 cabljac

Meanwhile, thank you @cabljac for taking care of our issue. We're looking forward to providing you all the rest of the data, including the filtered logs, and we'd be very happy to have a chance to share larger volumes of diagnostics data with you over some safer channels.

In the meantime our efforts are primarily aimed at successfully getting the GCP Support to cancel the charge on our account. It seemed we've been having some progress going through Firebase support, until we received this message today, which totally disregards everything that's been said so far and just attributes it to fair usage pricing that we should pay for:

Image

From the spike, around 20k € have been subtracted from our platform credits, with an extra 10k € charge pending on top of that, which we have to pay by a deadline. A realistic price for converting 300 GB of images in a matter of days would be south of 30€, not 30 thousand. It's understandably crucial for us to first guarantee that the charge is dropped, before it becomes even more complicated to request a refund for a closed & paid invoice.

We would be very grateful if you could clearly confirm that from what we've learned so far, this is an abnormal behaviour of the extension, and it's not reasonable to expect us to pay the charge this high? It will be easier for us to help you investigate for as long as necessary if we first get the Billing situation sorted.

Thank you for your understanding and cooperation.

maciejgoscinski-latrace avatar Sep 10 '25 14:09 maciejgoscinski-latrace

@cabljac I attempted to check the logs (function name and region are correct) using the snippet you provided, but no results were returned in the Logs Explorer. Is there anything about the resourceType, logName or methodName that could require a different value in our case?

maciejgoscinski-latrace avatar Sep 10 '25 15:09 maciejgoscinski-latrace

I think the key part of the query would be protoPayload.methodName="google.cloud.functions.v1.CloudFunctionsService.UpdateFunction"

the function name resource.labels.function_name="ext-storage-resize-images-backfillResizedImages" would depend on your extension instance ID, this one is just the default
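Since the function names depend on the extension instance ID, the filter can be generated instead of edited by hand. Below is a minimal sketch that only assembles the filter text, using the fields from the query earlier in the thread. The helper name `update_function_filter` and the `my-resizer` / `my-project` values are placeholders. It fills in a concrete project ID in `logName` instead of `*`, on the assumption that an exact log name is the safer match.

```python
def update_function_filter(instance_id, region, project_id, start, end):
    """Build a Log Explorer filter for UpdateFunction audit entries.

    Function names follow the "ext-<instance id>-<function>" pattern
    seen in this thread; all argument values are supplied by the caller.
    """
    method = "google.cloud.functions.v1.CloudFunctionsService.UpdateFunction"
    log_name = f"projects/{project_id}/logs/cloudaudit.googleapis.com%2Factivity"
    clauses = []
    for suffix in ("backfillResizedImages", "generateResizedImage"):
        clauses.append(
            "\n".join(
                [
                    '  (resource.type="cloud_function"',
                    f'   resource.labels.function_name="ext-{instance_id}-{suffix}"',
                    f'   resource.labels.region="{region}"',
                    f'   logName="{log_name}"',
                    f'   protoPayload.methodName="{method}")',
                ]
            )
        )
    lines = [
        "(",
        "\n  OR\n".join(clauses),
        ")",
        'severity="NOTICE"',
        f'timestamp>="{start}"',
        f'timestamp<="{end}"',
    ]
    return "\n".join(lines)


query = update_function_filter(
    instance_id="my-resizer",  # placeholder extension instance ID
    region="europe-west1",
    project_id="my-project",   # placeholder project ID
    start="2025-07-16T00:00:00Z",
    end="2025-08-14T18:00:00Z",
)
print(query)
```

The printed text can be pasted into Log Explorer. If it still returns nothing, dropping the `severity` line or the `resource.labels` lines one at a time is a simple way to find which clause is too strict.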

cabljac avatar Sep 10 '25 18:09 cabljac

We would be very grateful if you could clearly confirm that from what we've learned so far, this is an abnormal behaviour of the extension, and it's not reasonable to expect us to pay the charge this high? It will be easier for us to help you investigate for as long as necessary if we first get the Billing situation sorted.

I've escalated this with the extensions team at Google for you

cabljac avatar Sep 11 '25 08:09 cabljac

@maciejgoscinski-latrace while we work on debugging the issue, can you create a support case with Google Cloud support regarding the spend increase? they'll investigate

i14h avatar Sep 11 '25 15:09 i14h

Hello gentlemen, thank you very much for your activity in this topic.

Regarding the logs you requested:

  • I'm still unable to produce anything in the logs, even though the name of our function / region has been checked and confirmed to be correct

Regarding the tickets already created with Google Support:

  • First Google Billing Support: 15/08, Case 62189403 -> resulted with a 30 day payment date extension.
  • Second Google Billing Support: 18/08, Case 62273820 -> 62338424 (tech team included), resulted with the rejection I pasted you before, which completely disregarded all context and discussion we had so far. I asked for reopening today, to postpone the payment by another 30 days until fully understood and refunded.
  • Firebase Support Case 10373851, 26/08, ongoing but openly limited in terms of only trying to cancel the pending charge, not refund the credits.
  • we're also in touch with Google's territory manager for France, and with you guys here on Github

maciejgoscinski-latrace avatar Sep 11 '25 15:09 maciejgoscinski-latrace

Hello gentlemen, little update on our side.

  1. Reproduction: if you have different ideas how we could extract some useful information from the logs to help you investigate, please let us know. Is there a particular textual string we could expect to see in the plain Function Logs, to indicate the moment when the extension was reconfigured with different params?

  2. Ongoing support: Current situation is significantly worse than 1 month ago - the combined efforts from the Support teams were mostly ineffective and directed at fighting the bureaucracy / technicalities such as cross-collaboration between various departments, closing/reopening different ticket numbers / getting approvals from the higher-ups. During this time, they failed to perform another payment date extension that they promised to do in 24h. As a result, our project was suspended during the weekend, necessitating an emergency intervention, and we were forced to pay to restore it. Now the Support is telling us that (part of the) adjustment has already been made, but that doesn't stop Google from charging money from us, i.e. 5000€ they charged today, for a total of over 11000€ already. That complicates our situation because now we have to ask for a physical refund instead of just a charge cancellation. The topic of restoring the lost 22000 € platform credits hasn't been addressed at all yet.

Honestly it paints a very grim picture, and with a single case like this taking over a month to solve with Support (expecting it to take even more now, as it got more complex with time), I would strongly advise you to take down the extension from the market, prevent new installations, and issue warnings to existing users, before it happens to anyone else.

maciejgoscinski-latrace avatar Sep 18 '25 11:09 maciejgoscinski-latrace

Hi @maciejgoscinski-latrace - I will discuss this with the team today. We could remove the backfill feature without removing the extension entirely.

It's bizarre as we still haven't been able to reproduce the issue, even with 600k+ images.

cabljac avatar Sep 23 '25 15:09 cabljac

As for reproduction - I will put together some queries you can try and run to extract more info on the deployments.

cabljac avatar Sep 23 '25 15:09 cabljac

@cabljac thanks a lot. We'll do our best to give you as much information as you need, to help you reproduce and understand the cause. If you'd like, we could be available for a call, with a live session & direct access to our cloud environment.

maciejgoscinski-latrace avatar Sep 23 '25 15:09 maciejgoscinski-latrace

@maciejgoscinski-latrace sorry that you're experiencing this. We engineers have no control over the financial aspect but will do our best to identify what is the underlying issue. I'll talk to support about setting up a live session to drill into your logs. In the meantime, any thing that helps us reproduce the issue is greatly appreciated.

i14h avatar Sep 23 '25 15:09 i14h

Hello @i14h . You can find the latest answer from Support (they're really dropping the ball on this one) below. At this point it looks like they're trying to tell us that the charge was justified because we misconfigured the extension. If they have any useful technical information regarding the investigation, they haven't yet shared it with you or with us. We're still waiting for them to point HOW exactly we misconfigured the extension, in a way that resulted with 1000x overpriced spending. I find it hard to believe they actually do know what happened, but here's what they said:

Hello,
Thank you for your patience as we work to clarify the details of your account and the recent charges.
We want to ensure you have a complete picture. 
Our approach has been guided by the information we received from your Account Team Manager, 
who stated early on that if the usage was determined to be a misconfiguration on the customer's end, an adjustment would not be permitted. This is why we focused on the technical team's investigation. 
Their final report confirmed that the charges were a result of your configuration, not a system issue.
We then immediately shared this finding with your Account Team Manager and with you directly (by sending the technical team's full response).
We were then informed by your Account Team Manager that the technical team had already processed an adjustment for you. Based on this, and the fact that you had the investigation details, we believed your request for an adjustment had been resolved.
We sincerely apologize if our actions led to a misunderstanding regarding the status of your account.
Regarding the payment extension, our team was waiting for the necessary approval from your Account Team Manager before we could proceed.
We understand that navigating these details can be frustrating.
Our commitment remains to provide you with all the information regarding the cause of the charges, the available adjustment methods (including the goodwill adjustment we offered), and the status of your billing.
There appears to have been a miscommunication as information moved between teams.
To resolve your remaining inquiries effectively, I suggest we focus on one clear item at a time.
Please reply to this email with your outstanding question, and I will personally manage this ticket until it is fully resolved. I look forward to your response.
Warm Regards, Gab

maciejgoscinski-latrace avatar Sep 24 '25 12:09 maciejgoscinski-latrace