
Auto-Expire or Report Option for AMP Cached Pages from 404 Origin URLs

Open BellaAnasastasya opened this issue 7 months ago • 20 comments

Description

Currently, AMP cached pages on cdn.ampproject.org continue to appear in Google SERPs even when the origin page is completely inaccessible (i.e., returns a 404 Not Found or 410 Gone). This creates major potential for abuse, especially in cases involving:

  • Hacked domains (e.g., .gov or .edu sites)
  • Blackhat SEO using AMP for parasite hosting
  • Expired or abandoned domains serving stale AMP content

The AMP cache still serves the page even after the source is long gone — creating the illusion of valid content and polluting search results.

🚀 Feature Request:

  1. Auto-refresh or invalidate AMP cache when the origin consistently returns 404/410 on fetch.
  2. Add a “Report this AMP page” button on the AMP cache viewer page (e.g., https://xxx.cdn.ampproject.org/c/s/domain.com/path) for pages whose origin no longer exists.
  3. Include a mechanism to revalidate stale cache from a source that no longer meets the serving criteria (expired cert, 404 origin, etc).

Alternatives Considered

  • Manually reporting AMP cache pages via Google Search “Feedback” system (not scalable, inconsistent results, and lacks visibility).
  • Waiting for AMP cache expiration passively, which may take weeks or months — often long after the origin has disappeared.
  • Using the “Remove Outdated Content” tool from Google Search Console, which requires user-side manual effort and is not always reliable for mass abuse cases.

Additional Context

🕵️ Example of Abuse Case:

  • AMP Cache URL: https://kasih--profit-pages-dev.cdn.ampproject.org/c/s/kasih-profit.pages.dev/judi%20slot%20microstar88/
  • Original Source: https://ppdb.man2kotabjm.sch.id/resources/?hkm=judi%20slot%20microstar88 → This returns 404 (confirmed).

These stale AMP cache entries are being widely exploited for blackhat SEO manipulation, especially via hacked government and educational websites. This undermines both the credibility of AMP as a technology and Google's index integrity.

A feature like automatic invalidation or a user-accessible report button would help significantly mitigate this abuse.

BellaAnasastasya avatar Jun 10 '25 08:06 BellaAnasastasya

thanks for the report @BellaAnasastasya. starting to investigate

erwinmombay avatar Jun 10 '25 17:06 erwinmombay

@BellaAnasastasya could you clarify your example:

AMP Cache URL: https://kasih--profit-pages-dev.cdn.ampproject.org/c/s/kasih-profit.pages.dev/judi%20slot%20microstar88/

Original Source: https://ppdb.man2kotabjm.sch.id/resources/?hkm=judi%20slot%20microstar88

The original source for the AMP Cache URL you provided should be kasih-profit.pages.dev/judi%20slot%20microstar88, and it is still up. The AMP Cache URL you provided and the original source you listed have completely different domains.
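
To make the mapping concrete, here is a minimal sketch of how the origin can be read back out of a cache URL. It assumes the documented `/c/s/<origin host>/<path>` layout, where `s` marks an HTTPS origin. The function names are hypothetical, and `simple_cache_subdomain` covers only the plain ASCII case: it ignores the IDN handling and the hash fallback that the real cache uses for unusual domains.

```python
from urllib.parse import urlsplit

CACHE_SUFFIX = ".cdn.ampproject.org"


def origin_from_cache_url(cache_url: str) -> str:
    """Recover the origin URL embedded in an AMP Cache URL (illustrative helper)."""
    parts = urlsplit(cache_url)
    host = parts.hostname or ""
    if not host.endswith(CACHE_SUFFIX):
        raise ValueError(f"not an AMP Cache host: {host!r}")

    # Path layout: /<content type>/[s/]<origin host>/<origin path>
    segments = parts.path.lstrip("/").split("/")
    rest = segments[1:]  # drop the content type, e.g. "c" for an AMP document
    scheme = "http"
    if rest and rest[0] == "s":  # "s" means the origin is fetched over HTTPS
        scheme = "https"
        rest = rest[1:]
    if not rest or not rest[0]:
        raise ValueError("cache URL does not contain an origin host")

    origin = f"{scheme}://{'/'.join(rest)}"
    if parts.query:
        origin += "?" + parts.query
    return origin


def simple_cache_subdomain(domain: str) -> str:
    """ASCII-only approximation: '-' becomes '--', then '.' becomes '-'."""
    return domain.replace("-", "--").replace(".", "-")


cache_url = (
    "https://kasih--profit-pages-dev.cdn.ampproject.org"
    "/c/s/kasih-profit.pages.dev/judi%20slot%20microstar88/"
)
print(origin_from_cache_url(cache_url))
print(simple_cache_subdomain("kasih-profit.pages.dev"))
```

Both values point at kasih-profit.pages.dev. Nothing in the cache URL refers to the .sch.id page, which is why the two URLs in the report look unrelated.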

erwinmombay avatar Jun 10 '25 20:06 erwinmombay

Thanks for the response and follow-up. Let me clarify the situation — this is a bit more nuanced than a typical AMP cache case.

The AMP Cache URL: https://kasih--profit-pages-dev.cdn.ampproject.org/c/s/kasih-profit.pages.dev/judi%20slot%20microstar88/

did not originate from kasih-profit.pages.dev directly.

Instead, it originated from a hijacked .sch.id domain: https://ppdb.man2kotabjm.sch.id/resources/?hkm=judi%20slot%20microstar88 (which is now 404)

At the time this .sch.id page was active, it included a <link rel="amphtml" href="https://kasih-profit.pages.dev/..."> tag in its HTML, which told Google (and AMP crawlers) that the AMP version of that page was hosted on kasih-profit.pages.dev.

That AMP URL (hosted on a different domain) was then cached by AMP Cache and still appears in Google mobile SERPs — even though the source .sch.id page is now 404.
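
As a rough illustration of the discovery step described above, the sketch below extracts the rel=amphtml link from a page and checks whether it points at a different host. It uses only Python's standard library. The class and function names and the example.* URLs are hypothetical, and this is not how Google's crawler is actually implemented.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urlsplit


class AmpLinkFinder(HTMLParser):
    """Records the href of the first <link rel="amphtml"> seen in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.amphtml_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.amphtml_href is not None:
            return
        attr = dict(attrs)
        # rel is a space-separated token list, so split before comparing.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "amphtml" in rel_tokens:
            self.amphtml_href = attr.get("href")


def find_amphtml_href(html: str) -> Optional[str]:
    finder = AmpLinkFinder()
    finder.feed(html)
    return finder.amphtml_href


def is_cross_domain(page_url: str, amp_href: str) -> bool:
    """True when the AMP document lives on a different host than the page."""
    return urlsplit(page_url).hostname != urlsplit(amp_href).hostname


page_url = "https://hijacked.example.org/resources/?q=spam"  # hypothetical
page_html = (
    '<html><head><link rel="amphtml" '
    'href="https://attacker.example.net/spam/"></head></html>'
)
href = find_amphtml_href(page_html)
print(href, is_cross_domain(page_url, href))
```

In the abuse pattern, the cross-domain check returns True while the page exists. Once the page is deleted there is nothing left to parse, yet the AMP document on the other domain stays cached.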


Visual Evidence:

If you search in Google mobile:

slot site:ppdb.man2kotabjm.sch.id/resources/?hkm= or slot site:ppdb.man2kotabjm.sch.id

Image

You’ll see that the .sch.id page shows up with AMP format, pointing to the cached version: google.com/amp/s/kasih-profit.pages.dev/...


Why this matters:

This is a blackhat parasite-hosting trick:

  • Bad actors hijack a trusted .sch.id domain,
  • Inject spam content + a rel=amphtml pointing to their own domain (e.g., kasih-profit.pages.dev)
  • Google sees it, crawls it, and caches the AMP version,
  • Then they delete the original .sch.id page — but AMP Cache still serves the content

So technically, the cached AMP page still works, but its original context is fraudulent, and the origin (where the AMP link came from) is now gone.

Image


What we need:

If an AMP cache is based on a now-deleted origin (even if the AMP page is still live), there should be a way to:

  1. Invalidate the cache (because it’s detached from the original source context)
  2. Prevent these hijacked AMP pages from continuing to rank in Google Search

Important Notes:

Kindly, we’d like to emphasize that this request is not something that can be solved by simply pointing us to the “Remove Outdated Content” tool. This is a systemic issue — and trying to remove each case manually is not scalable, especially when there are hundreds or even thousands of URLs using this abuse method across many compromised domains.

Also, in countries like Indonesia, many of these government or educational institutions do not have the resources or urgency to respond quickly — so relying on them to fix or delist the content is unfortunately unrealistic.

We hope AMP can support a more automated and proactive solution for this problem, especially since the abuse is clearly growing and exploiting AMP’s own architecture.

Let us know if further clarification or technical logs are needed — we’re happy to assist.

BellaAnasastasya avatar Jun 11 '25 01:06 BellaAnasastasya

Thanks @BellaAnasastasya for the explanation. We're triaging this and will look into it. I might have a few more follow-up questions.

erwinmombay avatar Jun 18 '25 02:06 erwinmombay

Thank you, sir @erwinmombay.

BellaAnasastasya avatar Jun 18 '25 05:06 BellaAnasastasya

Hi, I’m interested in working on this. Can I take it?

Chrysler1211 avatar Jun 20 '25 16:06 Chrysler1211

Hello sir @erwinmombay, is there any update? Thank you.

BellaAnasastasya avatar Jun 28 '25 12:06 BellaAnasastasya

@BellaAnasastasya I'm still doing my research so I can propose a solution.

One thing I wanted to confirm, though: if you search for slot site:ppdb.man2kotabjm.sch.id/resources/?hkm= or slot site:ppdb.man2kotabjm.sch.id, are you still seeing the result associated with the AMP document? Currently, when I make this search, I just get directed to https://ppdb.man2kotabjm.sch.id/plugins/?ez=JANDA+SLOT+LOGIN, which is a 404 page. I'm going to need this to prove to our PMs that this is an actual issue that needs to be escalated. I appreciate your help.

erwinmombay avatar Jun 30 '25 18:06 erwinmombay

I will give you an example from another URL in a video, sir.

site:batuankaler.desa.id / bosku777 site:batuankaler.desa.id

https://github.com/user-attachments/assets/936e0f22-0e3f-46e1-b39a-160911246a3a

I hope this video helps, sir @erwinmombay. Thanks.

BellaAnasastasya avatar Jul 01 '25 01:07 BellaAnasastasya

@BellaAnasastasya much appreciated, will try and reproduce the video

erwinmombay avatar Jul 01 '25 19:07 erwinmombay

Okay, thanks @erwinmombay. One last thing I forgot to mention: the AMP result also appears when we search for "bosku777" on its own and click the URL I mentioned before (batuankaler.desa.id). I used the site: operator in the query only to make the example easy to reproduce. Thanks, sir.

BellaAnasastasya avatar Jul 02 '25 01:07 BellaAnasastasya

Upon deeper inspection, especially after reviewing the AMP runtime file /src/document-fetcher.js, it's clear that AMP has no persistent handling of 404/410 origin errors, nor any system that re-checks whether the original page still includes the <link rel="amphtml">.

I completely understand that constant auto-refetching would be resource-intensive and unscalable. So instead of that, I’d like to propose a middle-ground solution that improves AMP cache integrity without harming performance.


✅ Updated Proposal

Core Idea: If the origin page no longer contains <link rel="amphtml"> pointing to the AMP version, and/or the page returns a 404 or 410 consistently, then AMP Cache should offer a mechanism to "refresh" or "invalidate" the AMP cache — even without requiring Search Console access.

Implementation Details:

  1. Add a "Report Stale AMP" or "Refresh AMP Cache" button on AMP viewer pages (e.g., cdn.ampproject.org URLs).
  2. When clicked, the AMP system:
    • Checks whether the original page still contains the rel=amphtml tag.
    • Optionally retries after a 1–2 day grace period to avoid false positives.
  3. If after that grace period:
    • The origin still returns 404/410 or
    • The AMP reference (rel=amphtml) is no longer present,
      → Then AMP Cache should auto-expire or invalidate the cached AMP document.
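
The steps above can be sketched as a small decision function. This is only an illustration of the proposed logic, not existing AMP Cache behavior: the names `OriginCheck`, `Action`, and `decide` are hypothetical, the two-day value is taken from the upper end of the suggested grace period, and treating non-404/410 errors as inconclusive is an assumption based on the retry idea for temporarily unreachable origins.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

GRACE_PERIOD = timedelta(days=2)  # assumed: upper end of the "1-2 day" window


class Action(Enum):
    KEEP = "keep"
    RECHECK_LATER = "recheck_later"
    INVALIDATE = "invalidate"


@dataclass
class OriginCheck:
    checked_at: datetime
    status: int             # HTTP status of the page that carried rel=amphtml
    has_amphtml_link: bool  # whether that page still links to the AMP document


def is_stale_signal(check: OriginCheck) -> Optional[bool]:
    """True = stale, False = healthy, None = inconclusive (e.g. 5xx)."""
    if check.status in (404, 410):
        return True
    if check.status == 200:
        return not check.has_amphtml_link
    return None  # temporary failure: the page content could not be inspected


def decide(first_stale_at: Optional[datetime], latest: OriginCheck) -> Action:
    """Apply the grace period before invalidating a reported AMP document."""
    stale = is_stale_signal(latest)
    if stale is None:
        return Action.RECHECK_LATER
    if not stale:
        return Action.KEEP
    if first_stale_at is None:
        return Action.RECHECK_LATER  # first stale observation starts the clock
    if latest.checked_at - first_stale_at < GRACE_PERIOD:
        return Action.RECHECK_LATER
    return Action.INVALIDATE


t0 = datetime(2025, 7, 2)
print(decide(None, OriginCheck(t0, 404, False)))
print(decide(t0, OriginCheck(t0 + timedelta(days=3), 404, False)))
```

The caller is expected to store the time of the first stale observation and pass it back in on later checks, so a single failed fetch never invalidates anything on its own.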

🚨 Important Clarification Regarding Security & Abuse Scenarios

This proposed system must not rely on Google Search Console ownership cancellation, unlike the current “Remove Outdated Content” process — because in many blackhat SEO cases:

  • The attacker has control of the GSC (Search Console) account due to domain hijacking or site compromise.
  • If the system allows cancellation from GSC, the attacker could simply reject all cleanup attempts.
  • Meanwhile, the legitimate domain owner may have lost access, and the cached AMP content continues to rank, enabling abuse.

That’s why this mechanism should treat the absence of <link rel="amphtml"> as a hard signal that the origin no longer supports or acknowledges the AMP page — and should not be reversible by GSC in this case.

Of course, if the origin still contains <link rel="amphtml"> but is temporarily unreachable (e.g., DDoS or hosting issue), then the AMP Cache can wait or retry before acting — since the origin clearly still intends to support AMP.


Why this matters

  • Avoids blind dependence on Search Console (which can be weaponized in abuse cases)
  • Uses a technical signal (rel=amphtml link presence + HTTP status) to determine legitimacy
  • Balances scalability with abuse prevention
  • Aligns with how “Remove Outdated Content” works, but more abuse-resilient

Happy to provide mockups or draft the revalidation logic if needed.

Thanks again for your time and support!

BellaAnasastasya avatar Jul 02 '25 04:07 BellaAnasastasya

@BellaAnasastasya thanks for that! I'll integrate this with my internal documentation. I still need to make a proposal and go through review, so please bear with me.

erwinmombay avatar Jul 07 '25 21:07 erwinmombay

Hello sir @erwinmombay, any update sir?

BellaAnasastasya avatar Aug 23 '25 03:08 BellaAnasastasya

@BellaAnasastasya I'm still conducting research. I'll find time to prioritize this.

erwinmombay avatar Aug 25 '25 17:08 erwinmombay

thanks sir

BellaAnasastasya avatar Aug 26 '25 00:08 BellaAnasastasya

Just updating this for visibility: I am looking into the expiration logic right now, and into the feasibility of some of the proposals.

erwinmombay avatar Sep 12 '25 01:09 erwinmombay

@BellaAnasastasya with the slot site:ppdb.man2kotabjm.sch.id example, do the links 404 for you now? I recognize this is still an issue, but I'm trying to see whether the documents expired, since I was using them as an example.

erwinmombay avatar Sep 12 '25 03:09 erwinmombay

Please wait, sir. I'm looking for a good example that can help the case.

BellaAnasastasya avatar Sep 12 '25 03:09 BellaAnasastasya

Try this one, sir: pinjoltogel site:sipp.pa-girimenang.go.id/search/

Image Image Image

I hope this example helps.

BellaAnasastasya avatar Sep 12 '25 03:09 BellaAnasastasya