
Does blocking facebookexternalhit also break sharing to social media?

njt1982 opened this issue

https://github.com/ai-robots-txt/ai.robots.txt/blob/6b8d7f5890d6bed722a95297996c054c210bd3b8/robots.txt#L33

If we block this in robots.txt, will it break the case where a URL from the site is shared to Facebook and Facebook sends this bot to fetch the Open Graph data (title, image, etc.) for the post?

Ideally I'd like to block the bot from crawling/DoS'ing the site, but still allow on-demand (or cached) page requests for OG data when a post is shared. Facebook does not need to crawl an entire site! :)
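For reference, a robots.txt block of this bot looks like the following. Note that robots.txt can only allow or disallow by user agent and path; it has no way to distinguish an on-demand preview fetch from bulk crawling:

```text
User-agent: facebookexternalhit
Disallow: /
```

A non-standard Crawl-delay directive is sometimes suggested for rate-limiting instead of a full block, but support for it varies by crawler and is not guaranteed here.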

njt1982 avatar Sep 23 '24 18:09 njt1982

It will, per Meta's docs:

The primary purpose of FacebookExternalHit is to crawl the content of an app or website that was shared on one of Meta’s family of apps, such as Facebook, Instagram, or Messenger.

I'm ok blocking Meta's products generally, but I'd weigh how important that functionality is against the traffic that you're seeing from that crawler.

cdransf avatar Sep 28 '24 21:09 cdransf

Facebook sends that bot to get the Open Graph data for things like title and image for the post

Although I don't use Facebook, I'd be surprised if sharing a link on Facebook required the crawler to run. Crawlers typically run on their own schedule rather than synchronously with a user action such as creating a post. Have you experimented: monitor your website's access logs for that crawler, post on Facebook, and see whether a fetch occurs before the preview appears?
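The monitoring step could be sketched as a small script that counts requests from this user agent in an access log. This is a minimal sketch; the log lines below are made-up samples in the common combined format, and real deployments would read the actual server log:

```python
# Sketch: count facebookexternalhit requests in an access log around the
# time of a share. The log content below is an illustrative sample.
sample_log = """\
203.0.113.5 - - [29/Sep/2024:08:00:01 +0000] "GET /post/1 HTTP/1.1" 200 512 "-" "facebookexternalhit/1.1"
198.51.100.7 - - [29/Sep/2024:08:00:02 +0000] "GET /post/2 HTTP/1.1" 200 512 "-" "Mozilla/5.0"
"""

def count_fb_hits(log_text: str) -> int:
    """Count requests whose user agent mentions facebookexternalhit."""
    return sum("facebookexternalhit" in line.lower()
               for line in log_text.splitlines())

print(count_fb_hits(sample_log))  # → 1
```

Running this before and shortly after sharing a post would show whether the fetch is triggered by the share or happens on the crawler's own schedule.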

glyn avatar Sep 29 '24 08:09 glyn

I've noticed that blocking facebookexternalhit prevents rich links (cards) from displaying in Apple Messages and Apple Mail (iOS and macOS). The change takes effect immediately: blocking facebookexternalhit immediately stops rich links from being generated, and unblocking immediately restores them.

paulrudy avatar Oct 14 '24 02:10 paulrudy

facebookexternalhit isn't in our list of AI crawlers, so isn't this discussion a bit moot?

glyn avatar Oct 16 '24 10:10 glyn

facebookexternalhit isn't in our list of AI crawlers, so isn't this discussion a bit moot?

Is this not it? https://github.com/ai-robots-txt/ai.robots.txt/blob/b1491d269460ca57581c2df7cf14b3f3fc4749f3/robots.txt#L36

(if not, what's that list for?)

njt1982 avatar Oct 16 '24 10:10 njt1982

Ah yes, you're right. I made the mistake of thinking the list was in alphabetical order. Apologies.


glyn avatar Oct 16 '24 10:10 glyn

Ah yes, you're right. I made the mistake of thinking the list was in alphabetical order. Apologies.

It's probably worth alphabetising them; as the list grows, duplicates are more likely.

Could be a github pre-commit command that sorts / uniques the list? 🤷🏻‍♂️
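A sketch of what such a check might do, in Python rather than a shell one-liner; the robots.txt content below is a made-up sample, and a real hook would read the repository's actual file:

```python
# Sketch of a pre-commit style check: flag case-insensitive duplicate
# User-agent lines in robots.txt (the content below is a sample).
sample = """\
User-agent: GPTBot
User-agent: facebookexternalhit
User-agent: FacebookExternalHit
Disallow: /
"""

def duplicate_agents(robots_text: str) -> list[str]:
    """Return user agents that appear more than once, ignoring case."""
    seen, dupes = set(), []
    for line in robots_text.splitlines():
        if line.lower().startswith("user-agent:"):
            agent = line.split(":", 1)[1].strip().lower()
            if agent in seen:
                dupes.append(agent)
            seen.add(agent)
    return dupes

print(duplicate_agents(sample))  # → ['facebookexternalhit']
```

A hook could fail the commit whenever this returns a non-empty list, and sorting the agents on the way would keep the list alphabetised too.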

EDIT: I realise this is massively off topic, though.

Back to the thread point... isn't recommending blocking this bot a risky move as it could cause websites to lose rich social media embedding (eg image and other OpenGraph data)?

njt1982 avatar Oct 16 '24 16:10 njt1982

I would say the priority of this project is to block AI crawlers. It is not clear that facebookexternalhit gathers data for AI training, but we don't know that it doesn't either. I would personally vote to keep this in the list. The possible downsides seem negligible from my perspective, and any website that really can't afford to block that user agent doesn't have to.

glyn avatar Oct 17 '24 08:10 glyn

This particular one is tricky, as it is a very aggressive crawler... but, unlike some of the others, a lot of our customers would likely notice if their website suddenly stopped displaying article image cards.

Maybe the solution here is a comment above it describing what it does and what the risks are?

Some of these will have a much lower risk profile... But on the flip side, entries like this one (and potentially the Google ones, too) might have unintended impact on site SEO and social exposure for those simply copy-pasting a list into their site to try to stop these bots from taking down the server.
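Such an annotated entry might look like the following sketch; the wording is illustrative and summarises the risks reported in this thread:

```text
# facebookexternalhit: fetches Open Graph data for link previews on
# Facebook/Instagram/Messenger. Blocking it can break rich link cards
# (including, reportedly, Apple Messages/Mail previews), but it has
# also been observed crawling sites aggressively.
User-agent: facebookexternalhit
Disallow: /
```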

njt1982 avatar Oct 17 '24 09:10 njt1982

Maybe the solution here is a comment above it describing what it does and what the risks are?

I like the sound of this. Would it be possible to demonstrate that these risks are real and not imaginary?

(I submitted a PR to extend the FAQ to take into account your "taking down the server" point - thanks!)

glyn avatar Oct 17 '24 11:10 glyn

https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/

mhattersley avatar Nov 18 '24 17:11 mhattersley

That link was helpful, thanks.

Also Dark Visitors classifies this crawler as a "fetcher" which we currently exclude.

PR raised to fix this issue.

glyn avatar Nov 19 '24 03:11 glyn

https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/

Basing our decision on FB's own statement is assuming they're being honest about the purpose of this crawler, which they are not.

As I mentioned in the original PR adding this line (#22), we saw very aggressive crawling from this UA around the time Meta launched its AI offerings.

So it looks like a case of Meta intentionally repurposing an existing UA to create a situation where anyone blocking the UA's AI use also gets punished by having to block the UA's legitimate use as shared by the OP.

You could say this is, to date, the most passive-aggressive attack by a company on robots.txt. We should probably take a stand against this behaviour and include the UA, but with a comment.

nisbet-hubbard avatar Dec 06 '24 23:12 nisbet-hubbard

[Image: traffic screenshot]

They're abusing this bot for other purposes. My website isn't shared that much.

jperezr21 avatar Jun 02 '25 19:06 jperezr21

@njt1982 @jperezr21 thanks for sharing that info.

And we should probably take a stand against this behaviour and include the UA but with a comment.

I agree. Perhaps we should submit a new PR? I'd do it but I'm not sure what to add for the JSON keys "respect", "frequency", and "description". Here's what I've put in my own file:

{
  "facebookexternalhit": {
    "operator": "Meta/Facebook",
    "respect": "[Yes](https://developers.facebook.com/docs/sharing/bot/)",
    "function": "Ostensibly for sharing content, but likely used as AI crawler as well",
    "frequency": "Unclear at this time.",
    "description": "\"The primary purpose of FacebookExternalHit is to crawl the content of an app or website that was shared on one of Meta’s family of apps, such as Facebook, Instagram, or Messenger.\" However, see discussion at https://github.com/ai-robots-txt/ai.robots.txt/issues/40#issuecomment-2932183222"
  }
}

paulrudy avatar Jun 02 '25 21:06 paulrudy

Perhaps we should submit a new PR?

By all means. I noted in #21 that

We also observed that it didn’t double-check robots.txt before starting another round of binge crawling.

So perhaps No for respect?

nisbet-hubbard avatar Jun 07 '25 00:06 nisbet-hubbard

Perhaps we should submit a new PR?

By all means. I noted in #21 that

We also observed that it didn’t double-check robots.txt before starting another round of binge crawling.

So perhaps No for respect?

This issue is closed. Please raise a new issue or PR to capture any missing changes.

glyn avatar Jun 07 '25 06:06 glyn

New PR: #154

paulrudy avatar Jun 15 '25 23:06 paulrudy

Generating a preview for social media just needs the <head> section, so you can just stop sending the page once you're done with that section.

You can likely hack up your app to early return when you have enough for the parts...

Or you can hang up when answering from cache if u nasty

guest20 avatar Jun 16 '25 01:06 guest20

@guest20 thanks for the idea. It inspired me to serve an empty body to the bot for pages that already have OpenGraph metadata.
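A minimal sketch of that approach: when the request comes from facebookexternalhit, serve only a stub containing the Open Graph tags instead of the full page. The function and the page strings are illustrative, not the actual implementation:

```python
# Hypothetical sketch: give Meta's fetcher just enough for a link
# preview (the OG tags), and everyone else the full page.
OG_STUB = (
    "<html><head>"
    '<meta property="og:title" content="Example post">'
    '<meta property="og:image" content="https://example.com/img.png">'
    "</head><body></body></html>"
)

FULL_PAGE = (
    "<html><head>...</head>"
    "<body>expensive rendered content</body></html>"
)

def body_for(user_agent: str) -> str:
    """Return the OG-only stub for facebookexternalhit requests."""
    if "facebookexternalhit" in user_agent.lower():
        return OG_STUB
    return FULL_PAGE

print(body_for("facebookexternalhit/1.1") == OG_STUB)  # → True
```

In a real app this branch would sit in the request handler, keyed on the User-Agent header, and would only apply to pages whose OG metadata is already available.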

paulrudy avatar Jun 16 '25 23:06 paulrudy