dspace-angular

Avoid excess load of bots going into search facet links on entity pages

Open bram-atmire opened this issue 7 months ago • 6 comments

Describe the bug We're seeing in Search Console for several of our clients that bots follow facet links on entity pages. Since crawling these links doesn't improve indexing quality (bots shouldn't be going there) and processing these requests is resource intensive, we should prevent this behaviour altogether.

To Reproduce Steps to reproduce the behavior:

  1. Look at search console for an actively indexed DSpace 7 site, that has entities enabled
  2. In the reports of crawled URLs, look for patterns like:

entities/orgunit/25913818-6714-4be5-89a6-f70c8facdf7e?f.author=Wang

Expected behavior Robots should be blocked from doing this

Proposed solution Add the following Disallow directive to robots.txt:

Disallow: /entities/*?f
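To illustrate how a crawler that follows Google's documented robots.txt wildcard rules would likely evaluate this directive, here is a minimal TypeScript sketch. It assumes `*` matches any run of characters, a trailing `$` anchors the end of the URL, and matching is a prefix match against the path plus query string. The function names `robotsPatternToRegExp` and `isDisallowed` are hypothetical, and this is not a full robots.txt parser.

```typescript
// Sketch of Google-style robots.txt pattern matching (an assumption, not a
// complete implementation): `*` matches any characters, a trailing `$`
// anchors the end of the URL, everything else is a literal prefix match.
function robotsPatternToRegExp(pattern: string): RegExp {
  const anchored = pattern.endsWith("$");
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const escaped = body
    .split("*")
    // Escape regex metacharacters in each literal segment.
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + escaped + (anchored ? "$" : ""));
}

// Returns true when the path (including query string) matches the pattern.
function isDisallowed(pattern: string, pathAndQuery: string): boolean {
  return robotsPatternToRegExp(pattern).test(pathAndQuery);
}

const rule = "/entities/*?f";
console.log(isDisallowed(rule, "/entities/orgunit/25913818-6714-4be5-89a6-f70c8facdf7e?f.author=Wang"));
console.log(isDisallowed(rule, "/entities/orgunit/25913818-6714-4be5-89a6-f70c8facdf7e"));
```

Under these assumptions the rule only matches when the facet parameter comes first in the query string. A URL such as `/entities/orgunit/<uuid>?page=2&f.author=Wang` contains `&f` rather than `?f` and would not be matched.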

Related work

Previously incorrectly created in the back-end Git repo as https://github.com/DSpace/DSpace/issues/9227

bram-atmire avatar Dec 12 '23 10:12 bram-atmire

Thanks @bram-atmire! I can imagine this is a huge load (like crawling search and browse as well) and an obvious win for bots that respect robots.txt. I'm wondering if Google's interpretation of the robot exclusion protocol supports wildcards such as this after path elements. It seems maybe? Have you tried it on a live site?

As a sysadmin I'd block these patterns in Apache / nginx just to be sure—as the Russian saying goes: "trust, but verify".
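As a rough illustration of the server-side check, here is a sketch written as a plain TypeScript predicate rather than actual Apache or nginx configuration. Everything in it is an assumption: the user-agent substrings, the facet URL pattern, and the name `shouldRejectRequest` are hypothetical, and matching on user agent only catches bots that identify themselves.

```typescript
// Hypothetical user-agent substrings; a real deployment would maintain its own list.
const BOT_AGENT_PATTERN = /bot|crawler|spider/i;

// Matches entity pages whose query string carries a facet parameter
// (f.<name>=...), whether or not it is the first parameter.
const ENTITY_FACET_PATTERN = /^\/entities\/[^?]*\?(?:.*&)?f\.[^=&]+=/;

// Returns true when a self-identified bot requests a faceted entity URL.
function shouldRejectRequest(pathAndQuery: string, userAgent: string): boolean {
  return BOT_AGENT_PATTERN.test(userAgent) && ENTITY_FACET_PATTERN.test(pathAndQuery);
}

const googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
console.log(shouldRejectRequest("/entities/orgunit/25913818-6714-4be5-89a6-f70c8facdf7e?f.author=Wang", googlebot));
```

Regular visitors can still use the facets because the check only applies when the user agent looks like a bot.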


Side note: we have several patterns with trailing wildcards that will be ignored by Googlebot.

alanorth avatar Jan 12 '24 05:01 alanorth

@alanorth As far as I can see, as long as the wildcard isn't trailing, it shouldn't be ignored.

The change in this ticket came up in an email dialogue with a representative from Google Scholar.

One site where we have it in prod: https://repository.upenn.edu/robots.txt

bram-atmire avatar Feb 05 '24 11:02 bram-atmire

Wouldn't it be useful to (additionally) add the rel="nofollow" attribute to the anchor tags in the search filters? This way we don't have to rely on how crawlers handle wildcards.

hutattedonmyarm avatar Mar 05 '24 12:03 hutattedonmyarm

@hutattedonmyarm if we use rel="nofollow" on search pages it would be a sign for bots to not crawl them, but they still have to load the page to read the anchor tags. In theory the robots.txt method should be better because bots can read it before they request any page.

alanorth avatar Mar 06 '24 05:03 alanorth

@alanorth Not the whole page, I was only talking about the links in search-filters.component, i.e. the checkboxes that check/uncheck the filters in the search results sidebar. These are implemented as links. Currently, crawlers follow them because they're part of an entity page, but they only lead to search results.

hutattedonmyarm avatar Mar 06 '24 06:03 hutattedonmyarm

@hutattedonmyarm oh yes, I was confusing the rel=nofollow with other robot instructions in head meta tags. I think you are right that we should make those links rel=nofollow.
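To make the suggestion concrete, here is a minimal TypeScript sketch of the markup a filter link would end up with. It is not the actual search-filters.component code: `renderFacetLink` and `escapeHtml` are hypothetical helpers that only show where the rel="nofollow" attribute would go.

```typescript
// Escape characters that are unsafe inside HTML text or attribute values.
function escapeHtml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: renders one facet option as an anchor carrying
// rel="nofollow", hinting to crawlers that the link should not be followed.
function renderFacetLink(href: string, label: string): string {
  return `<a href="${escapeHtml(href)}" rel="nofollow">${escapeHtml(label)}</a>`;
}

console.log(renderFacetLink("/entities/orgunit/25913818-6714-4be5-89a6-f70c8facdf7e?f.author=Wang", "Wang"));
```

Crawlers treat rel="nofollow" as a hint rather than a strict rule, so it would complement the robots.txt directive rather than replace it.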

alanorth avatar Mar 07 '24 06:03 alanorth