Investigate adding Apache-level mechanism for rejecting aggressive robot crawling

Open landreev opened this issue 2 years ago • 3 comments

This is a narrow case of the overall "rate limiting" umbrella issue. It would not attempt to throttle overall traffic to the site, or to regulate the rate of requests from "normal" users (that would be better done within the application). This mechanism would be the first line of defense, detecting obvious bot/scripted or otherwise automated crawling (for example, repeated calls from the same IP plowing through the collection page facets without pausing between calls) before the traffic even reaches the application.
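One off-the-shelf, Apache-level option for this kind of blocking (mentioned here only as an illustration, not as a chosen solution) is the third-party mod_evasive module, which returns 403s to IPs that repeat requests faster than a configured rate. A minimal sketch, with all values illustrative and in need of tuning so that legitimate faceted browsing is not blocked:

```apache
# Hypothetical mod_evasive configuration sketch; thresholds are assumptions.
<IfModule mod_evasive24.c>
    DOSHashTableSize    3097
    DOSPageCount        10      # max requests for the same URI per page interval
    DOSPageInterval     1       # page interval, in seconds
    DOSSiteCount        100     # max requests for the whole site per site interval
    DOSSiteInterval     1       # site interval, in seconds
    DOSBlockingPeriod   60      # seconds an offending IP stays blocked with 403s
    DOSLogDir           "/var/log/mod_evasive"
</IfModule>
```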

This would do essentially what we periodically do with custom command-line scripts in our production environment. Third-party tools should be readily available for addressing this common problem.
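As a sketch of the kind of ad-hoc analysis such custom scripts perform (the log contents, filename, and threshold below are all illustrative assumptions): count requests per client IP in an Apache access log and flag IPs whose request count meets a threshold.

```shell
# Build a tiny sample log; the first field of the Apache combined log
# format is the client IP.
printf '%s\n' \
  '203.0.113.9 - - [02/Feb/2023:15:02:00 +0000] "GET /dataverse/root?q=a HTTP/1.1" 200 512' \
  '203.0.113.9 - - [02/Feb/2023:15:02:01 +0000] "GET /dataverse/root?q=b HTTP/1.1" 200 512' \
  '203.0.113.9 - - [02/Feb/2023:15:02:02 +0000] "GET /dataverse/root?q=c HTTP/1.1" 200 512' \
  '198.51.100.7 - - [02/Feb/2023:15:02:03 +0000] "GET /dataset.xhtml HTTP/1.1" 200 512' \
  > sample_access.log

# Count requests per IP and print the IPs at or above the threshold.
THRESHOLD=3
awk '{print $1}' sample_access.log | sort | uniq -c | sort -rn |
  awk -v t="$THRESHOLD" '$1 >= t { print $2, $1 }'
```

In real use the output of a pipeline like this would feed a firewall or Apache deny rule rather than being read by hand.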

landreev avatar Feb 02 '23 15:02 landreev

2024/03/14

  • Currently waiting for input from @stevenwinship

cmbz avatar Mar 14 '24 19:03 cmbz

@cmbz I meant more along the lines of "waiting until we deploy 6.2 - that will include Steven's application-side rate limiting solution - and experiment with it to see if that addresses the problem at hand, thus making an Apache-level solution unnecessary".

landreev avatar Mar 14 '24 22:03 landreev

2024/03/27

  • Status: currently waiting; we will see how the updates made in 6.2 affect performance.

cmbz avatar Mar 27 '24 19:03 cmbz

2024/08/15

  • Assigning to @landreev and placing on hold. We will review again at the next monthly meeting. @landreev will let us know when the work should move forward.

cmbz avatar Aug 16 '24 00:08 cmbz