apache-ultimate-bad-bot-blocker
Allow robots.txt and perhaps other URIs to remain accessible
I just started having a look at your project and it really looks good. The one thing I am missing is that certain resources (such as /robots.txt) should still be accessible.
e.g. Ahrefs (https://ahrefs.com/robot) honours robots.txt, but is blocked by the globalblacklist. It would be ideal if certain resources on the VirtualHost (such as /robots.txt) were still allowed for such bots.
Hi @magicdude4eva, thanks for the feedback. Some feature changes in progress will allow you to whitelist bots that are listed in the bad bots section and override them (see https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/issues/34), so hang tight as these changes are in progress.
Sounds great. I still think, though, that it would be a good idea to allow bots to access certain resources such as /robots.txt.
BTW: Good to see a fellow SA on Github (does not happen often)
Thanks @magicdude4eva and yes also great to see another fellow South African on here.
I will see what logic I can work out regarding robots.txt. I must first get all my Travis scripts online so that the repo becomes self-generating, then I can work on some mods to the templates. Once the Travis CI build scripts are in place I will also be pushing out two versions: one for Apache 2.2 using the old access control methods, and one for Apache 2.4 using the new Apache 2.4 access control methods, which won't require mod_access_compat anymore, as per https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/issues/32
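For readers comparing the two, the difference between the access control styles is roughly this (a sketch only; `bad_bot` is a placeholder environment variable name, not necessarily the one the blocker actually sets):

```apache
# Apache 2.2 style -- works on 2.4 only with mod_access_compat loaded:
<Location "/">
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Location>

# Apache 2.4 style -- native Require directives, no compat module:
<Location "/">
    <RequireAll>
        Require all granted
        Require not env bad_bot
    </RequireAll>
</Location>
```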
Thanks for the input and feedback 👍 Lots of changes still coming ...... one step at a time 😀
@mitchellkrogza With Apache 2.4.26 you can use <If> directives. CentOS 7 (7.3.1611) unfortunately only ships with Apache/2.4.25 (2 March 2017), so the <If> approach is a no-go for now.
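For reference, once on a new enough 2.4.x, an <If> expression could exempt robots.txt along these lines (an untested sketch; whether every directive inside globalblacklist.conf is actually permitted in <If> context would need verifying):

```apache
# Untested sketch: skip the blocker for /robots.txt using an
# ap_expr regex match; everything else still gets the blacklist.
<If "%{REQUEST_URI} !~ m#^/robots\.txt$#">
    Include /home/apache/botblocking/globalblacklist.conf
</If>
```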
The only workaround I managed to find is a negative LocationMatch, which applies the bot filter to all resources except robots.txt - this works fine:
#########
# Block all web bots - we are returning 403s
<LocationMatch "^/(?!robots\.txt)">
Include /home/apache/botblocking/globalblacklist.conf
</LocationMatch>
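As a quick sanity check of that pattern: <LocationMatch> uses PCRE, and Python's `re` handles this negative lookahead the same way. Note the dot should be escaped (`robots\.txt`), since an unescaped `.` matches any character, and the pattern also exempts any path that merely *starts* with `/robots.txt`:

```python
import re

# The LocationMatch pattern with the dot escaped; it matches (i.e. the
# bot filter applies to) every URI except those starting /robots.txt.
pattern = re.compile(r"^/(?!robots\.txt)")

assert pattern.match("/robots.txt") is None      # exempt from the blocker
assert pattern.match("/index.html") is not None  # still filtered
assert pattern.match("/robots.txt.bak") is None  # caveat: also exempt
```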
@magicdude4eva weird, I've tested the blocker on versions of Apache from 2.2 through 2.4.27 ??? Which version are you using, Apache_2.2 or Apache_2.4?
I'd have to see how I can implement first allowing anything to access robots.txt (as per your example) and then moving into other sections of the blocker.
I'm busy with a lot of documentation updates right now after bringing all the Travis CI generator and testing scripts online. It requires a lot of doc changes due to the two distinctly different versions: Apache_2.2 (for 2.2 > 2.4+ but needs mod_access_compat) and Apache_2.4 (no access_compat needed).
Once done with that I am going to start working on V4, which is going to be a lot different and have quite a different layout, with a switch file where users can enable and disable certain parts of the blocker. So they could turn OFF checking for user-agents but keep ON checking for referrers. That will be a breaking change, so it will probably be released in a new branch until it's tested 100%, but that's only coming in a few weeks' time.
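One possible shape for such a switch file (purely hypothetical; V4's actual layout, names, and paths are not decided here) would be Apache's Define/IfDefine mechanism:

```apache
# Hypothetical switch file: comment a Define out to disable that check.
Define BLOCKER_CHECK_USER_AGENTS
# Define BLOCKER_CHECK_REFERRERS

<IfDefine BLOCKER_CHECK_USER_AGENTS>
    Include /etc/apache2/custom.d/bad-user-agents.conf
</IfDefine>
<IfDefine BLOCKER_CHECK_REFERRERS>
    Include /etc/apache2/custom.d/bad-referrers.conf
</IfDefine>
```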
Travis is very strict with testing and both versions pass all tests. Unfortunately it is a lot of work right now to make Travis check each version against multiple versions of Apache (right now I am making Travis use 2.4.27).
Not impossible, but it requires a lot of scripting in the build process: install a version > test > uninstall > install another version > test > uninstall ... etc. Certainly something to consider putting on the list of things to do at some point.
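That build matrix could be sketched as a simple shell loop (the install/test script names below are placeholders, not scripts that exist in this repo):

```shell
# Placeholder loop: each iteration would build one Apache release,
# run the blocker's test suite against it, then clean up.
for version in 2.2.34 2.4.25 2.4.27; do
    echo "testing globalblacklist.conf against Apache $version"
    # ./travis/install-apache.sh "$version"   # hypothetical install step
    # ./travis/run-tests.sh                   # hypothetical test step
    # ./travis/uninstall-apache.sh "$version" # hypothetical cleanup
done
```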