
Allow user agents to be customized in robots.txt

Open dhow opened this issue 10 months ago • 4 comments

Summary

The ability to read a text file containing robots.txt customizations, so that those customizations can be backed up or persisted outside the docker container.

Use case

I've been editing the module/Core/src/Action/RobotsAction.php file inside the container because I (and possibly many other people with similar needs) would like to allow Facebook's bot[1], so that when I paste Shlink links they show an article preview. But this broke when I switched to stable-roadrunner (great image btw!) because -- obviously -- I forgot to re-apply my robots.txt customization.

Since this feature would be pretty straightforward (I already know which file outputs the robots.txt content), I was thinking of adding it myself. However, I'm not sure whether externalizing part of robots.txt so users can persist it outside the container is a good idea, so I would like to validate the idea with you before adding this feature.

Thanks for the great work folks btw!

[1] Allowing Facebook's user-agent in robots.txt

User-agent: facebookexternalhit
Disallow: 

dhow avatar Apr 22 '24 11:04 dhow

A related topic has recently been discussed here https://github.com/shlinkio/shlink/discussions/2067, and while I would prefer not to expect people to customize the robots.txt by providing a file, I agree a certain level of customization should be possible.

I mentioned some of the problems and the history of the current implementation here https://github.com/shlinkio/shlink/discussions/2067#discussioncomment-9179521, and I already put together and merged a feature to allow all short URLs to be crawled by default, if desired https://github.com/shlinkio/shlink/pull/2107, which would result in the same behavior you mentioned above, but for any crawler, not just Facebook's specifically.

On top of that, the only missing piece would be to allow you to provide a list of user agents you want to allow, falling back to * if the option is not provided. Something along the lines of ROBOTS_ALLOW_USER_AGENTS=facebookexternalhit,Googlebot.
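
As a rough illustration of that proposal, here is a minimal PHP sketch of how such an option could turn a comma-separated list of user agents into robots.txt output, falling back to * when the option is not provided. This is not Shlink's actual RobotsAction implementation, and the ROBOTS_ALLOW_USER_AGENTS name is just the one proposed above; the final option name and behavior may differ.

<?php

declare(strict_types=1);

// Sketch only: build robots.txt content from a comma-separated list of
// user agents, falling back to "*" when nothing is configured.
function buildRobotsTxt(?string $userAgentsOption): string
{
    $userAgents = $userAgentsOption === null || trim($userAgentsOption) === ''
        ? ['*']
        : array_map('trim', explode(',', $userAgentsOption));

    $lines = [];
    foreach ($userAgents as $userAgent) {
        $lines[] = 'User-agent: ' . $userAgent;
    }

    // An empty Disallow directive means nothing is disallowed for the agents above
    $lines[] = 'Disallow:';

    return implode("\n", $lines) . "\n";
}

// Example: ROBOTS_ALLOW_USER_AGENTS=facebookexternalhit,Googlebot
echo buildRobotsTxt(getenv('ROBOTS_ALLOW_USER_AGENTS') ?: null);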

That said, you can already make your short URLs crawlable, with the limitation that it needs to be done one by one, hence the PR above.
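
For reference, marking an individual short URL as crawlable can be done through Shlink's REST API when editing it. The sketch below assumes the edit endpoint accepts a crawlable flag in its JSON body; the exact path, API version and field names should be double-checked against the API docs, and the base URL, API key and short code are placeholders.

<?php

declare(strict_types=1);

// Sketch: flag one existing short URL as crawlable via the REST API.
$baseUrl = 'https://s.example.com';   // placeholder Shlink instance
$apiKey = 'your-api-key';             // placeholder API key
$shortCode = 'abc123';                // placeholder short code

$ch = curl_init(sprintf('%s/rest/v3/short-urls/%s', $baseUrl, $shortCode));
curl_setopt_array($ch, [
    CURLOPT_CUSTOMREQUEST => 'PATCH',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER => [
        'Content-Type: application/json',
        'X-Api-Key: ' . $apiKey,
    ],
    CURLOPT_POSTFIELDS => json_encode(['crawlable' => true]),
]);

$response = curl_exec($ch);
curl_close($ch);

echo $response . PHP_EOL;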

acelaya avatar Apr 22 '24 12:04 acelaya

Thanks! I'll take a look at https://github.com/shlinkio/shlink/pull/2107 next time!

dhow avatar Apr 23 '24 09:04 dhow

I'm going to re-purpose this issue to specifically allow user agents to be customized in robots.txt. That plus the already existing capabilities around robots.txt should cover most use cases in a more predictable and reproducible way.

Later on, if there's still some missing capability, I'm open to discussing more improvements and features.

acelaya avatar May 13 '24 06:05 acelaya

That's cool @acelaya !! Thank you!!

dhow avatar May 13 '24 08:05 dhow

This feature is now implemented and will be part of Shlink 4.2.

acelaya avatar Jul 06 '24 08:07 acelaya