Please respect robots.txt

Open Joshix-1 opened this issue 1 year ago • 6 comments

When crawling a website, robots.txt should be respected.

Joshix-1 avatar Sep 29 '24 16:09 Joshix-1

Sometimes circumventing censorship requires flexibility. It should be an option tho

BradKML avatar Sep 30 '24 03:09 BradKML

> Sometimes circumventing censorship requires flexibility. It should be an option tho

robots.txt is not a means to censor anyone. Censorship means that you're prevented from expressing ideas or other content, and you're not expressing yourself when scraping a website.

If a website blocks a user agent in its robots.txt file, it means that the providers ask you not to scrape their website. Even wget respects this preference (by default), so it's only fair to ask a general-purpose scraping tool to at least do the same by default.

archer-321 avatar Sep 30 '24 13:09 archer-321
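As a rough illustration of what "respecting robots.txt" amounts to in code, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, the URLs, and the `is_allowed` helper are all made up for the example; this is not crawl4ai's API.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one bot is blocked entirely,
# everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/page"),
        ("OtherBot", "https://example.com/page"),
        ("OtherBot", "https://example.com/private/notes"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

A real crawler would fetch the site's own `/robots.txt` (for example with `RobotFileParser.set_url()` and `read()`) instead of using an inline string.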

@Joshix-1 lol. if you don't want people to access things, then don't put them on the internet.

memoryhash avatar Sep 30 '24 17:09 memoryhash

@Joshix-1 This is on the roadmap and it will be configurable.

aravindkarnam avatar Oct 01 '24 05:10 aravindkarnam

@memoryhash there is a difference between open access (telling Disallow to get bent) and spamming server requests (where respecting Crawl-delay is a courtesy), but people mix the latter with the former, and that is very unfortunate. @archer-321, censorship is not just blocking the freedom to express opinions, but also stopping the freedom to archive for historical purposes (look at the Internet Archive). People will throw lawyers at this just to try to memory-hole the public.

BradKML avatar Oct 02 '24 02:10 BradKML
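The distinction between Disallow and Crawl-delay can be sketched as two separate settings. The example below is only an assumption about how a configurable option might look: `crawl_policy` and its `respect_disallow` flag are hypothetical names, not crawl4ai options, and the robots.txt content is invented. It uses `urllib.robotparser` from the Python standard library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with both a Crawl-delay and a Disallow rule.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /archive/
"""


def crawl_policy(robots_txt, user_agent, url, respect_disallow=True):
    """Return (allowed, delay_seconds) for one URL.

    respect_disallow is a hypothetical switch: when False, Disallow rules
    are ignored, but the Crawl-delay value is still reported so the caller
    can throttle its requests.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # crawl_delay() returns None when no delay is declared for this agent.
    delay = parser.crawl_delay(user_agent) or 0
    allowed = parser.can_fetch(user_agent, url) if respect_disallow else True
    return allowed, delay


if __name__ == "__main__":
    url = "https://example.com/archive/2024"
    print(crawl_policy(ROBOTS_TXT, "MyBot", url))
    print(crawl_policy(ROBOTS_TXT, "MyBot", url, respect_disallow=False))
```

Keeping the delay independent of the Disallow switch means a crawler that ignores Disallow can still avoid hammering the server.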

@BradKML There is no point in having soft boundaries in this world. This is the internet. If server operators care about or take issue with things, they can implement rate limits, client fingerprinting, user accounts, and all sorts. It is naive and silly to expect people to abide by unenforceable soft boundaries. And even then, that's all pointless against anyone who actually knows what they are doing; it just gets rid of those doing small-time work and those with less experience and knowledge. Such is life.

memoryhash avatar Oct 02 '24 12:10 memoryhash