[Bug]: check_robots_txt not working

Open mllife opened this issue 10 months ago • 7 comments

crawl4ai version

0.4.248

Expected Behavior

The library should raise an error for https://www.nsf.gov/awardsearch/advancedSearch.jsp, because this path is not allowed to be scraped according to the site's robots.txt (https://www.nsf.gov/robots.txt). Please fix.

Current Behavior

It is able to crawl this page even though I have set check_robots_txt=True in CrawlerRunConfig.
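
For reference, a minimal reproduction sketch along these lines (assuming the AsyncWebCrawler / CrawlerRunConfig usage from the crawl4ai docs; the result handling is illustrative):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # check_robots_txt=True should make the crawler respect robots.txt rules.
    config = CrawlerRunConfig(check_robots_txt=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nsf.gov/awardsearch/advancedSearch.jsp",
            config=config,
        )
        # Expected: the crawl is refused because the path is disallowed.
        # Observed: the page is crawled successfully.
        print(result.success)

asyncio.run(main())
```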

Is this reproducible?

Yes

Inputs Causing the Bug


Steps to Reproduce


Code snippets


OS

macOS

Python version

3.11.9

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

mllife avatar Feb 17 '25 11:02 mllife

@mllife I have encountered the same issue. I'm planning to make a fix and open a pull request tonight, and I will update you when it's done so you can get a quick and timely solution!

flancast90 avatar Feb 17 '25 15:02 flancast90

Oddly enough, this comes from the urllib robotparser side of things. My research directed me here: https://stackoverflow.com/questions/15344253/robotparser-doesnt-seem-to-parse-correctly. But as far as I can see, nothing in Crawl4AI itself is handling things wrong; a urllib function is providing the wrong output. @mllife

For now most likely what I'll do is provide a replacement method within Crawl4AI itself which corrects this.
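
A quick way to confirm that the wrong answer comes from the standard library rather than from Crawl4AI is to ask urllib's RobotFileParser directly (a sketch; the output depends on the robots.txt that nsf.gov serves at the time and on network access):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.nsf.gov/robots.txt")
rp.read()

# The reported path is disallowed by nsf.gov's robots.txt, yet the stdlib
# parser can still report it as fetchable (prints True).
print(rp.can_fetch("*", "https://www.nsf.gov/awardsearch/advancedSearch.jsp"))
```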

flancast90 avatar Feb 17 '25 16:02 flancast90

Further update: it seems this is an open issue in CPython itself, see https://github.com/python/cpython/issues/114310. Since it has been open for over a year without any resolution, I believe it may be best to find a new library to parse the robots.txt file, and I am looking for a suitable one now to circumvent this (ongoing) issue.
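
The limitation can be shown without any network access: urllib's parser does literal prefix matching and has no notion of Google-style '*' wildcards, so a wildcard rule silently fails to match. A self-contained sketch with a made-up robots.txt (not nsf.gov's actual file):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt using a Google-style '*' wildcard.
robots_txt = """\
User-agent: *
Disallow: /awardsearch/*.jsp
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Wildcard semantics say this URL is disallowed, but the stdlib parser only
# does literal prefix matching, so it reports the URL as allowed (True).
print(rp.can_fetch("*", "https://example.com/awardsearch/advancedSearch.jsp"))
```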

flancast90 avatar Feb 17 '25 16:02 flancast90

This issue is now fixed in my development environment, @mllife. I will be putting out a PR shortly, which you will be able to work off in the meantime.

flancast90 avatar Feb 17 '25 17:02 flancast90

@mllife Please see #708 for the fixes; let me know if you have any issues or questions. I'm not sure on the timeline for the PR being merged into main, but for now, working off the PR should be stable enough.

flancast90 avatar Feb 17 '25 18:02 flancast90

@flancast90 Thanks for raising the PR. We'll review it soon.

aravindkarnam avatar Feb 18 '25 10:02 aravindkarnam

@mllife I worked on this issue today. It feels like the problem is twofold for me.

  1. I was having SSL issues, with the aiohttp library throwing the following exception:
ClientConnectorCertificateError(ConnectionKey(host='www.nsf.gov', port=443, is_ssl=True, ssl=True, proxy=None, proxy_auth=None, proxy_headers_hash=None), SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)'))

In this case, when fetching robots.txt failed, can_fetch was simply returning True in the exception handler. This was a problem for me locally, but I imagine it can also become a problem in server environments with certificate configuration issues.

I'm disabling SSL verification for robots.txt requests, since we're only fetching publicly available robots.txt files (no sensitive data involved); this ensures the crawler works reliably across different environments without certificate configuration issues.
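
As an illustration of that approach (not the actual patch; fetch_robots_txt is a made-up helper name), fetching robots.txt with aiohttp and certificate verification disabled might look roughly like this:

```python
import asyncio
import aiohttp

async def fetch_robots_txt(url: str) -> str:
    # ssl=False skips certificate verification for this request only; robots.txt
    # is public, so no sensitive data is exposed by the weaker check.
    async with aiohttp.ClientSession() as session:
        async with session.get(url, ssl=False) as response:
            return await response.text()

print(asyncio.run(fetch_robots_txt("https://www.nsf.gov/robots.txt"))[:200])
```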

  2. The only problem with the standard library is its lack of support for wildcards, so I don't want to ditch it for a third-party library or a full-blown local implementation at this stage. Instead, I'm just monkey-patching it (to handle wildcards only). When I ran my tests it worked as expected, but I need help testing it more broadly. The fix should work for both Allow and Disallow rules. Do give it a try and let me know if you hit any issues.
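
For anyone curious what such a monkey patch can look like, here is a rough sketch (not the code in the PR) that swaps urllib's prefix-only rule matching for one that expands '*' into a regex wildcard. It relies on the internal, undocumented RuleLine class, which stores URL-quoted rule paths:

```python
import re
import urllib.parse
import urllib.robotparser

# RuleLine quotes rule paths, so a literal '*' is normally stored as '%2A'.
_QUOTED_STAR = urllib.parse.quote("*")

def _applies_to_with_wildcards(self, filename):
    # Keep the original prefix semantics, but let '*' match any run of characters.
    pattern = re.escape(self.path)
    pattern = pattern.replace(re.escape(_QUOTED_STAR), ".*").replace(r"\*", ".*")
    return re.match(pattern, filename) is not None

# Both Allow and Disallow lines go through applies_to(), so both honour wildcards.
urllib.robotparser.RuleLine.applies_to = _applies_to_with_wildcards
```

With a patch along these lines, the offline example further up in this thread reports the wildcard-disallowed URL as blocked.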

aravindkarnam avatar May 07 '25 12:05 aravindkarnam