[Bug]: check_robots_txt not working
crawl4ai version
0.4.248
Expected Behavior
The library should return an error for https://www.nsf.gov/awardsearch/advancedSearch.jsp, since that path is disallowed by the site's robots.txt (https://www.nsf.gov/robots.txt). Please fix.
Current Behavior
It is able to scrape this page, even though I have set check_robots_txt=True in CrawlerRunConfig.
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
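A minimal reproduction sketch, assuming the standard AsyncWebCrawler entry point (the exact setup may differ):

```python
# Sketch of a likely reproduction; the exact reporter setup may differ.
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(check_robots_txt=True)  # should enforce robots.txt
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nsf.gov/awardsearch/advancedSearch.jsp",
            config=config,
        )
        # Expected: the crawl is refused because robots.txt disallows this path.
        # Observed: the page is fetched anyway.
        print(result.success)

asyncio.run(main())
```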
OS
macOS
Python version
3.11.9
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response
@mllife I have encountered the same issue, I'm planning on making a fix and a pull request tonight, and will update you when done so you can get a quick and timely solution!
Oddly enough, this comes from the urllib robotparser side of things. My research led me here: https://stackoverflow.com/questions/15344253/robotparser-doesnt-seem-to-parse-correctly. But as far as I can see, nothing in Crawl4AI itself is handling things wrong - a urllib function is providing the wrong output. @mllife
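To illustrate, here is a minimal sketch that reproduces the misparse with the standard library alone, assuming the robots.txt in question uses a wildcard rule such as Disallow: /awardsearch/* (the rule below is a hypothetical stand-in for the real nsf.gov file):

```python
# Demonstrates that urllib.robotparser ignores "*" wildcards in rules.
# The rule below is a hypothetical stand-in for the real nsf.gov robots.txt.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /awardsearch/*",
])

# RobotFileParser does plain prefix matching on the URL-quoted rule, so the
# literal "*" never matches and the URL is reported as allowed.
print(rp.can_fetch("*", "https://www.nsf.gov/awardsearch/advancedSearch.jsp"))  # True
```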
For now, what I'll most likely do is provide a replacement method within Crawl4AI itself that corrects this.
Further update: it seems this is an open issue in CPython itself, see https://github.com/python/cpython/issues/114310. Since it has been open for over a year without any resolution, I believe it may be best to find a different library to parse the robots.txt file, and I'm looking for a suitable one now to circumvent this (ongoing) issue.
This issue is now fixed in my development environment, @mllife. I will be putting out a PR shortly that you can work off of in the meantime.
@mllife Please see #708 for the fixes; let me know if you have any issues or questions. I'm not sure of the timeline for the PR being merged into main, but working off the PR should be stable enough for now.
@flancast90 Thanks for raising PR. We'll review it soon.
@mllife I worked on this issue today. It feels like the problem is twofold for me.
- I was having SSL issues, with the aiohttp library throwing the following exception:
ClientConnectorCertificateError(ConnectionKey(host='www.nsf.gov', port=443, is_ssl=True, ssl=True, proxy=None, proxy_auth=None, proxy_headers_hash=None), SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)'))
In this case, when fetching robots.txt failed, the exception handler simply made can_fetch return True. This was a problem for me, and I suspect it can also become a problem in server environments with certificate configuration issues.
I'm disabling SSL verification for robots.txt requests, as we're only fetching publicly available robots.txt files (no sensitive data involved); this ensures the crawler works reliably across different environments without certificate configuration issues. A sketch of that fetch follows below.
- The only problem with the standard library is its lack of support for wildcards, so I don't want to ditch it for a third-party library or a full-blown local implementation at this stage. Instead, I'm just monkey patching it (to handle wildcards only); a sketch of the idea also follows below. When I ran my tests it worked as expected, but I need help testing it more broadly. The fix should work for both Allow and Disallow rules. Do give it a try and let me know if you hit any issues.
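For the SSL point above, a minimal sketch (not Crawl4AI's actual implementation) of fetching robots.txt with aiohttp while skipping certificate verification:

```python
# Sketch only: fetch a site's robots.txt over aiohttp with certificate
# verification turned off, since the file is public and only its rule
# text is needed. Function name and error handling are illustrative.
import asyncio
import aiohttp

async def fetch_robots_txt(origin: str) -> str:
    robots_url = f"{origin.rstrip('/')}/robots.txt"
    async with aiohttp.ClientSession() as session:
        # ssl=False disables certificate verification for this request only.
        async with session.get(robots_url, ssl=False) as resp:
            return await resp.text() if resp.status == 200 else ""

print(asyncio.run(fetch_robots_txt("https://www.nsf.gov"))[:200])
```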
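And for the wildcard point, a minimal sketch of the monkey patch idea (not the exact patch in the PR): replace RuleLine.applies_to so that "*" wildcards and "$" end anchors in Allow/Disallow rules are interpreted, while keeping the rest of urllib.robotparser unchanged.

```python
# Sketch only: teach urllib.robotparser about "*" wildcards and "$" end
# anchors by swapping out RuleLine.applies_to for a regex-based version.
import re
import urllib.parse
import urllib.robotparser

def _applies_to_with_wildcards(self, filename):
    # RuleLine stores the rule path URL-quoted; undo that before building a regex.
    rule = urllib.parse.unquote(self.path)
    pattern = re.escape(rule).replace(r"\*", ".*").replace(r"\$", "$")
    # Plain rules still behave as prefix matches because re.match anchors at the start.
    return re.match(pattern, urllib.parse.unquote(filename)) is not None

urllib.robotparser.RuleLine.applies_to = _applies_to_with_wildcards

# With the patch in place, the earlier wildcard example is disallowed as expected.
rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /awardsearch/*"])
print(rp.can_fetch("*", "https://www.nsf.gov/awardsearch/advancedSearch.jsp"))  # False
```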