
Check robots.txt and ai.txt

Open GrayHat12 opened this issue 3 years ago • 0 comments

Hello. I'm new to open source contribution. I saw your issue #6 and created a `robots.py` file that might help. `read_disallows(url)` takes a URL and returns a list of compiled pattern objects covering all Disallow entries from the robots.txt of the URL's base.

I tested it by passing "https://github.com/GrayHat12" to the function:

1. It extracted the base URL "https://github.com" and fetched robots.txt with a GET request to "https://github.com/robots.txt".
2. It used a regex to extract all disallowed URLs.
3. It converted each disallowed URL into a regex string that can be compared against any URL with the same base (github.com). For example, the disallowed entry "/*/stargazers" becomes "/[^/]*/stargazers", which is compiled into a pattern object and appended to the disallowed list that the function returns.
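A minimal sketch of how such a helper might look. The split into `parse_disallows` plus `read_disallows`, and every name other than `read_disallows`, are my own; this assumes only the standard library and the "*" → "[^/]*" conversion described above:

```python
import re
import urllib.parse
import urllib.request

def parse_disallows(robots_txt):
    """Compile each Disallow rule in a robots.txt body into a regex
    pattern object, converting the "*" wildcard to "[^/]*"."""
    disallowed = []
    for line in robots_txt.splitlines():
        match = re.match(r"Disallow:\s*(\S+)", line)
        if match:
            disallowed.append(re.compile(match.group(1).replace("*", "[^/]*")))
    return disallowed

def read_disallows(url):
    """Fetch robots.txt from the base of `url` (scheme + host) and
    return the compiled Disallow patterns."""
    parts = urllib.parse.urlparse(url)
    base_url = f"{parts.scheme}://{parts.netloc}"
    with urllib.request.urlopen(base_url + "/robots.txt") as resp:
        return parse_disallows(resp.read().decode("utf-8", errors="replace"))
```

Splitting the parsing from the fetching keeps the regex conversion testable without any network access.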

Now when you compare a URL such as "https://github.com/GrayHat12/stargazers" against the pattern "/[^/]*/stargazers", `re.match` will find a match on the path, and you can choose not to crawl it.
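A small illustration of that check (the `is_allowed` helper is hypothetical, not part of the PR). One caveat worth noting: "[^/]*" cannot cross a "/", so a nested path like "/a/b/stargazers" would not match; the robots.txt wildcard "*" is conventionally translated as ".*" if that broader behavior is wanted.

```python
import re

# Patterns as read_disallows would return them (built directly for the demo)
disallowed = [re.compile("/[^/]*/stargazers")]

def is_allowed(path, patterns):
    """True if the URL path matches none of the Disallow patterns."""
    return not any(p.match(path) for p in patterns)

print(is_allowed("/GrayHat12/stargazers", disallowed))  # False: disallowed
print(is_allowed("/GrayHat12/lazynlp", disallowed))     # True: crawlable
```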

Hope this was explanatory enough. I didn't understand the ai.txt part of the issue, though. It would be great if someone could elaborate on that. 🐰

Sorry for any issues with my pull request. I'm new to this and hoping someone can guide me through.

GrayHat12 avatar Nov 11 '20 12:11 GrayHat12