lazynlp
Check robot.txt and ai.txt
Hello.
I'm new to open source contribution. I saw your issue #6 and created a robots.py file that might help you.
`read_disallows(url)`: takes a URL and returns a list of compiled pattern objects, one for each Disallow entry in the robots.txt of that URL's base URL.
I've tested it by passing "https://github.com/GrayHat12" to the function. It extracted the base URL "https://github.com" and then read robots.txt with a GET request to "https://github.com/robots.txt".
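Roughly, that first step looks like this (a sketch only; the helper names are illustrative and not necessarily what robots.py uses):

```python
from urllib.parse import urlsplit
from urllib.request import urlopen

def base_url_of(url):
    """Strip path/query/fragment, keeping only scheme://netloc."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"

def fetch_robots_txt(url):
    """GET the robots.txt that governs `url`."""
    with urlopen(base_url_of(url) + "/robots.txt") as resp:
        return resp.read().decode("utf-8", errors="replace")
```

For example, `base_url_of("https://github.com/GrayHat12")` returns `"https://github.com"`.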
Then I used a regex to extract all the disallowed URLs, and converted each one into a regex string that can be matched against any URL with the same base URL (github.com).
For example, one disallowed entry is "/*/stargazers". I converted it to "/[^/]*/stargazers", compiled it into a pattern object, and added it to the disallowed list that the function returns.
Now when you take a URL like "https://github.com/chiphuyen/lazynlp/stargazers", strip the base URL, and compare the remaining path against the pattern "/[^/]*/stargazers", re.search finds a match (on the "/lazynlp/stargazers" part), and you can choose not to crawl it.
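A small check along those lines (the helper is my own illustration, not the PR's code; note that with the segment-limited `[^/]*` wildcard, it is `re.search` over the path, not an anchored `re.match` over the full URL, that catches deeper paths like this one):

```python
import re

def is_disallowed(url, base_url, patterns):
    """Return True if the URL's path hits any disallowed pattern."""
    path = url[len(base_url):]
    return any(p.search(path) for p in patterns)

pattern = re.compile("/[^/]*/stargazers")
blocked = is_disallowed(
    "https://github.com/chiphuyen/lazynlp/stargazers",
    "https://github.com",
    [pattern],
)
# `blocked` is True here, so the crawler should skip this URL.
```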
Hope this was clear enough. I didn't understand the ai.txt part of the issue, though. It would be great if someone could elaborate on that. 🐰
Apologies for any issues with my pull request. I'm new to this and hoping someone will guide me through.