django-robots
Possible to generate incompatible format with Python RobotFileParser
I've been parsing my own robots.txt file with Python and found an interesting compatibility scenario:
If you create multiple Robot records with the same user-agent, the generated blocks are separated by a blank line, which causes Python's RobotFileParser
to ignore the rules in the later blocks when it reads the file. I'm looking at django-robots v3 and Python 3.5. Is this something you'd want to change or document?
https://github.com/python/cpython/blob/3.5/Lib/urllib/robotparser.py
Example robots.txt generated:
User-agent: *
Disallow: /one

User-agent: *
Disallow: /two

Host: example.com
The work-around is simple -- you create a single Robot record with both rules so that robots.txt has no blank line:
User-agent: *
Disallow: /one
Disallow: /two
Host: example.com
To reproduce:
from urllib.robotparser import RobotFileParser

robots = RobotFileParser('http://example.com/robots.txt')
robots.read()
# Returns True even though /two is disallowed in the generated file:
# only the first "User-agent: *" block is taken into account.
robots.can_fetch(useragent='', url='/two')
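If you don't want to stand up a server just to test this, a roughly equivalent check (my sketch, not part of the original report) is to feed the generated text straight to RobotFileParser.parse():

from urllib.robotparser import RobotFileParser

# The problematic output: two blocks for the same user-agent, separated by
# blank lines, as produced when there are two Robot records.
generated = """\
User-agent: *
Disallow: /one

User-agent: *
Disallow: /two

Host: example.com
"""

parser = RobotFileParser()
parser.parse(generated.splitlines())
# Prints True: the parser keeps only the first "User-agent: *" entry,
# so the Disallow: /two rule in the second block is never consulted.
print(parser.can_fetch('*', '/two'))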
It seems that when django-robots generates the robots.txt file, multiple records for the same user agent end up in separate blocks separated by a blank line. This causes compatibility issues with Python's RobotFileParser, which treats a blank line as the end of a record and therefore ignores the rules in the later blocks.
To work around this, you can put both rules in a single Robot record, as you mentioned, so that all Disallow lines end up in one block. Alternatively, you can change how django-robots generates the file so that every directive for a user agent is written on its own line with no blank lines in between. Note that robots.txt does not support several paths on one Disallow line, so each pattern needs its own Disallow. A helper along these lines (get_robots_txt() here is only an illustration of the target output, not necessarily the library's actual API) could look like:
def get_robots_txt(self):
    # One Disallow line per pattern, with no blank lines inside the
    # user-agent block, so RobotFileParser sees every rule.
    lines = ['User-agent: *']
    lines += ['Disallow: {}'.format(pattern) for pattern in ['/one', '/two']]
    lines.append('Host: example.com')
    return '\n'.join(lines)
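As a quick sanity check (again just a sketch, using a standalone version of the helper above rather than anything django-robots actually ships), the reworked output parses the way you'd expect:

from urllib.robotparser import RobotFileParser

def build_robots_txt(patterns, host='example.com'):
    # Standalone equivalent of the helper above, for testing outside Django.
    lines = ['User-agent: *']
    lines += ['Disallow: {}'.format(pattern) for pattern in patterns]
    lines.append('Host: {}'.format(host))
    return '\n'.join(lines)

parser = RobotFileParser()
parser.parse(build_robots_txt(['/one', '/two']).splitlines())
print(parser.can_fetch('*', '/two'))  # False: both rules are now honoured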