Add a User-Agent option for input URLs
I'm new to Python and glad to have found this module for parsing web pages. I would like to suggest adding support for setting a custom (spoofed) User-Agent for HTTP sources. Some pages return 401 when fetched with urlopen(), e.g. http://www.google.com/patents/US5255452. For now I use a separate Python (2.7) script that spoofs the User-Agent and dumps the page for html2text to read:
import urllib2
request = urllib2.Request(url="http://www.google.com/patents/US5255452")
# spoof user agent
request.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.1 (KHTML, like Gecko)")
result = urllib2.urlopen(request)
# write result.read() to a file
You should be able to use install_opener to do this.
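A minimal sketch of what that could look like, assuming the fetching code goes through urllib2.urlopen(). The helper name install_user_agent and the alias urlreq are made up for illustration, and the import fallback is only there so the snippet also runs on Python 3:

```python
try:
    import urllib2 as urlreq  # Python 2.x, as used in this thread
except ImportError:
    import urllib.request as urlreq  # Python 3 equivalent of urllib2

# Same User-Agent string as in the report above
USER_AGENT = "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.1 (KHTML, like Gecko)"


def install_user_agent(user_agent=USER_AGENT):
    """Install a global opener that sends `user_agent` (hypothetical helper)."""
    opener = urlreq.build_opener()
    # Replace the default ("User-agent", "Python-urllib/x.y") header
    opener.addheaders = [("User-Agent", user_agent)]
    # From here on, urlopen() in this module uses the opener
    urlreq.install_opener(opener)
    return opener


opener = install_user_agent()
```

This only changes requests made through urllib2.urlopen() (urllib.request.urlopen() on Python 3); code that fetches pages some other way will not pick it up.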
html2text currently uses urllib rather than urllib2, so urllib2.install_opener has no effect.
My quick solution:
import urllib
urllib.URLopener.version = 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7A341 Safari/528.16'
import html2text
html2text.main()
Thanks both.
Actually, I'm wondering why html2text uses urllib instead of urllib2. Is it for backward compatibility? (I'm not sure when urllib2 was added to Python.) I changed my copy to use urllib2 so that I can specify a timeout for the connection.
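Roughly, the urllib2 variant with a timeout could look like the sketch below. This is not the actual html2text code: build_request and fetch are hypothetical helper names, the 10-second default is an arbitrary choice, and the import fallback is only there so the snippet also runs on Python 3:

```python
try:
    import urllib2 as urlreq  # Python 2.x
except ImportError:
    import urllib.request as urlreq  # Python 3 equivalent of urllib2


def build_request(url, user_agent):
    """Build a request carrying a custom User-Agent header (hypothetical helper)."""
    return urlreq.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout=10):
    """Fetch `url` and return the raw body; `timeout` is in seconds."""
    # The timeout argument of urlopen() is available since Python 2.6
    response = urlreq.urlopen(build_request(url, user_agent), timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()
```

Calling fetch("http://www.google.com/patents/US5255452", "Mozilla/5.0 ...") would then give up after the timeout instead of waiting indefinitely on a stalled connection.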
@aaronsw
If I added a UA option, would you be willing to merge it into the master branch? ^^