
Adding user agent for input url

Open leesei opened this issue 13 years ago • 3 comments

I'm new to Python and glad to have found this module, which lets me parse webpages. I would like to suggest adding support for spoofing the user agent for HTTP sources. Some webpages return 401 when fetched with urlopen(), e.g. http://www.google.com/patents/US5255452. For now I use a separate Python 2.7 script that fetches the page with a spoofed user agent and dumps the output for html2text:

    import urllib2
    request = urllib2.Request(url="http://www.google.com/patents/US5255452")
    # spoof user agent
    request.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.1 (KHTML, like Gecko)")
    result = urllib2.urlopen(request)
    # write result .read() to file

leesei avatar Aug 08 '12 04:08 leesei

You should be able to use install_opener to do this.
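
A minimal sketch of the install_opener approach, written for Python 3, where urllib2 became urllib.request. The helper name `install_user_agent` is hypothetical, and the sketch only affects code that fetches pages through urllib.request.urlopen(); whether that includes html2text depends on which module it calls.

```python
import urllib.request

USER_AGENT = "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.1 (KHTML, like Gecko)"


def install_user_agent(user_agent):
    """Install a global opener that sends `user_agent` (hypothetical helper)."""
    opener = urllib.request.build_opener()
    # addheaders replaces the default "Python-urllib/x.y" User-Agent header.
    opener.addheaders = [("User-Agent", user_agent)]
    # From here on, urllib.request.urlopen() goes through this opener.
    urllib.request.install_opener(opener)
    return opener


opener = install_user_agent(USER_AGENT)
```

A header set explicitly on a Request object still takes precedence over the opener's default.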

aaronsw avatar Aug 09 '12 20:08 aaronsw

html2text currently uses urllib rather than urllib2, so install_opener has no effect. My quick solution:

    import urllib
    urllib.URLopener.version = 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7A341 Safari/528.16'

    import html2text
    html2text.main()

eadmaster avatar Nov 14 '12 23:11 eadmaster

Thanks both.

Actually, I'm wondering why html2text uses urllib instead of urllib2. Is it for backward compatibility? (I'm not sure when urllib2 was added to Python.) I changed my copy to use urllib2 so that I can specify a timeout for the connection.
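
For illustration, a fetch with both a custom user agent and a connection timeout might look like the sketch below. It uses Python 3's urllib.request (the successor to urllib2); the function names `build_request` and `fetch` and the 10-second default are assumptions, not html2text code.

```python
import urllib.request


def build_request(url, user_agent):
    """Return a Request carrying a custom User-Agent header (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout=10.0):
    """Fetch `url` and return the raw body; give up after `timeout` seconds."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

In Python 2.7 the same shape works with urllib2.Request and urllib2.urlopen(request, timeout=...).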

@aaronsw If I added a user-agent option, would you be willing to merge it into the master branch? ^^
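
One possible shape for such an option, as a sketch only: the `--user-agent` flag and both function names below are hypothetical and do not describe html2text's actual command line.

```python
import argparse


def parse_args(argv):
    """Parse a URL plus an optional, hypothetical --user-agent flag."""
    parser = argparse.ArgumentParser(description="Fetch a URL for conversion.")
    parser.add_argument("url", help="page to fetch")
    parser.add_argument(
        "--user-agent",
        default=None,
        help="User-Agent header to send (default: the HTTP library's own)",
    )
    return parser.parse_args(argv)


def headers_for(args):
    """Build request headers; empty when no user agent was given."""
    return {"User-Agent": args.user_agent} if args.user_agent else {}
```

Leaving the default as None keeps the current behaviour unchanged for anyone who does not pass the flag.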

leesei avatar Jan 10 '13 18:01 leesei