aiohttp and feedparser etag/modified usage

Open likewhoa opened this issue 7 years ago • 8 comments

Is it possible to use aiohttp together with feedparser's etag and modified headers in order to save bandwidth, or is this only supported by feedparser.parse()?

likewhoa avatar Aug 16 '17 19:08 likewhoa

#116

buhtz avatar Dec 24 '17 20:12 buhtz

When using an event-based I/O framework it is not too difficult to make the HTTP request yourself and pass the resulting response to feedparser.parse(). I use Twisted/treq to do this. There are a few tricks, though:

  1. You must synthesize a Content-Location header with the ultimate request URL (i.e., the URL after any redirects), or feedparser may not resolve relative URLs correctly.
  2. You should wrap the response body in a file-like object (StringIO) so that feedparser can't misinterpret it as a URL and try to do blocking I/O.
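The two tricks above can be sketched with aiohttp rather than Twisted/treq. This is a minimal sketch, not the code referred to in this comment: build_parse_args and fetch_and_parse are hypothetical names, the session is assumed to be an aiohttp.ClientSession created by the caller, and lowercasing the header names is a precaution rather than a documented requirement. Bytes are wrapped in io.BytesIO (instead of StringIO) so that feedparser can still do its own encoding detection.

```python
import io


def build_parse_args(body: bytes, headers, final_url: str):
    """Return (file_like, response_headers) to hand to feedparser.parse().

    Hypothetical helper, not part of feedparser or aiohttp.
    """
    # Lowercase the names as a precaution (assumption, see lead-in).
    response_headers = {name.lower(): value for name, value in headers.items()}
    # Trick 1: synthesize Content-Location from the URL after any redirects.
    response_headers["content-location"] = final_url
    # Trick 2: a file-like object cannot be mistaken for a URL.
    return io.BytesIO(body), response_headers


async def fetch_and_parse(session, url: str):
    """Fetch url with an aiohttp.ClientSession and parse the result."""
    # Imported here so the helper above works without feedparser installed.
    import feedparser

    async with session.get(url) as response:
        body = await response.read()
        # response.url is the final URL after aiohttp followed redirects.
        stream, response_headers = build_parse_args(
            body, response.headers, str(response.url)
        )
    # Note: parse() is CPU-bound and blocks the event loop while it runs.
    return feedparser.parse(stream, response_headers=response_headers)
```

Because parse() runs synchronously, a busy application may prefer to run it in an executor.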

Many servers in the wild don't support conditional GET with ETag or Last-Modified headers, so I recommend hashing the feed content and skipping parsing if it hasn't changed. Running feedparser.parse() is pretty expensive.
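A minimal sketch of the hashing idea, assuming the raw response body is available as bytes. FeedChangeDetector is a hypothetical name, SHA-256 is just one reasonable choice of hash, and the digests are kept in memory only (a real application would probably persist them).

```python
import hashlib


class FeedChangeDetector:
    """Remember a digest of each feed body to skip reparsing unchanged feeds.

    Hypothetical helper for illustration, not part of feedparser.
    """

    def __init__(self):
        self._digests = {}  # feed URL -> hex digest of the last body seen

    def has_changed(self, url: str, body: bytes) -> bool:
        """Return True if body differs from the last body seen for url."""
        digest = hashlib.sha256(body).hexdigest()
        if self._digests.get(url) == digest:
            return False
        self._digests[url] = digest
        return True
```

The caller would invoke feedparser.parse() only when has_changed() returns True.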

I've also seen pathological feeds with embedded timestamps that change on every request. They're not too common, so I haven't come up with a scheme to avoid reparsing them.

twm avatar Jan 14 '18 00:01 twm

@twm Why should I use response_headers= in parse()? Do I really need it?

buhtz avatar Mar 12 '19 23:03 buhtz

@noMICROSOFTbuhtz Yes, you do. feedparser needs to inspect the headers as part of determining the feed's encoding, in addition to the Content-Location header's relevance to relative URL resolution I mentioned above. The Content-Type header is also used (and will be particularly relevant once #109, JSON Feed support, is merged).

twm avatar Mar 15 '19 00:03 twm

Ah ok, this is an important note. I would add this to the FAQ. And maybe we should remove the None default value from the response_headers= parameter to make it clear that this is a mandatory parameter. Don't worry about backwards compatibility, because the next official release is far away in the future. ;)

So what we need for the response header is...

  • Content-Location
  • Content-Type
  • anything else?

I ask because I currently use aiohttp, and it is not possible to just pass the aiohttp.ClientResponse object to feedparser.parse(). Feedparser is not prepared for this and will miss some important information, e.g. a 301 status code indicating a redirect.

buhtz avatar Mar 15 '19 09:03 buhtz

@twm are any of those steps necessary if you are using aiohttp to do the requests?

async with session.get(url) as response:
    text = await response.text()
    feed = feedparser.parse(text)

This is what I do at the moment

EDIT: Apologies, I just read "aiohttp" at the top, but the example uses Twisted. If it isn't too much to ask, could you kindly add a short example with asyncio and aiohttp?

slidenerd avatar Jan 02 '20 10:01 slidenerd

Sorry, I'm not familiar with aiohttp.

twm avatar Jan 02 '20 21:01 twm

SOLVED https://stackoverflow.com/questions/61746471/how-to-send-an-etag-or-last-modified-while-making-a-request-with-aiohttp-session

slidenerd avatar Aug 09 '20 10:08 slidenerd
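
For readers who land here, a conditional GET with aiohttp can be sketched as follows. This is a minimal sketch and not necessarily what the linked answer does: conditional_headers and fetch_if_changed are hypothetical names, the session is assumed to be an aiohttp.ClientSession, and the caller is assumed to store the ETag and Last-Modified values between runs. It only saves bandwidth when the server honours these headers.

```python
def conditional_headers(etag=None, last_modified=None) -> dict:
    """Build request headers for a conditional GET (hypothetical helper)."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


async def fetch_if_changed(session, url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None on 304 Not Modified."""
    request_headers = conditional_headers(etag, last_modified)
    async with session.get(url, headers=request_headers) as response:
        if response.status == 304:
            # Unchanged since the last fetch: keep the stored validators.
            return None, etag, last_modified
        response.raise_for_status()
        body = await response.read()
        # Save these validators and send them back on the next request.
        return (
            body,
            response.headers.get("ETag"),
            response.headers.get("Last-Modified"),
        )
```

When the returned body is not None, it can be handed to feedparser.parse() as described earlier in the thread.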