reader
Consider working around feedparser's issues
feedparser has some issues I would like to solve for reader:

- it uses too much memory
  - this comment shows it's possible to do similar work with about 1/3 of the memory feedparser uses
  - this gist shows part of the issue comes from the entire feed being read into memory and converted back and forth a few times to detect the encoding and update the XML declaration
  - feedparser issues: https://github.com/kurtmckee/feedparser/issues/209, https://github.com/kurtmckee/feedparser/issues/287
- Atom summary/content is order-dependent (#262)
  - there's a pull request waiting to be merged
  - I found an alternative solution (see comment below)
  - feedparser issue: https://github.com/kurtmckee/feedparser/issues/59
- you can't use defusedxml or lxml without monkeypatching (details in #212)
  - feedparser issue: https://github.com/kurtmckee/feedparser/issues/107
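On the encoding point: detecting the declared encoding shouldn't require reading the whole feed, since a BOM or XML declaration sits in the first few hundred bytes. A minimal sketch of prefix-based detection (sniff_xml_encoding is a hypothetical helper of mine, not feedparser API, and it ignores HTTP headers and other sources feedparser also consults):

```python
import re

# Matches encoding="..." inside an XML declaration; bytes pattern, ASCII only.
XML_DECL_RE = re.compile(rb'<\?xml[^>]*\bencoding=["\']([A-Za-z0-9._-]+)["\']')

def sniff_xml_encoding(prefix: bytes, default: str = "utf-8") -> str:
    """Guess the encoding from a bounded prefix of the document."""
    # A BOM takes precedence over the XML declaration.
    if prefix.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if prefix.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    match = XML_DECL_RE.search(prefix[:1024])
    if match:
        return match.group(1).decode("ascii")
    return default

print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><rss/>'))
```

Only the first kilobyte or so needs to be buffered; the rest of the feed can be streamed.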
Can we work around them by re-implementing parse()?
https://gist.github.com/lemon24/10ae478fafb8fc1cb091f04e0ceec03f
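The memory fix comes down to incremental parsing. As a standalone illustration (the names here are mine, not the gist's), the stdlib SAX parser can be fed the document in chunks, so only one chunk is in memory at a time:

```python
import io
import xml.sax
import xml.sax.handler

class TitleCollector(xml.sax.handler.ContentHandler):
    """Toy handler that collects the text of <title> elements."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self.titles.append("")

    def endElement(self, name):
        if name == "title":
            self._in_title = False

    def characters(self, content):
        # characters() may fire multiple times per element; accumulate.
        if self._in_title:
            self.titles[-1] += content

def parse_in_chunks(fileobj, handler, chunk_size=2**16):
    # Incremental parsing: the whole document never needs to be in memory.
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    parser.close()

feed = b"<feed><title>one</title><entry><title>two</title></entry></feed>"
handler = TitleCollector()
parse_in_chunks(io.BytesIO(feed), handler, chunk_size=8)
print(handler.titles)  # ['one', 'two']
```

A tiny chunk size is used above just to show that tags split across chunk boundaries are handled fine by the underlying expat parser.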
Done so far:
- got rid of memory issues
- fixed the summary/content order bug
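For context on the order bug: it goes away if summary and content are collected independently of where they appear in the document, instead of the handler depending on which element was seen last. A toy sketch of the idea (not feedparser's actual fix; the tuples stand in for parsed XML):

```python
def collect(elements):
    """elements: (tag, text) pairs in document order."""
    result = {"summary": None, "content": []}
    for tag, text in elements:
        if tag == "summary":
            # keep the first summary, regardless of where it appears
            if result["summary"] is None:
                result["summary"] = text
        elif tag == "content":
            result["content"].append(text)
    return result

# The same values come out whether summary precedes content or follows it.
a = collect([("summary", "s"), ("content", "c")])
b = collect([("content", "c"), ("summary", "s")])
assert a == b == {"summary": "s", "content": ["c"]}
```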
Using an alternate SAX parser is just a matter of exposing the global list as an argument to parse(), so it doesn't need a proof of concept.
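That change is the standard "global to argument" refactor; roughly (names are illustrative, not feedparser's actual signature):

```python
# Module-level default, like feedparser's global parser list.
PREFERRED_PARSERS = ["expat"]

def parse(source, preferred_parsers=None):
    # Callers can override per call instead of monkeypatching the module.
    parsers = PREFERRED_PARSERS if preferred_parsers is None else preferred_parsers
    return f"parsed {source!r} with {parsers[0]}"

print(parse("feed.xml"))                              # parsed 'feed.xml' with expat
print(parse("feed.xml", preferred_parsers=["lxml"]))  # parsed 'feed.xml' with lxml
```

Besides avoiding monkeypatching, this keeps concurrent callers from stepping on each other via shared module state.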
The plan:
- fork feedparser (temporarily), and incorporate all 3 changes
- cut pull requests for the changes, and a meta-issue explaining them (the encoding stuff, especially)
- vendor the fork with reader until the changes get merged upstream
I initially wanted to wrap feedparser (like in the gist above), and maybe publish that on PyPI. However, forking is wiser, because it increases the likelihood of actually getting the changes upstream, even if it happens in 6-12 months.
feedparser meta-issue: https://github.com/kurtmckee/feedparser/issues/296
feedparser PR for reducing memory usage: https://github.com/kurtmckee/feedparser/pull/302
maxrss for update_feeds() (1 worker), before/after ea64a42, on my database (~160 feeds), with all the feeds stale:
>>> def fn(before, after, base=0):
...     return (1 - (after-base) / (before-base)) * 100
...
>>> # 2013 MBP, macOS Catalina, Python 3.10.0
>>> fn(76.8, 62.6)
18.489583333333325
>>> fn(76.8, 62.6, 28.8)
29.58333333333334
>>> # t4g.nano instance, Ubuntu 20.04, Python 3.8.10
>>> fn(76.5, 57.9)
24.31372549019608
>>> fn(76.5, 57.9, 29.75)
39.7860962566845
Same, with all the feeds up-to-date:
>>> # 2013 MBP, macOS Catalina, Python 3.10.0
>>> fn(70.0, 55.8)
20.285714285714285
>>> fn(70.0, 55.8, 28.9)
34.54987834549878
>>> # t4g.nano instance, Ubuntu 20.04, Python 3.8.10
>>> fn(66.3, 52.3)
21.11613876319759
>>> fn(66.3, 52.3, 30.0)
38.56749311294766
Memory measurements obtained with this script:
import sys, time, resource

def get_maxrss():
    return (
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        / 2 ** (20 if sys.platform == 'darwin' else 10)
    )

print(f"before import: {get_maxrss():.1f}")

from reader import make_reader

print(f"before make_reader(): {get_maxrss():.1f}")

reader = make_reader("db.sqlite")

print(f"before update_feeds(): {get_maxrss():.1f}")

start = time.perf_counter()
reader.update_feeds()
end = time.perf_counter()
timing = end - start

print(f"after update_feeds(): {get_maxrss():.1f} (took {timing:.1f}s)")