reader
Consider working around feedparser's issues
feedparser has some issues I would like to solve for reader:

- it uses too much memory
  - this comment shows it's possible to do similar work with about 1/3 of the memory feedparser uses
  - this gist shows part of the issue comes from the entire feed being read into memory and converted back and forth a few times to detect the encoding and update the XML declaration
  - feedparser issues: https://github.com/kurtmckee/feedparser/issues/209, https://github.com/kurtmckee/feedparser/issues/287
- Atom summary/content is order-dependent (#262)
  - there's a pull request waiting to be merged
  - I found an alternative solution (see comment below)
  - feedparser issue: https://github.com/kurtmckee/feedparser/issues/59
- you can't use defusedxml or lxml without monkeypatching (details in #212)
  - feedparser issue: https://github.com/kurtmckee/feedparser/issues/107
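On the encoding point: detecting the declared encoding shouldn't require reading the whole feed, since a BOM or XML declaration sits in the first few hundred bytes. A minimal sketch of prefix-based detection (sniff_xml_encoding is a hypothetical helper of mine, not feedparser API, and it ignores HTTP headers and other sources feedparser also consults):

```python
import re

# Matches encoding="..." inside an XML declaration; bytes pattern, ASCII only.
XML_DECL_RE = re.compile(rb'<\?xml[^>]*\bencoding=["\']([A-Za-z0-9._-]+)["\']')

def sniff_xml_encoding(prefix: bytes, default: str = "utf-8") -> str:
    """Guess the encoding from a bounded prefix of the document."""
    # A BOM takes precedence over the XML declaration.
    if prefix.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if prefix.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    match = XML_DECL_RE.search(prefix[:1024])
    if match:
        return match.group(1).decode("ascii")
    return default

print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><rss/>'))
```

Only the first kilobyte or so needs to be buffered; the rest of the feed can be streamed.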
Can we work around them by re-implementing parse()?
https://gist.github.com/lemon24/10ae478fafb8fc1cb091f04e0ceec03f
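The memory fix comes down to incremental parsing. As a standalone illustration (the names here are mine, not the gist's), the stdlib SAX parser can be fed the document in chunks, so only one chunk is in memory at a time:

```python
import io
import xml.sax
import xml.sax.handler

class TitleCollector(xml.sax.handler.ContentHandler):
    """Toy handler that collects the text of <title> elements."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self.titles.append("")

    def endElement(self, name):
        if name == "title":
            self._in_title = False

    def characters(self, content):
        # characters() may fire multiple times per element; accumulate.
        if self._in_title:
            self.titles[-1] += content

def parse_in_chunks(fileobj, handler, chunk_size=2**16):
    # Incremental parsing: the whole document never needs to be in memory.
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    parser.close()

feed = b"<feed><title>one</title><entry><title>two</title></entry></feed>"
handler = TitleCollector()
parse_in_chunks(io.BytesIO(feed), handler, chunk_size=8)
print(handler.titles)  # ['one', 'two']
```

A tiny chunk size is used above just to show that tags split across chunk boundaries are handled fine by the underlying expat parser.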
Done so far:
- got rid of memory issues
- fixed the summary/content order bug
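For context on the order bug: it goes away if summary and content are collected independently of where they appear in the document, instead of the handler depending on which element was seen last. A toy sketch of the idea (not feedparser's actual fix; the tuples stand in for parsed XML):

```python
def collect(elements):
    """elements: (tag, text) pairs in document order."""
    result = {"summary": None, "content": []}
    for tag, text in elements:
        if tag == "summary":
            # keep the first summary, regardless of where it appears
            if result["summary"] is None:
                result["summary"] = text
        elif tag == "content":
            result["content"].append(text)
    return result

# The same values come out whether summary precedes content or follows it.
a = collect([("summary", "s"), ("content", "c")])
b = collect([("content", "c"), ("summary", "s")])
assert a == b == {"summary": "s", "content": ["c"]}
```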
Using an alternate SAX parser is just a matter of exposing the global list as an argument to parse(), so it doesn't need a proof of concept.
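That change is the standard "global to argument" refactor; roughly (names are illustrative, not feedparser's actual signature):

```python
# Module-level default, like feedparser's global parser list.
PREFERRED_PARSERS = ["expat"]

def parse(source, preferred_parsers=None):
    # Callers can override per call instead of monkeypatching the module.
    parsers = PREFERRED_PARSERS if preferred_parsers is None else preferred_parsers
    return f"parsed {source!r} with {parsers[0]}"

print(parse("feed.xml"))                              # parsed 'feed.xml' with expat
print(parse("feed.xml", preferred_parsers=["lxml"]))  # parsed 'feed.xml' with lxml
```

Besides avoiding monkeypatching, this keeps concurrent callers from stepping on each other via shared module state.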
The plan:
- fork feedparser (temporarily), and incorporate all 3 changes
- cut pull requests for the changes, and a meta-issue explaining them (the encoding stuff, especially)
- vendor the fork with reader until the changes get merged upstream
I initially wanted to wrap feedparser (like in the gist above), and maybe publish that on PyPI. However, forking is wiser, because it increases the likelihood of actually getting the changes upstream, even if it happens in 6-12 months.
feedparser meta-issue: https://github.com/kurtmckee/feedparser/issues/296
feedparser PR for reducing memory usage: https://github.com/kurtmckee/feedparser/pull/302
maxrss for update_feeds() (1 worker), before/after ea64a42, on my database (~160 feeds), with all the feeds stale:
>>> def fn(before, after, base=0):
...     return (1 - (after-base) / (before-base)) * 100
...
>>> # 2013 MBP, macOS Catalina, Python 3.10.0
>>> fn(76.8, 62.6)
18.489583333333325
>>> fn(76.8, 62.6, 28.8)
29.58333333333334
>>> # t4g.nano instance, Ubuntu 20.04, Python 3.8.10
>>> fn(76.5, 57.9)
24.31372549019608
>>> fn(76.5, 57.9, 29.75)
39.7860962566845
Same, with all the feeds up-to-date:
>>> # 2013 MBP, macOS Catalina, Python 3.10.0
>>> fn(70.0, 55.8)
20.285714285714285
>>> fn(70.0, 55.8, 28.9)
34.54987834549878
>>> # t4g.nano instance, Ubuntu 20.04, Python 3.8.10
>>> fn(66.3, 52.3)
21.11613876319759
>>> fn(66.3, 52.3, 30.0)
38.56749311294766
Memory measurements obtained with this script:
import sys, time, resource

def get_maxrss():
    return (
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        / 2 ** (20 if sys.platform == 'darwin' else 10)
    )

print(f"before import: {get_maxrss():.1f}")

from reader import make_reader

print(f"before make_reader(): {get_maxrss():.1f}")

reader = make_reader("db.sqlite")

print(f"before update_feeds(): {get_maxrss():.1f}")

start = time.perf_counter()
reader.update_feeds()
end = time.perf_counter()
timing = end - start

print(f"after update_feeds(): {get_maxrss():.1f} (took {timing:.1f}s)")