Some websites don't have feeds
Examples:
- https://python-patterns.guide/
- https://danluu.com/ (for #239)
It should be relatively easy to have a retriever/parser pair that handles URLs like (newlines added for clarity):
```
magic+
http://example.com/page.html?
    magic-entries=<entries anchor CSS selector>&
    magic-content=<content CSS selector>
```
to mean:

- retrieve http://example.com/page.html
- for every link that matches the *entries anchor CSS selector*:
  - create an entry from the element that matches the *content CSS selector* (a rough sketch follows below)
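A minimal sketch of what such a retriever/parser pair might do, using requests and BeautifulSoup purely for illustration (`parse_magic_url` and `get_magic_entries` are hypothetical names, not part of reader):

```python
from urllib.parse import parse_qs, urljoin, urlsplit, urlunsplit

import bs4       # assumption: any CSS-selector-capable HTML parser would do
import requests  # assumption: the real retriever would reuse reader's session


def parse_magic_url(url):
    """Split a magic+http://... URL into the real URL and the two selectors."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    params = parse_qs(query)
    entries_selector = params.pop('magic-entries')[0]
    content_selector = params.pop('magic-content', [None])[0]
    # Rebuild the URL without the magic+ prefix and the magic-* parameters.
    rest = '&'.join(f'{k}={v[0]}' for k, v in params.items())
    real_url = urlunsplit((scheme[len('magic+'):], netloc, path, rest, fragment))
    return real_url, entries_selector, content_selector


def get_magic_entries(url):
    real_url, entries_selector, content_selector = parse_magic_url(url)
    soup = bs4.BeautifulSoup(requests.get(real_url).text, 'html.parser')
    # Every matching link becomes an entry; its content is the element
    # matching the content selector on the linked page.
    for anchor in soup.select(entries_selector):
        entry_url = urljoin(real_url, anchor['href'])
        entry_page = bs4.BeautifulSoup(requests.get(entry_url).text, 'html.parser')
        content = entry_page.select_one(content_selector) if content_selector else None
        yield entry_url, anchor.get_text(strip=True), str(content) if content else None
```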
Instead of magic-content, we could also use some library that guesses what the content is (there must be some out there).
In its best form, this should also cover the functionality of the sqlite_releases plugin. Of note is that magic-content wouldn't work here, since there's no container for the whole content; also, some of the old versions don't actually have a link.
This will also be a good test of the internal retriever/parser API we implemented in #205.
Open questions:

- what content extraction library do we use?
  - https://github.com/adbar/trafilatura
    - https://trafilatura.readthedocs.io/en/latest/evaluation.html
    - https://github.com/scrapinghub/article-extraction-benchmark
  - https://github.com/alan-turing-institute/ReadabiliPy, https://github.com/buriy/python-readability
  - https://newspaper.readthedocs.io/en/latest/
  - https://github.com/goose3/goose3
  - after a quick look at the above, only python-readability preserves the spans used for code highlighting (we want to preserve as much HTML as possible); it also seems to have a sanitization feature (related to #157)
- how do we handle published/updated times?
  - https://github.com/adbar/htmldate
- what happens if the website gets a feed?
  - change_feed_url()?
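For reference, a rough sketch of how the two candidates flagged above would be used, assuming python-readability's `Document` and htmldate's `find_date` APIs (the URL is just a placeholder):

```python
import requests
from htmldate import find_date      # pip install htmldate
from readability import Document    # pip install readability-lxml

html = requests.get('https://example.com/article.html').text

# python-readability: guess the main content, keeping most of the HTML.
doc = Document(html)
title = doc.short_title()
content = doc.summary(html_partial=True)  # sanitized HTML fragment

# htmldate: guess the published/updated date from the markup.
published = find_date(html)  # e.g. '2021-01-01', or None if it can't tell
```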
Some thoughts about how to implement this in the parser:
If there are multiple things to be retrieved, we can't return them as a single file object; also, we may need to fabricate "composite" caching headers. I see two options:
- do all of the parsing in the retriever, and return (feed, entries) directly, bypassing the parser (I like this one)
- make the result.http_etag etc. a property that raises an exception if accessed before result.file is actually parsed; seems hacky; get_parser_by_url() would need to support more than exact matching
The first one would look something like this:
```python
# RetrieveResult is renamed to FileResult, and in its place there's a union.
# RetrieverType continues to return ContextManager[Optional[RetrieveResult]].
RetrieveResult = Union[FileResult, ParsedFeed]


# class Parser:

def __call__(self, url, http_etag, http_last_modified):
    parser = self.get_parser_by_url(url)
    ...
    # Must be able to match schemes like magic+http://.
    # Note that prefix match is not enough,
    # magic+file.txt == file:///magic+file.txt;
    # normalizing the URL beforehand could work.
    retriever = self.get_retriever(url)
    with retriever(url, http_etag, http_last_modified, ...) as result:
        if not result:
            return None

        # Parsing already done, return the result (this is new).
        if isinstance(result, ParsedFeed):
            return result

        # Continue with the old logic.
        if not parser:
            ...
        feed, entries = parser(url, result.file, result.headers)
        return ParsedFeed(feed, entries, result.http_etag, result.http_last_modified)
```
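To illustrate the scheme-matching comment above: one way get_retriever() could avoid the prefix-match pitfall is to compare the parsed scheme instead of the raw string (a sketch using stdlib urlsplit, not reader's actual matching logic):

```python
from urllib.parse import urlsplit

def is_magic_url(url):
    # "magic+file.txt" has no scheme (it's a relative path, later normalized
    # to file:///magic+file.txt), so it's correctly rejected here; a raw
    # url.startswith('magic+') check would have accepted it.
    return urlsplit(url).scheme.startswith('magic+')

assert is_magic_url('magic+http://example.com/page.html?magic-entries=.post')
assert not is_magic_url('magic+file.txt')
```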