Some websites don't have feeds
Examples:
- https://python-patterns.guide/
- https://danluu.com/ (for #239)
It should be relatively easy to have a retriever/parser pair that handles URLs like (newlines added for clarity):
```
magic+
http://example.com/page.html?
    magic-entries=<entries anchor CSS selector>&
    magic-content=<content CSS selector>
```
to mean:

- retrieve http://example.com/page.html
- for every link that matches the *entries anchor CSS selector*:
  - create an entry from the element that matches the *content CSS selector* (a rough sketch follows below)
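A minimal sketch of what such a retriever/parser pair might do, using requests and BeautifulSoup purely for illustration (`parse_magic_url` and `get_magic_entries` are hypothetical names, not part of reader):

```python
from urllib.parse import parse_qs, urljoin, urlsplit, urlunsplit

import bs4       # assumption: any CSS-selector-capable HTML parser would do
import requests  # assumption: the real retriever would reuse reader's session


def parse_magic_url(url):
    """Split a magic+http://... URL into the real URL and the two selectors."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    params = parse_qs(query)
    entries_selector = params.pop('magic-entries')[0]
    content_selector = params.pop('magic-content', [None])[0]
    # Rebuild the URL without the magic+ prefix and the magic-* parameters.
    rest = '&'.join(f'{k}={v[0]}' for k, v in params.items())
    real_url = urlunsplit((scheme[len('magic+'):], netloc, path, rest, fragment))
    return real_url, entries_selector, content_selector


def get_magic_entries(url):
    real_url, entries_selector, content_selector = parse_magic_url(url)
    soup = bs4.BeautifulSoup(requests.get(real_url).text, 'html.parser')
    # Every matching link becomes an entry; its content is the element
    # matching the content selector on the linked page.
    for anchor in soup.select(entries_selector):
        entry_url = urljoin(real_url, anchor['href'])
        entry_page = bs4.BeautifulSoup(requests.get(entry_url).text, 'html.parser')
        content = entry_page.select_one(content_selector) if content_selector else None
        yield entry_url, anchor.get_text(strip=True), str(content) if content else None
```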
Instead of magic-content, we could also use some library that guesses what the content is (there must be some out there).
In its best form, this should also cover the functionality of the sqlite_releases plugin. Of note is that magic-content wouldn't work here, since there's no container for the whole content; also, some of the old versions don't actually have a link.
This will also be a good test of the internal retriever/parser API we implemented in #205.
Open questions:

- what content extraction library do we use?
  - https://github.com/adbar/trafilatura
    - https://trafilatura.readthedocs.io/en/latest/evaluation.html
    - https://github.com/scrapinghub/article-extraction-benchmark
  - https://github.com/alan-turing-institute/ReadabiliPy, https://github.com/buriy/python-readability
  - https://newspaper.readthedocs.io/en/latest/
  - https://github.com/goose3/goose3
  - after a quick look at the above, only python-readability preserves the spans used for code highlighting (we want to preserve as much HTML as possible); it also seems to have a sanitization feature (related to #157)
- how do we handle published/updated times?
  - https://github.com/adbar/htmldate
- what happens if the website gets a feed?
  - change_feed_url()?
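For reference, a rough sketch of how the two candidates flagged above would be used, assuming python-readability's `Document` and htmldate's `find_date` APIs (the URL is just a placeholder):

```python
import requests
from htmldate import find_date      # pip install htmldate
from readability import Document    # pip install readability-lxml

html = requests.get('https://example.com/article.html').text

# python-readability: guess the main content, keeping most of the HTML.
doc = Document(html)
title = doc.short_title()
content = doc.summary(html_partial=True)  # sanitized HTML fragment

# htmldate: guess the published/updated date from the markup.
published = find_date(html)  # e.g. '2021-01-01', or None if it can't tell
```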
Some thoughts about how to implement this in the parser:
If there are multiple things to be retrieved, we can't return them as a single file object; also, we may need to fabricate "composite" caching headers. I see two options:
- do all of the parsing in the retriever, and return (feed, entries) directly, bypassing the parser (I like this one)
- make the result.http_etag etc. a property that raises an exception if accessed before result.file is actually parsed; seems hacky; get_parser_by_url() would need to support more than exact matching
The first one would look something like this:
```python
# RetrieveResult is renamed to FileResult, and in its place there's a union.
# RetrieverType continues to return ContextManager[Optional[RetrieveResult]].
RetrieveResult = Union[FileResult, ParsedFeed]


# class Parser:

def __call__(self, url, http_etag, http_last_modified):
    parser = self.get_parser_by_url(url)
    ...
    # Must be able to match schemes like magic+http://.
    # Note that prefix match is not enough,
    # magic+file.txt == file:///magic+file.txt;
    # normalizing the URL beforehand could work.
    retriever = self.get_retriever(url)
    with retriever(url, http_etag, http_last_modified, ...) as result:
        if not result:
            return None

        # Parsing already done, return the result (this is new).
        if isinstance(result, ParsedFeed):
            return result

        # Continue with the old logic.
        if not parser:
            ...
        feed, entries = parser(url, result.file, result.headers)
        return ParsedFeed(feed, entries, result.http_etag, result.http_last_modified)
```
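To illustrate the scheme-matching comment above: one way get_retriever() could avoid the prefix-match pitfall is to compare the parsed scheme instead of the raw string (a sketch using stdlib urlsplit, not reader's actual matching logic):

```python
from urllib.parse import urlsplit

def is_magic_url(url):
    # "magic+file.txt" has no scheme (it's a relative path, later normalized
    # to file:///magic+file.txt), so it's correctly rejected here; a raw
    # url.startswith('magic+') check would have accepted it.
    return urlsplit(url).scheme.startswith('magic+')

assert is_magic_url('magic+http://example.com/page.html?magic-entries=.post')
assert not is_magic_url('magic+file.txt')
```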