Add option to cache `ETag`/`Last-Modified`

Open lightpohl opened this issue 3 years ago • 3 comments

Comments taken from #22:

The one extra thing that might make it even better isn't related to logging, but a cache of the RSS with a quick check of the Last-Modified and/or ETag could make subsequent runs much quicker for large feeds (assuming the last run fetched all items it needed). —@calebj

As for caching, it seems like a free optimization to me. If the server provides the headers that indicate cachable content, I don't see why podcast-dl shouldn't take advantage of it. Conversely, if certain values are present for Cache-Control, the client knows that it shouldn't cache anything. It's reasonable to leave it up to the server and to cache what it allows us to, and I don't think doing so changes the category of the program at all. —@calebj

Certainly. I think a good, simple first version could check only for the ETag and save a podcast.cache.json for subsequent runs. I'll spin this conversation out into a separate issue for tracking. —@lightpohl

lightpohl • Apr 10 '21 22:04

One blocker: to support name templating, we need access to the feed data to generate the paths/names, so the cache location can't depend on a template. We could require that the path be specified without templating (and default to the current working directory).

We'd open/save to a JSON file with keys that correspond to the provided URL, and save the ETag and Last-Modified headers there for now.
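
To make the idea concrete, here is a minimal Python sketch of such a cache file. It is not podcast-dl's implementation: the function names and the `etag`/`last_modified` key names are hypothetical, and only the overall shape (a JSON object keyed by feed URL) comes from the description above.

```python
import json
from pathlib import Path


def load_cache(path):
    """Return the cache dict, or an empty dict if the file is missing or corrupt."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def save_cache(path, cache):
    """Write the cache dict back to disk as JSON."""
    Path(path).write_text(json.dumps(cache, indent=2), encoding="utf-8")


def remember(cache, url, response_headers):
    """Store the validators from a response under the feed URL."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in response_headers.items()}
    cache[url] = {
        "etag": lowered.get("etag"),
        "last_modified": lowered.get("last-modified"),
    }


def conditional_headers(cache, url):
    """Build request headers that let the server answer 304 Not Modified."""
    entry = cache.get(url, {})
    headers = {}
    if entry.get("etag"):
        headers["If-None-Match"] = entry["etag"]
    if entry.get("last_modified"):
        headers["If-Modified-Since"] = entry["last_modified"]
    return headers
```

On the next run, the headers from `conditional_headers` would be sent with the feed request; a 304 response means the cached state is still current and the feed body does not need to be downloaded again.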

lightpohl • Apr 11 '21 04:04

It does seem like a chicken and egg problem, doesn't it? The podcast name needs to be known to resolve the default output folder, but there's little point in caching the feed in that folder if the RSS has to be downloaded to know the podcast name.

I think it would make the most sense to go along with user intent and hint at how to accomplish what they want. If the user requests to cache the feed and has not specified an explicit out-dir, cache the feed anyway, but also print a message along the lines of:

NOTICE: The feed has been cached as you requested, but the cache can only be
utilized if you specify the output directory without a template. For this
feed, the actual path is: /path/to/resolved/out-dir
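
A rough Python sketch of that check follows. It assumes that templated output directories can be recognized by a `{{` placeholder marker and that an unspecified out-dir falls back to a templated default; `cache_notice` is a hypothetical helper, not an existing podcast-dl function.

```python
def cache_notice(out_dir, resolved_dir):
    """Return the notice text if the out-dir is templated or unset, else None."""
    # Assumption: template placeholders look like {{name}}.
    if out_dir is None or "{{" in out_dir:
        return (
            "NOTICE: The feed has been cached as you requested, but the cache can only be\n"
            "utilized if you specify the output directory without a template. For this\n"
            f"feed, the actual path is: {resolved_dir}"
        )
    return None
```

With an explicit, template-free path the function returns `None`, so nothing is printed and the cache is usable as-is.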

Also, don't forget the Cache-Control header. If present, its directives (max-age, for example) set a TTL for the cached copy that should be followed regardless of the other headers.
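
As an illustration of the TTL check, here is a small Python sketch. It handles only `max-age`, `no-store`, and `no-cache`, ignores `Expires` and the `Age` header, and uses hypothetical function names; treat it as a simplification of HTTP caching rules rather than a complete implementation.

```python
import time


def parse_cache_control(value):
    """Split a Cache-Control header value into a {directive: argument} dict."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"')
    return directives


def is_fresh(cache_control, fetched_at, now=None):
    """Return True if a copy fetched at `fetched_at` (epoch seconds) is within its TTL."""
    now = time.time() if now is None else now
    directives = parse_cache_control(cache_control or "")
    # no-store forbids caching; no-cache requires revalidation before reuse.
    if "no-store" in directives or "no-cache" in directives:
        return False
    max_age = directives.get("max-age", "")
    if not max_age.isdigit():
        # No usable TTL: fall back to revalidating with ETag/Last-Modified.
        return False
    return (now - fetched_at) < int(max_age)
```

While `is_fresh` returns `True`, the client can skip the network request entirely; once it returns `False`, a conditional request using the stored ETag or Last-Modified value is the next step.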

calebj • Apr 12 '21 14:04

@calebj - Good thoughts. Appreciate it!

lightpohl • Apr 12 '21 19:04

It's been a while! Taking a look at this again, I think it would be easier and a better separation of concerns to use something like curl before running podcast-dl to check if the resource has changed. You can use --etag-save and --etag-compare: https://man7.org/linux/man-pages/man1/curl.1.html
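
For anyone scripting this, a wrapper could look roughly like the Python sketch below. The file name `feed.etag` and the function names are placeholders, the response body is discarded because podcast-dl fetches the feed itself, and the exact behavior of `--etag-save` on a 304 response may differ between curl versions, so check it against the curl you have installed.

```python
import os
import subprocess


def curl_command(feed_url, etag_file):
    """Build a curl invocation that sends the saved ETag and prints the HTTP status."""
    return [
        "curl", "--silent", "--location",
        "--etag-compare", etag_file,   # send If-None-Match from the saved ETag
        "--etag-save", etag_file,      # store the ETag from this response
        "--output", os.devnull,        # discard the body; only the status matters
        "--write-out", "%{http_code}",
        feed_url,
    ]


def feed_changed(status_code):
    """Treat anything other than 304 Not Modified as a possible change."""
    return status_code.strip() != "304"


def run_if_changed(feed_url, etag_file="feed.etag"):
    """Run podcast-dl only when the server does not answer 304."""
    result = subprocess.run(
        curl_command(feed_url, etag_file),
        capture_output=True, text=True, check=True,
    )
    if feed_changed(result.stdout):
        subprocess.run(["npx", "podcast-dl", "--url", feed_url], check=True)
        return True
    return False
```

Calling `run_if_changed("https://example.com/feed.xml")` from a cron job or similar scheduler would then skip the podcast-dl run whenever the feed's ETag is unchanged.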

lightpohl • Jul 09 '23 20:07