bulk-downloader-for-reddit
[FEATURE] Option to prevent re-archiving of already archived material
- [X] I am requesting a feature.
- [X] I am running the latest version of BDfR
- [X] I have read the Opening an issue guide
Description
In automating this process (with BDFR-HTML), a lot of time is spent re-downloading the json files, and it would be nice if those could optionally be skipped when they've already been archived. Do you think it would be possible to add a config option for the archive command that prevents it from re-archiving when the output file is already present, or perhaps to add something like the skip / exclude-id-file options to the archive action? Let me know what you think. Thanks!
Actually, during the process of downloading, bdfr has to download this information to get the metadata of the submissions in every program mode. So, downloading the json files is just a matter of writing the already-fetched data to disk or not; bdfr cannot function without getting that json in the first place.
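For what it's worth, here is a minimal PRAW sketch (not BDFR's actual code; the credentials and submission ID are placeholders) illustrating the point: the metadata arrives the moment any submission attribute is accessed, and writing the json afterwards costs no extra network traffic.

```python
import json

import praw  # the library BDFR uses under the hood

reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="demo")

submission = reddit.submission(id="abc123")  # lazy: no request has happened yet
_ = submission.title                         # first attribute access fetches the metadata

# Serializing what is already in memory; no further calls to Reddit.
data = {k: str(v) for k, v in vars(submission).items() if not k.startswith("_")}
with open(f"{submission.id}.json", "w") as f:
    json.dump(data, f, indent=2)
```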
This is semi-correct. The calls for the regular metadata are unavoidable, yet that metadata is the most static data that gets downloaded (relatively speaking; parts of it do change). Most of the time is spent downloading the comments, which requires a lot of calls to Reddit to fill out the comment trees. However, since we can't determine whether new comments have been added unless we parse the comment trees, there's no real way to shorten the process if you want to keep an accurate picture of Reddit, which is what the archive command is meant to do.
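To make the cost concrete, a hedged PRAW sketch (placeholder credentials and ID again): every MoreComments stub in the tree is resolved with another round trip to Reddit, which is where the bulk of the archive time goes.

```python
import praw

reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="demo")

submission = reddit.submission(id="abc123")   # placeholder ID
submission.comments.replace_more(limit=None)  # one extra API call per MoreComments stub
all_comments = submission.comments.list()     # the fully expanded tree, flattened
print(f"{len(all_comments)} comments fetched")
```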
The easiest way to skip re-downloading a Reddit object would be to add it to the exclusion list.
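As a stopgap, that exclusion list could be generated from the files already on disk. A hedged Python sketch, assuming the archive was written with a file scheme that ends in the post ID (e.g. the default {REDDITOR}_{TITLE}_{POSTID}); adjust the parsing to your own scheme:

```python
from pathlib import Path

archive_dir = Path("archive")  # wherever BDFR wrote its output

# Collect post IDs from existing filenames, assuming the scheme ends in the ID.
ids = sorted({p.stem.rsplit("_", 1)[-1] for p in archive_dir.rglob("*.json")})

Path("already_archived.txt").write_text("\n".join(ids) + "\n")
# The resulting file (one ID per line) can then be passed to the
# downloader's --exclude-id-file option.
```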
Sorry it's taken me so long to get back to this; a bunch of life stuff came up. So, would it be possible to add a way to skip downloading comments, then?
I'm also interested in a feature like this. youtube-dl has --download-archive, which is the same concept.
With the new --comment-context option, the need for this grows more apparent, since the same json gets downloaded multiple times for multiple comments under the same post. Here, simply filtering by post ID would work. Should I open a separate issue for this, or does it belong together with this one?
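For illustration, the filtering being described might look like the hypothetical sketch below (the helper name and the permalink parsing are assumptions, not BDFR code): before fetching, collapse the comment list to one entry per parent post so each post's json is downloaded only once.

```python
def dedupe_by_post(comment_permalinks: list[str]) -> list[str]:
    """Keep one permalink per parent post so its json is fetched only once."""
    seen: set[str] = set()
    unique: list[str] = []
    for link in comment_permalinks:
        # Reddit permalinks look like /r/<sub>/comments/<post_id>/<slug>/<comment_id>/
        post_id = link.split("/comments/")[1].split("/")[0]
        if post_id not in seen:
            seen.add(post_id)
            unique.append(link)
    return unique
```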
Sorry, I seem to have completely overlooked this issue. Adding a flag to prevent comment downloading should be simple. As for filtering the archiver based on ID, that was recently added to the BDFR and is awaiting release as commit f57590cf.