
[FEATURE] Posts after date flag

Open ch8zer opened this issue 2 years ago • 4 comments

  • [X] I am requesting a feature.
  • [X] I am running the latest version of BDfR
  • [X] I have read the Opening an issue guide

Description

It would be nice if there was a flag such as --max-age 2d which indicates to skip posts older than 2 days.

For example, let's say I scrape a subreddit with the posts below, and I run it with --max-age 2d given that today is "2022-01-15":

| Post | Date       | Downloaded |
|------|------------|------------|
| A    | 2022-01-01 | no         |
| B    | 2022-01-12 | no         |
| C    | 2022-01-13 | yes        |
| D    | 2022-01-14 | yes        |
| E    | 2022-01-15 | yes        |

This feature has a large number of use cases. One example: if past downloads are deleted (say, to save disk space), it prevents re-downloading those posts.
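The requested filter could be sketched like this. Note this is a hypothetical illustration, not BDFR code: `parse_max_age` and `is_recent` are invented names, and the BDFR has no `--max-age` flag as of this thread.

```python
from datetime import datetime, timedelta

# Hypothetical sketch of a --max-age filter; none of these names exist in the BDFR.
UNITS = {"d": "days", "h": "hours", "m": "minutes"}

def parse_max_age(spec: str) -> timedelta:
    """Parse a value like '2d' into a timedelta."""
    value, unit = int(spec[:-1]), spec[-1]
    return timedelta(**{UNITS[unit]: value})

def is_recent(post_date: datetime, max_age: timedelta, now: datetime) -> bool:
    """Keep a post only if it is no older than max_age."""
    return now - post_date <= max_age

# Reproducing the table above with today = 2022-01-15 and --max-age 2d:
now = datetime(2022, 1, 15)
cutoff = parse_max_age("2d")
print(is_recent(datetime(2022, 1, 12), cutoff, now))  # False: older than 2 days
print(is_recent(datetime(2022, 1, 13), cutoff, now))  # True: exactly 2 days old
```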

ch8zer avatar Oct 12 '22 00:10 ch8zer

While "possible" with a rewrite of some things, you can use --time with the --sort options that support it, but your best bet is probably to pair those with --exclude-id-file so that nothing gets redownloaded.

OMEGARAZER avatar Oct 13 '22 21:10 OMEGARAZER

Yeah, the only way to do this would lower the number of submissions you can get overall. It'd work like this: Reddit gives us a thousand posts and we weed out whatever is older, say 300 posts. That leaves you 700 posts, not 1000 posts within the age limit. Reddit doesn't offer age-limited searches, so neither does the BDFR.

In other words, were this feature implemented, it would be equivalent to putting the post date in the filename and then deleting the files that are older than what you want. Hardly the optimal solution, but it's the best we can do.
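That client-side workaround can be sketched as follows. The filename pattern is an assumption for illustration (a date embedded via a file-naming scheme), not the BDFR's default naming:

```python
import re
from datetime import datetime, timedelta

# Sketch of the workaround: if each filename embeds the post date,
# files older than the cutoff can be pruned after downloading.
# The YYYY-MM-DD-in-filename convention here is assumed, not a BDFR default.
DATE_RE = re.compile(r"(\d{4}-\d{2}-\d{2})")

def too_old(filename: str, max_age: timedelta, now: datetime) -> bool:
    match = DATE_RE.search(filename)
    if match is None:
        return False  # no date in the name; keep the file
    posted = datetime.strptime(match.group(1), "%Y-%m-%d")
    return now - posted > max_age

now = datetime(2022, 1, 15)
print(too_old("abc123_2022-01-01_title.jpg", timedelta(days=2), now))   # True
print(too_old("def456_2022-01-14_title.jpg", timedelta(days=2), now))   # False
```

A cleanup pass would then walk the download directory and delete every file for which `too_old` returns True.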

Given that this would dramatically cut down on downloads I'm willing to implement it though.

Serene-Arc avatar Oct 14 '22 03:10 Serene-Arc

I'm also curious about this. If one uses the script, passes that file to --exclude-id-file, and then reruns the app, it'll skip those items. But does the new log file from the re-run contain information about these excluded files, so that running the script again outputs a monotonically growing file of all previous downloads? Or do I need to save every log file and concatenate the outputs together to keep a record of all saved posts so I don't redownload them?

Just wondering: what's the best strategy for running this script over particular subreddits I want to scrape, while ignoring all data I've downloaded in past runs of the app?

TimidLo avatar Jan 17 '23 03:01 TimidLo

You need to concatenate them, if using the scripts provided. Technically speaking, the BDFR does output the skipped submissions' IDs, so you could use a script to find all of those in the log files, but the provided scripts don't search for that.
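The concatenation strategy could look something like this minimal sketch: harvest submission IDs from every saved log file and merge them into one deduplicated list for --exclude-id-file. The regex assumes IDs appear after the literal text "ID " in log lines; the real BDFR log format may differ, so adjust the pattern to match your logs.

```python
import re
from pathlib import Path

# Assumed log-line shape, e.g. "... submission with ID abc123 ..."; adjust
# the pattern to whatever your BDFR log files actually contain.
ID_RE = re.compile(r"\bID (\w+)")

def collect_ids(log_dir: str) -> set[str]:
    """Gather every submission ID mentioned in any .log file under log_dir."""
    ids: set[str] = set()
    for log in Path(log_dir).glob("*.log"):
        for line in log.read_text().splitlines():
            ids.update(ID_RE.findall(line))
    return ids

def write_exclude_file(ids: set[str], dest: str) -> None:
    """Write one ID per line, sorted, for use with --exclude-id-file."""
    Path(dest).write_text("\n".join(sorted(ids)) + "\n")
```

Running this over a directory of accumulated logs after each scrape keeps the exclude list monotonically growing across runs.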

Serene-Arc avatar Jan 17 '23 03:01 Serene-Arc