Regex search returning greatly reduced hits & incorrect download count

Open Saortica opened this issue 6 years ago • 2 comments

Describe the bug

When using a basic regex match expression (e.g. a word in the title), RMD finds and downloads only a fraction of the expected submissions (expectation based on a subreddit search on the Reddit site). For example:

Searching for "rocky" in Natureporn on the Reddit website returns 25 results that directly match "rocky".

In RMD, a source with a regex title match for "rocky" (All, Top) claimed to find 18 files and download 17 of them, with no failed URLs reported. Only 4 files were actually downloaded. Using New, 2 more files came down. No further files came down with Best or Hot.

I had deleted my manifest files and reloaded to be sure it wasn't reading previous history. I could see no deduplication activity in the initial "Top" download. The "New" download appeared to deduplicate some files.

The problem seems to occur whatever search I use, e.g. searches return only 10-odd downloads where the Reddit website (and BulkdownloaderforReddit) find 200+.

At first I thought it might be the i.redd.it download bug mentioned in another bug report, but I noticed that at least one of the files that downloaded seemed to come from an i.redd.it URL.

Environment Info

  • OS: Win 10
  • RMD version: 3.0.0, downloaded a few days ago from the master branch.

Saortica, Nov 18 '19 16:11

Hey, thanks for the report.

This may be an issue caused by Reddit's built-in limit of 1000 posts per source. Have you tried running a similar download using the PushShift sources? This may resolve your problem.
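To illustrate why a listing cap could shrink the hit count: a regex filter can only test the posts a listing actually returns, so matching posts beyond the first 1000 are never seen. Below is a minimal sketch with made-up data. The `make_posts` helper, the post dictionaries, and the 1-in-50 match rate are all hypothetical, and this is not RMD's actual code.

```python
import re

LISTING_CAP = 1000  # the per-source limit mentioned above


def make_posts(total):
    """Hypothetical stand-in for a subreddit's history; every 50th title contains 'Rocky'."""
    return [
        {"id": i, "title": f"Rocky shoreline #{i}" if i % 50 == 0 else f"Forest trail #{i}"}
        for i in range(total)
    ]


def regex_title_hits(posts, pattern, cap=None):
    """Return posts whose title matches `pattern`, scanning at most `cap` posts."""
    rx = re.compile(pattern, re.IGNORECASE)
    visible = posts if cap is None else posts[:cap]
    return [p for p in visible if rx.search(p["title"])]


all_posts = make_posts(10_000)
capped = regex_title_hits(all_posts, r"rocky", cap=LISTING_CAP)
uncapped = regex_title_hits(all_posts, r"rocky")

print(len(capped), "hits within the first", LISTING_CAP, "posts")
print(len(uncapped), "hits across the full history")
```

With this synthetic data the capped scan finds far fewer matches than the full scan, which is the kind of gap a site-wide search would not show.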

If you'd like to search more efficiently, I've also just added a PushShift Search Source, which should be capable of directly searching for terms - and finding unlimited results.

Either way, I'll take a look at the Regex filter again to be sure. Thanks.

shadowmoose, Nov 19 '19 01:11

I'm having the same issues with the PushShift sources. Just to let you know, it seems to be a general issue, for me at least. Cheers.

Saortica, Nov 24 '19 03:11