Regex search returning greatly reduced hits & incorrect download count
Describe the bug
When using a basic regex match expression (e.g. a word in the title), RMD finds and downloads only a fraction of the expected submissions (the expectation is based on searching the subreddit on the Reddit site). For example:
Searching for rocky in Natureporn on the Reddit website returns 25 results that directly match rocky.
In RMD, a source regex title match for rocky, using All, Top, claimed to find 18 files and download 17 of them, with no failed URLs reported. Only 4 files were actually downloaded. Switching to New brought down 2 more files. Best and Hot brought down no further files.
I had deleted my manifest files and reloaded to be sure it wasn't reading previous history. There was no deduplication activity I could see in the initial "Top" download. The "New" download appeared to deduplicate some.
The problem seems to occur whatever search I use, e.g. searches that yield only 10 or so downloads where the Reddit website (and BulkdownloaderforReddit) find 200+.
At first I thought it might be the i.redd.it download bug mentioned in another bug report, but I noticed that at least one of the downloaded files seems to have come from an i.redd.it URL.
Environment Info
- OS: Win 10
- RMD version: 3.0.0, downloaded a few days ago from the master branch.
Hey, thanks for the report.
This may be an issue caused by Reddit's built-in limit of 1000 posts per source. Have you tried running a similar download using the PushShift sources? This may resolve your problem.
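As a rough illustration of why that limit matters, here is a minimal sketch. It is not RMD's actual code: the function name, the synthetic titles, and the way the cap is applied are all made up for the example. It assumes the filter only ever sees the posts a listing hands back, so a regex applied to a capped listing can match far fewer titles than a site-wide search would:

```python
import re

LISTING_CAP = 1000  # assumed per-listing limit, per the comment above


def filter_titles(titles, pattern, cap=LISTING_CAP):
    """Return the titles among the first `cap` entries that match `pattern`.

    Hypothetical helper: it mimics a regex title filter that can only
    inspect what a capped listing returns.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    return [t for t in titles[:cap] if rx.search(t)]


# Synthetic data: 5000 posts, every 100th title mentions "Rocky".
titles = [
    f"Rocky shoreline #{i}" if i % 100 == 0 else f"Forest trail #{i}"
    for i in range(5000)
]

capped = filter_titles(titles, r"rocky")
uncapped = filter_titles(titles, r"rocky", cap=len(titles))
print(len(capped), len(uncapped))  # prints: 10 50
```

If the numbers you see are much lower than a cap like this would explain, that points to the filter or the download step rather than the listing limit.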
If you'd like to search more efficiently, I've also just added a PushShift Search Source, which should be capable of directly searching for terms - and finding unlimited results.
Either way, I'll take a look at the Regex filter again to be sure. Thanks.
I'm seeing the same problem with the PushShift sources, so it seems to be a general issue, at least for me. Cheers.