RedditDownloader
RMD is only downloading my first 102 Reddit comments
Describe the bug
When I set the source in RMD to 'a User's comments or submissions' (either normally or with PushShift), select the 'scan comments' option, and set the 'user' field to my Reddit account, only the first 102 comments are added to the SQLite database. I have thousands of Reddit comments, so RMD should be scraping all of them.
There are some errors logged but I don't know if they are relevant. I'll include them anyway.
Environment Info
- OS: Windows 10 64 bit
- RMD Version: Latest 3.1.1
Screenshots/Information
My console's log:
Started downloader.
Authenticating via OAuth...
Authenticated as [Redacted]
HTTPSConnectionPool(host='static', port=443): Max retries exceeded with url: /apple-touch/wikipedia.png (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000027E11812348>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
HTTPConnectionPool(host='_static', port=80): Max retries exceeded with url: /favicon.ico (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001C5B587D408>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
Handler Exception [imgur] :: {[https://discord.gg/redacted](https://discord.gg/redacted} :: Invalid IPv6 URL
Traceback (most recent call last):
File "processing\handlers\__init__.py", line 35, in handle
File "processing\handlers\imgur.py", line 100, in handle
File "processing\handlers\imgur.py", line 72, in is_imgur
File "processing\handlers\imgur.py", line 68, in parse_url
File "urllib\parse.py", line 368, in urlparse
File "urllib\parse.py", line 459, in urlsplit
ValueError: Invalid IPv6 URL
No connection adapters were found for '[[REDACTED URL HERE]]([REDACTED URL HERE]'
Handler Exception [newspaper] :: {[REDACTED URL HERE]} :: could not create decoder object
Traceback (most recent call last):
File "processing\handlers\__init__.py", line 35, in handle
File "processing\handlers\generic_newspaper.py", line 32, in handle
File "newspaper\article.py", line 261, in parse
File "newspaper\article.py", line 281, in fetch_images
File "newspaper\article.py", line 452, in set_top_img
File "newspaper\images.py", line 224, in satisfies_requirements
File "newspaper\images.py", line 167, in fetch_image_dimension
File "newspaper\images.py", line 134, in fetch_url
File "newspaper\images.py", line 118, in fetch_url
File "PIL\ImageFile.py", line 413, in feed
File "PIL\Image.py", line 2808, in open
File "PIL\Image.py", line 2790, in _open_core
File "PIL\ImageFile.py", line 106, in __init__
File "PIL\WebPImagePlugin.py", line 60, in _open
RuntimeError: could not create decoder object
HTTPSConnectionPool(host='s', port=443): Max retries exceeded with url: /hxx1z4/808000/21802e8b5d152c2ab3dde22c557065a3/_/jira-favicon-hires.png (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000027E1180A308>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
HTTPSConnectionPool(host='s', port=443): Max retries exceeded with url: /hxx1z4/808000/21802e8b5d152c2ab3dde22c557065a3/_/jira-favicon-hires.png (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001C5B587D908>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
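For reference, the `ValueError: Invalid IPv6 URL` in the imgur traceback above is raised by `urllib.parse.urlsplit` when the host portion of a URL contains an unmatched `[` or `]`. Below is a minimal sketch of how a handler could skip such URLs instead of raising. The names `safe_urlsplit` and `looks_like_imgur` are hypothetical and are not RMD's actual code, and the example URL is made up to trigger the error.

```python
from typing import Optional
from urllib.parse import SplitResult, urlsplit


def safe_urlsplit(url: str) -> Optional[SplitResult]:
    """Return the parsed URL, or None if urllib rejects it as malformed."""
    try:
        return urlsplit(url)
    except ValueError:
        # urlsplit raises ValueError("Invalid IPv6 URL") when the host part
        # contains an unmatched '[' or ']'.
        return None


def looks_like_imgur(url: str) -> bool:
    """Hypothetical stand-in for a handler's host check (an assumption)."""
    parts = safe_urlsplit(url)
    if parts is None or parts.hostname is None:
        return False
    host = parts.hostname.lower()
    return host == "imgur.com" or host.endswith(".imgur.com")


if __name__ == "__main__":
    # A made-up URL with an unmatched '[' in the host is skipped, not raised.
    print(looks_like_imgur("https://[example.com/path"))
    print(looks_like_imgur("https://i.imgur.com/abc.png"))
```

With a guard like this, a malformed link extracted from a comment would be passed over by the handler rather than producing a traceback.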
RMD's PushShift library, PSAW, seems to be currently broken with multiple listings. A rewrite is in progress to fix these issues.