RMD is only downloading my first 102 Reddit comments

Open • xNul opened this issue 5 years ago • 1 comment

Describe the bug

When I use RMD with the source set to 'a User's comments or submissions' (either normally or with PushShift), select the 'scan comments' option, and set the 'user' fields to my Reddit account, only the first 102 comments are added to the SQLite database. I have thousands of Reddit comments, so RMD should be scraping all of them.

There are some errors logged but I don't know if they are relevant. I'll include them anyway.

Environment Info

OS: Windows 10 64 bit
RMD Version: Latest 3.1.1

Screenshots/Information

My console's log:

Started downloader.
Authenticating via OAuth...
Authenticated as [Redacted]

HTTPSConnectionPool(host='static', port=443): Max retries exceeded with url: /apple-touch/wikipedia.png (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000027E11812348>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
HTTPConnectionPool(host='_static', port=80): Max retries exceeded with url: /favicon.ico (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001C5B587D408>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
Handler Exception [imgur] :: {[https://discord.gg/redacted](https://discord.gg/redacted} :: Invalid IPv6 URL
Traceback (most recent call last):
  File "processing\handlers\__init__.py", line 35, in handle
  File "processing\handlers\imgur.py", line 100, in handle
  File "processing\handlers\imgur.py", line 72, in is_imgur
  File "processing\handlers\imgur.py", line 68, in parse_url
  File "urllib\parse.py", line 368, in urlparse
  File "urllib\parse.py", line 459, in urlsplit
ValueError: Invalid IPv6 URL
No connection adapters were found for '[[REDACTED URL HERE]]([REDACTED URL HERE]'
Handler Exception [newspaper] :: {[REDACTED URL HERE]} :: could not create decoder object
Traceback (most recent call last):
  File "processing\handlers\__init__.py", line 35, in handle
  File "processing\handlers\generic_newspaper.py", line 32, in handle
  File "newspaper\article.py", line 261, in parse
  File "newspaper\article.py", line 281, in fetch_images
  File "newspaper\article.py", line 452, in set_top_img
  File "newspaper\images.py", line 224, in satisfies_requirements
  File "newspaper\images.py", line 167, in fetch_image_dimension
  File "newspaper\images.py", line 134, in fetch_url
  File "newspaper\images.py", line 118, in fetch_url
  File "PIL\ImageFile.py", line 413, in feed
  File "PIL\Image.py", line 2808, in open
  File "PIL\Image.py", line 2790, in _open_core
  File "PIL\ImageFile.py", line 106, in __init__
  File "PIL\WebPImagePlugin.py", line 60, in _open
RuntimeError: could not create decoder object
HTTPSConnectionPool(host='s', port=443): Max retries exceeded with url: /hxx1z4/808000/21802e8b5d152c2ab3dde22c557065a3/_/jira-favicon-hires.png (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000027E1180A308>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
HTTPSConnectionPool(host='s', port=443): Max retries exceeded with url: /hxx1z4/808000/21802e8b5d152c2ab3dde22c557065a3/_/jira-favicon-hires.png (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001C5B587D908>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
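
Side note on the `Invalid IPv6 URL` traceback above: Python's `urllib.parse.urlparse` raises that `ValueError` when the host part of a URL contains a `[` without a matching `]`, which can happen when a Markdown-style link is passed along as if it were a plain URL. The sketch below only illustrates that behavior. `safe_parse` is a hypothetical helper, not part of RMD, and the input strings are assumed examples, since the exact string RMD received is not shown in the log.

```python
from typing import Optional
from urllib.parse import ParseResult, urlparse


def safe_parse(url: str) -> Optional[ParseResult]:
    """Hypothetical helper (not RMD code): parse a URL, or return None if malformed."""
    try:
        return urlparse(url)
    except ValueError:
        # urlparse raises ValueError("Invalid IPv6 URL") when the host part
        # contains "[" without a matching "]".
        return None


# Assumed example inputs, chosen only to show the two outcomes.
print(safe_parse("https://[discord.gg/redacted"))        # unbalanced bracket in host
print(safe_parse("https://imgur.com/a/example").netloc)  # well-formed URL
```

A handler that guards its parsing like this could skip a malformed link instead of logging a full traceback. Whether that would affect the 102-comment limit is a separate question.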

Dec 18 '20 23:12 xNul

RMD's PushShift library, PSAW, currently seems to be broken when handling multiple listings. A rewrite is in progress to fix these issues.

Dec 19 '20 00:12 shadowmoose