bulk-downloader-for-reddit
[BUG] Error with 7-digit IDs in ID files
- [x] I am reporting a bug.
- [x] I am running the latest version of BDfR
- [x] I have read the Opening an issue guide
Description
I'm downloading submissions with the --include-id-file parameter. However, the scraper crashes when it encounters a 7-digit ID. An ID such as 10cet41 (NSFW) in the text file triggers this error.
Logs
[2023-05-01 23:49:58,945 - bdfr.connector - DEBUG] - Disabling the following modules:
[2023-05-01 23:49:58,945 - bdfr.connector - Level 9] - Created download filter
[2023-05-01 23:49:58,946 - bdfr.connector - Level 9] - Created time filter
[2023-05-01 23:49:58,946 - bdfr.connector - Level 9] - Created sort filter
[2023-05-01 23:49:58,957 - bdfr.connector - Level 9] - Create file name formatter
[2023-05-01 23:49:58,957 - bdfr.connector - DEBUG] - Using authenticated Reddit instance
[2023-05-01 23:49:58,958 - bdfr.oauth2 - Level 9] - Loaded OAuth2 token for authoriser
[2023-05-01 23:49:59,171 - bdfr.oauth2 - Level 9] - Written OAuth2 token from authoriser to C:\Users\****\AppData\Local\BDFR\bdfr\default_config.cfg
[2023-05-01 23:49:59,433 - bdfr.connector - Level 9] - Resolved user to ****
[2023-05-01 23:49:59,433 - bdfr.connector - Level 9] - Created site authenticator
[2023-05-01 23:49:59,434 - bdfr.connector - Level 9] - Retrieved subreddits
[2023-05-01 23:49:59,434 - bdfr.connector - Level 9] - Retrieved multireddits
[2023-05-01 23:49:59,434 - bdfr.connector - Level 9] - Retrieved user data
[2023-05-01 23:49:59,434 - bdfr.connector - Level 9] - Retrieved submissions for given links
[2023-05-01 23:49:59,586 - root - ERROR] - Scraper exited unexpectedly
Traceback (most recent call last):
File "C:\Users\****\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\bdfr\__main__.py", line 161, in cli_clone
reddit_scraper.download()
File "C:\Users\****\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\bdfr\cloner.py", line 26, in download
self._download_submission(submission)
File "C:\Users\****\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\bdfr\downloader.py", line 62, in _download_submission
elif submission.subreddit.display_name.lower() in self.args.skip_subreddit:
File "C:\Users\****\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\praw\models\reddit\base.py", line 34, in __getattr__
self._fetch()
File "C:\Users\****\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\praw\models\reddit\comment.py", line 195, in _fetch
raise ClientException(f"No data returned for comment {self.fullname}")
praw.exceptions.ClientException: No data returned for comment t1_10cet41
Did you find any solution to this?
Nah I haven't found any workarounds for it yet.
I already reported this issue 6 weeks ago in #839; so far there have been no updates.
Apologies, I'm in the last week of my semester and have exams this week. Once they're done I'll have more time to close a bunch of issues and make a new release.
The issue is in https://github.com/aliparlakci/bulk-downloader-for-reddit/blob/8c293a46843c818bea2c2013db38191867993a14/bdfr/archiver.py#L62 where 7-digit IDs cause the archiver to create a PRAW comment instead of a submission.
I don't know how to fix it cleanly, but a workaround is to replace line 61 with this:
if len(sub_id) in (6, 7):
which will capture those 7-digit IDs too.
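For anyone hitting this before a release lands, the change amounts to widening the length check. A rough sketch of what that does (classify_id and the branch structure here are hypothetical, reconstructed from this discussion, not the actual archiver code):

```python
def classify_id(sub_id: str) -> str:
    """Hypothetical reconstruction of the archiver's ID dispatch with the
    workaround applied: 6- AND 7-character IDs go down the submission path,
    instead of 7-character ones falling through to the comment path."""
    if len(sub_id) in (6, 7):  # the suggested replacement for the 6-only check
        return "submission"
    return "comment"
```

With the original 6-only check, the 7-character ID 10cet41 from the log above would be fetched as a comment, which is exactly the `No data returned for comment t1_10cet41` failure.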
The whole thing is roughly made and doesn't work in many cases. Both post and comment IDs on Reddit increment continuously and can have any number of digits; they both just happen to be at 7 digits now. But there are also Reddit posts with two-digit IDs. I'm not sure how to solve that one, but the current solution is already not clean at all. This is not just an issue with 7-digit IDs.
Your fix treats every 6-7 digit ID as a post. That's OK for me, since I don't download comments from an ID file, but this really needs a rework.
For now, my fix is to just treat everything as a post: I replaced line 60 with if re.match(r"^\w{2,7}$", sub_id):
and deleted lines 62-63 completely. I'm not sure how you would even detect comments.
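Spelled out as a standalone check, that workaround looks roughly like this (is_probable_post_id is a made-up helper name; the real change is just editing archiver.py in place):

```python
import re


def is_probable_post_id(sub_id: str) -> bool:
    """Workaround from this comment: accept any 2-7 character \\w string
    as a post ID and never take the comment branch."""
    return re.match(r"^\w{2,7}$", sub_id) is not None


# e.g. filtering lines of an --include-id-file before handing them to PRAW:
ids = [line.strip() for line in ["10cet41\n", "c9h18\n", "x\n"]]
posts = [i for i in ids if is_probable_post_id(i)]
```

The obvious cost is that any comment ID in the file will now be fetched as a post and fail, but as noted above there is no way to tell the two apart from the bare ID alone.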
Edit: I was right. For example, the ID c9h18 refers to both a post and a comment.
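That collision makes sense once you decode the IDs: they are base36 counters, and posts and comments increment in separate namespaces, so the same string can be valid in both. Reddit disambiguates with the "fullname" type prefix (t1_ for comments, t3_ for submissions; that prefix is exactly what appears in the traceback above). A small illustration (to_fullname is a made-up helper):

```python
def to_fullname(base36_id: str, kind: str) -> str:
    """Build Reddit's unambiguous 'fullname': a type prefix plus the
    base36 ID (t1_ = comment, t3_ = submission)."""
    prefixes = {"comment": "t1", "submission": "t3"}
    return f"{prefixes[kind]}_{base36_id}"


# IDs are plain base36 integers, so length alone proves nothing:
post_number = int("c9h18", 36)  # the 5-char ID decodes to ~20.6 million
# Only the fullname separates the two namespaces:
full_post = to_fullname("c9h18", "submission")  # "t3_c9h18"
full_comment = to_fullname("c9h18", "comment")  # "t1_c9h18"
```

So an ID file that stored fullnames instead of bare IDs would sidestep the ambiguity entirely, at the cost of a format change.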
If you have any suggestions, then please do elaborate. If you know of those two-digit IDs, please provide them. The first step is getting actual IDs that we can test against. Once we have those, we can write the tests to prove the logic.