bulk-downloader-for-reddit

[BUG] Problem with special characters in file name

Open Fakeaccount12312 opened this issue 1 year ago • 2 comments

  • [x] I am reporting a bug.
  • [x] I am running the latest version of BDfR
  • [x] I have read the Opening an issue

Description

bdfr crashes when it tries to download posts whose titles contain special characters that some file systems do not allow in file names (such as "*\~). For example, this post (ignore the necrophilia joke): https://www.reddit.com/r/197/comments/xdi48h/

I tried to clone it with this command:

python -m bdfr clone -v -l https://www.reddit.com/r/197/comments/xdi48h/ ""

and got this error:

[2022-09-18 23:48:58,201 - bdfr.connector - DEBUG] - Setting maximum download wait time to 120 seconds
[2022-09-18 23:48:58,201 - bdfr.connector - DEBUG] - Setting datetime format string to ISO
[2022-09-18 23:48:58,202 - bdfr.connector - DEBUG] - Disabling the following modules: 
[2022-09-18 23:48:58,202 - bdfr.connector - DEBUG] - Using unauthenticated Reddit instance
[2022-09-18 23:48:59,541 - bdfr.downloader - DEBUG] - Attempting to download submission xdi48h
[2022-09-18 23:49:02,853 - bdfr.downloader - DEBUG] - Using YtdlpFallback with url https://v.redd.it/aq4adlbbuon91
[2022-09-18 23:49:11,267 - bdfr.downloader - ERROR] - Failed to write file in submission xdi48h to /media/anon/INTENSO/197/Cookieman077_fun :3_xdi48h.mp4: [Errno 22] Invalid argument: '/media/anon/INTENSO/197/Cookieman077_fun :3_xdi48h.mp4'
[2022-09-18 23:49:11,267 - bdfr.archive_entry.submission_archive_entry - DEBUG] - Retrieving full comment tree for submission xdi48h
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/__main__.py", line 154, in <module>
    cli()
  File "/usr/lib/python3/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/__main__.py", line 120, in cli_clone
    reddit_scraper.download()
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/cloner.py", line 21, in download
    self.write_entry(submission)
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/archiver.py", line 75, in write_entry
    self._write_entry_json(archive_entry)
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/archiver.py", line 87, in _write_entry_json
    self._write_content_to_disk(resource, content)
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/archiver.py", line 102, in _write_content_to_disk
    with open(file_path, 'w', encoding="utf-8") as file:
OSError: [Errno 22] Invalid argument: '/media/anon/INTENSO/197/Cookieman077_fun :3_xdi48h.json'

Note that the download step only logs an error and carries on, whereas the archive step crashes bdfr, which is more annoying when downloading multiple posts.

I then noticed that this only happens when downloading to my USB stick, so it isn't easily reproducible: I work on Linux with old hardware, and I formatted the USB stick myself (exFAT). I'd therefore suggest a more general way of dealing with this kind of problem.
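To narrow down which character the exFAT stick rejects, a small check like the one below may help. It is only a sketch: `EXFAT_INVALID_CHARS` and `find_invalid_chars` are hypothetical names and not part of bdfr, and the character set assumes the exFAT rule that control characters and `" * / : < > ? \ |` are not allowed in file names.

```python
# Hypothetical helper, not part of bdfr.
# Assumption: exFAT rejects control characters (0x00-0x1F) and " * / : < > ? \ |
EXFAT_INVALID_CHARS = set('"*/:<>?\\|') | {chr(code) for code in range(0x20)}


def find_invalid_chars(file_name: str) -> list[str]:
    """Return the characters in file_name that exFAT would reject, in order of appearance."""
    return [char for char in file_name if char in EXFAT_INVALID_CHARS]


# The file name from the error message above.
print(find_invalid_chars("Cookieman077_fun :3_xdi48h.mp4"))  # [':']
```

Under that assumption, the colon in "fun :3" is the only offending character in this particular file name.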

I see three possible fixes: (1) an option that automatically removes these problematic characters from file names; (2) a catch around the archive write, so that archiving fails for this file without crashing the whole process; or (3) replacing the characters and trying again when the write fails. The second would obviously be the easiest to implement and would make this problem more manageable; the first is more of a feature request.
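The catch-and-retry idea could look roughly like the sketch below. It is not bdfr's actual code: `sanitize_file_name` and `write_text_safely` are hypothetical names, and the set of replaced characters is an assumption based on what exFAT and Windows file systems reject.

```python
import logging
import re
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)

# Assumption: control characters and " * / : < > ? \ | are the problematic ones.
_INVALID_CHARS = re.compile(r'[\x00-\x1f"*/:<>?\\|]')


def sanitize_file_name(name: str, replacement: str = "_") -> str:
    """Replace every character matched by _INVALID_CHARS with the replacement string."""
    return _INVALID_CHARS.sub(replacement, name)


def write_text_safely(file_path: Path, content: str) -> Optional[Path]:
    """Write content to file_path; on OSError, retry once with a sanitized file name.

    Returns the path that was actually written, or None if every attempt failed,
    so the caller can skip this entry instead of crashing.
    """
    candidates = [file_path]
    sanitized = file_path.with_name(sanitize_file_name(file_path.name))
    if sanitized != file_path:
        candidates.append(sanitized)

    for candidate in candidates:
        try:
            with open(candidate, "w", encoding="utf-8") as file:
                file.write(content)
            return candidate
        except OSError as error:
            logger.error("Failed to write %s: %s", candidate, error)
    return None
```

A caller that gets `None` back can log the failure and move on to the next submission, which covers the second suggestion; the retry with a sanitized name covers the third.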

Command

python -m bdfr clone -v -l https://www.reddit.com/r/197/comments/xdi48h/ ""

Environment (please complete the following information):

  • OS: Linux Lite 6
  • Python version: 3.10.4
  • Fresh install

Logs

[2022-09-19 00:17:58,009 - bdfr.connector - DEBUG] - Setting maximum download wait time to 120 seconds
[2022-09-19 00:17:58,010 - bdfr.connector - DEBUG] - Setting datetime format string to ISO
[2022-09-19 00:17:58,011 - bdfr.connector - DEBUG] - Disabling the following modules: 
[2022-09-19 00:17:58,011 - bdfr.connector - Level 9] - Created download filter
[2022-09-19 00:17:58,011 - bdfr.connector - Level 9] - Created time filter
[2022-09-19 00:17:58,011 - bdfr.connector - Level 9] - Created sort filter
[2022-09-19 00:17:58,011 - bdfr.connector - Level 9] - Create file name formatter
[2022-09-19 00:17:58,012 - bdfr.connector - DEBUG] - Using unauthenticated Reddit instance
[2022-09-19 00:17:58,013 - bdfr.connector - Level 9] - Created site authenticator
[2022-09-19 00:17:58,013 - bdfr.connector - Level 9] - Retrieved subreddits
[2022-09-19 00:17:58,014 - bdfr.connector - Level 9] - Retrieved multireddits
[2022-09-19 00:17:58,014 - bdfr.connector - Level 9] - Retrieved user data
[2022-09-19 00:17:58,014 - bdfr.connector - Level 9] - Retrieved submissions for given links
[2022-09-19 00:17:59,011 - bdfr.downloader - DEBUG] - Attempting to download submission xdi48h
[2022-09-19 00:18:04,075 - bdfr.downloader - DEBUG] - Using YtdlpFallback with url https://v.redd.it/aq4adlbbuon91
[2022-09-19 00:18:12,746 - bdfr.downloader - ERROR] - [Errno 22] Invalid argument: '/media/anon/INTENSO/197/Cookieman077_fun :3_xdi48h.mp4'
Traceback (most recent call last):
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/downloader.py", line 110, in _download_submission
    with open(destination, 'wb') as file:
OSError: [Errno 22] Invalid argument: '/media/anon/INTENSO/197/Cookieman077_fun :3_xdi48h.mp4'
[2022-09-19 00:18:12,746 - bdfr.downloader - ERROR] - Failed to write file in submission xdi48h to /media/anon/INTENSO/197/Cookieman077_fun :3_xdi48h.mp4: [Errno 22] Invalid argument: '/media/anon/INTENSO/197/Cookieman077_fun :3_xdi48h.mp4'
[2022-09-19 00:18:12,747 - bdfr.archive_entry.submission_archive_entry - DEBUG] - Retrieving full comment tree for submission xdi48h
[2022-09-19 00:18:12,751 - root - ERROR] - Scraper exited unexpectedly
Traceback (most recent call last):
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/__main__.py", line 120, in cli_clone
    reddit_scraper.download()
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/cloner.py", line 21, in download
    self.write_entry(submission)
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/archiver.py", line 75, in write_entry
    self._write_entry_json(archive_entry)
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/archiver.py", line 87, in _write_entry_json
    self._write_content_to_disk(resource, content)
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/archiver.py", line 102, in _write_content_to_disk
    with open(file_path, 'w', encoding="utf-8") as file:
OSError: [Errno 22] Invalid argument: '/media/anon/INTENSO/197/Cookieman077_fun :3_xdi48h.json'

Fakeaccount12312, Sep 18 '22 22:09

I am pretty sure this is already handled on Windows in some way, but I currently don't have access to a Windows machine to check.

Fakeaccount12312, Sep 18 '22 22:09

We do have logic to screen out invalid characters in file names, though I have no clue what in that filename is tripping Windows up.

Serene-Arc, Sep 19 '22 01:09