What about languages with non-latin characters?

Open thomas694 opened this issue 2 years ago • 3 comments

  • [x] I am requesting a new option.
  • [x] I am running the latest version of BDfR
  • [x] I have read the Opening an issue

Description

A problem with emojis occurred on a Windows system using the default file name scheme {REDDITOR}_{TITLE}_{POSTID}. The screenshot in #221 shows a problem the logger has with Unicode characters. As a fix (#222), all non-ASCII characters are removed. Are emojis a problem in Windows filesystems? What about characters from other languages, e.g. Japanese, Korean, or Chinese?

A simple solution is to add a line to set the encoding to UTF-8 in create_file_logger in connector.py#L224:

        file_handler = logging.handlers.RotatingFileHandler(
            log_path,
            mode="a",
            backupCount=backup_count,
            encoding="utf-8"
        )
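
To illustrate the idea outside of BDFR, here is a minimal, self-contained sketch. The helper names `make_utf8_file_logger` and `demo_roundtrip` are hypothetical and are not part of BDFR; only `logging.handlers.RotatingFileHandler` and its `encoding` parameter come from the standard library. It writes a line containing Japanese, Korean, Chinese, and an emoji, then reads the file back as UTF-8. When no `encoding` is passed, Python's `open()` falls back to the locale's preferred encoding, which is where the platform difference would come from.

```python
import logging
import logging.handlers
import tempfile
from pathlib import Path


def make_utf8_file_logger(log_path: Path, backup_count: int = 3) -> logging.Logger:
    """Hypothetical helper (not BDFR's create_file_logger): rotating log pinned to UTF-8."""
    logger = logging.getLogger(f"utf8_demo.{log_path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.handlers.RotatingFileHandler(
        log_path,
        mode="a",
        backupCount=backup_count,
        encoding="utf-8",  # explicit, so the locale's code page is never used
    )
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def demo_roundtrip() -> str:
    """Log a line with non-Latin text and return the file content decoded as UTF-8."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = Path(tmp) / "demo.log"
        logger = make_utf8_file_logger(log_path)
        logger.info("title: 日本語 한국어 中文 😀")
        # Close the handler before the directory is cleaned up (required on Windows).
        for handler in list(logger.handlers):
            handler.close()
            logger.removeHandler(handler)
        return log_path.read_text(encoding="utf-8")


if __name__ == "__main__":
    print(demo_roundtrip())
```

If the round trip returns the same characters on both Windows and Linux, the log file itself is not the limiting factor, at least under the assumptions of this sketch.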

Out of curiosity, does the logger behave differently on Linux systems, or are Unicode characters simply missing from the log files there too?

Can we get the wider range of characters back by skipping the _strip_emojis method? Probably via a new option, to remain backward compatible.

thomas694 avatar Feb 10 '23 21:02 thomas694

No, this is not possible. Unicode characters are included in Linux log files, but Windows uses a severely restricted character set known as Windows-1252. If you want the full character set, then the only solution is to run the BDFR on Linux, Unix, or a derivative system, such as macOS.

Serene-Arc avatar Feb 10 '23 23:02 Serene-Arc

The changed version works pretty well here.

Second, I'm sure Windows uses a code page that fits the region it is used in, and that is not always 1252 (it is 437 here). But if you tell your program to write files in UTF-8, instead of letting the library pick a code page automatically, the files will reliably contain the Unicode characters you write, and they become less dependent on regional settings and the like.
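
A small sketch of the code page point, using only Python's built-in codecs. The helper `encodable` is a name made up for this illustration. It checks which sample strings can be represented in code page 437, Windows-1252, and UTF-8, and prints the encoding Python would pick by default on the current machine.

```python
import locale


def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    # The encoding that open() uses when none is given explicitly.
    print("preferred encoding:", locale.getpreferredencoding(False))

    samples = ["café", "€", "日本語", "😀"]
    for name in ("cp437", "cp1252", "utf-8"):
        results = {sample: encodable(sample, name) for sample in samples}
        print(name, results)
```

The two code pages disagree even on Western characters such as the euro sign, and neither can hold CJK text or emojis, while UTF-8 accepts all of the samples regardless of regional settings.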

thomas694 avatar Feb 11 '23 01:02 thomas694

If Windows doesn't already write UTF-8 to log files, that can be done with an enhancement.

Serene-Arc avatar Feb 11 '23 02:02 Serene-Arc