
No Support for Cyrillic characters in file names

Open gzavadskiy opened this issue 3 years ago • 5 comments

Please add support for Cyrillic (Unicode?) characters in file names. At the moment they are cut out of file names. For example, the title "Выполняем тестовое задание на Junior Python разработчика с зарплатой 70000р | PDF в MP3" is transformed into the file name "2022-05-01_pythontoday_junior-python-70000-pdf-mp3_Q0lHb-FCATk_1080p-vp9-opus.mkv".

gzavadskiy avatar May 10 '22 18:05 gzavadskiy

This is already on the to-do list due to another issue. Initially TubeSync allowed full Unicode filenames, however this caused problems on some network shares that didn't fully support long Unicode filenames, so I went the other way and stripped out everything that wasn't a-z, A-Z and 0-9 with a few hyphens etc. As you report, this isn't useful for any title that isn't in a Latin-based alphabet, which is too conservative the other way. I am testing a few libraries to allow "safe" Unicode for filenames and will probably integrate one of those when this gets fixed. Thanks for the issue.
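For illustration only, here is a minimal sketch of an ASCII allow list of the kind described above. It is not TubeSync's actual code, and the function name `ascii_slug` is hypothetical, but it shows why the title from the report collapses to only its Latin words and digits.

```python
import re


def ascii_slug(title: str) -> str:
    """Keep only a-z, A-Z and 0-9; collapse everything else into hyphens."""
    # Every run of characters outside the allow list becomes one hyphen.
    slug = re.sub(r"[^a-zA-Z0-9]+", "-", title)
    # Drop leading/trailing hyphens and lower-case the result.
    return slug.strip("-").lower()


title = ("Выполняем тестовое задание на Junior Python "
         "разработчика с зарплатой 70000р | PDF в MP3")
print(ascii_slug(title))  # junior-python-70000-pdf-mp3
```

All of the Cyrillic words fall outside the allow list, so only "junior-python-70000-pdf-mp3" survives, which matches the title portion of the reported file name.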

meeb avatar May 11 '22 02:05 meeb

Maybe it would be good to use this: https://pypi.org/project/transliterate/

Bi-directional transliterator for Python. Transliterates (unicode) strings according to the rules specified in the language packs (source script <-> target script).

Comes with language packs for the following languages (listed in alphabetical order):

Armenian, Bulgarian (beta), Georgian, Greek, Macedonian (alpha), Mongolian (alpha), Russian, Serbian (alpha), Ukrainian (beta).

There are also a number of useful tools included, such as:

Simple lorem ipsum generator, which allows lorem ipsum generation in the language chosen.
Language detection for the text (if appropriate language pack is available).
Slugify function for non-latin texts.

gzavadskiy avatar May 11 '22 12:05 gzavadskiy

Thanks, I had seen a couple of libraries like that. It would certainly help for Cyrillic languages specifically, but I was hoping to find an off-the-shelf solution that makes filenames "safe" for all languages (e.g. Japanese or Thai) and that can properly strip emojis. So far a character allow list approach has been much easier than attempting a block list, given how vast the Unicode space is. If need be I'll fall back to character packs for different languages, but this has to be a problem others have found solutions for already.
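One possible direction, sketched here with only the Python standard library, is an allow list based on Unicode general categories rather than on individual characters. This is an assumption about what "safe" could mean, not a tested solution or anything TubeSync ships: the function name `safe_unicode_name` and the `max_bytes` default are hypothetical, and the byte cap is only a guess at what a restrictive network share might need.

```python
import unicodedata


def safe_unicode_name(title: str, max_bytes: int = 200) -> str:
    """Keep letters, digits and attached combining marks from any script."""
    title = unicodedata.normalize("NFC", title)
    out = []
    for ch in title:
        major = unicodedata.category(ch)[0]  # "L", "M", "N", "P", "S", ...
        if major in ("L", "N"):
            # Letters and numbers from any script are allowed.
            out.append(ch)
        elif major == "M" and out and out[-1] != "-":
            # Combining marks (needed for e.g. Thai) only after a kept char.
            out.append(ch)
        elif out and out[-1] != "-":
            # Punctuation, symbols (including emoji), spaces: one hyphen.
            out.append("-")
    name = "".join(out).strip("-").lower()
    # Cap the UTF-8 byte length; a partial trailing character is dropped.
    encoded = name.encode("utf-8")[:max_bytes]
    return encoded.decode("utf-8", errors="ignore").rstrip("-")


print(safe_unicode_name("Привет, мир! 🎉"))  # привет-мир
```

Emoji fall in the symbol categories, so they are removed without a block list, while Cyrillic, Japanese or Thai letters are kept. Whether a given file system or share accepts the result would still need testing.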

meeb avatar May 12 '22 03:05 meeb