rgain3

Fails on filenames that use a character encoding different from the system

Open StyXman opened this issue 4 years ago • 10 comments

I have a friend who has an audio collection that predates the general availability of UTF-8 on OSs. He also has a lot of music with band, album and song names that include non-ASCII characters. Combine those two and you get:

Traceback (most recent call last):
  File "/usr/bin/collectiongain", line 6, in <module>
    collectiongain()
  File "/usr/lib/python3/dist-packages/rgain3/script/collectiongain.py", line 341, in collectiongain
    do_collectiongain(args[0], opts.ref_level, opts.force, opts.dry_run,
  File "/usr/lib/python3/dist-packages/rgain3/script/collectiongain.py", line 274, in do_collectiongain
    collect_files(music_dir, files, visited_cache,
  File "/usr/lib/python3/dist-packages/rgain3/script/collectiongain.py", line 117, in collect_files
    print("  [%i] %s |" % (i, filepath), end='')
UnicodeEncodeError: 'utf-8' codec can't encode character '\udced' in position 49: surrogates not allowed

Notice that these are valid filenames (from the OS point of view; on Unix, any byte except \x00 and / can be part of the path), just not valid UTF-8. Yes, he could sit down and rename all those files and directories, but I guess he won't be the only one.
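For anyone who wants to see the failure without touching a real collection, here is a minimal sketch. The file name is a made-up example of a Latin-1 encoded name; the decoding step mimics what Python does to file names on a UTF-8 system.

```python
# Hypothetical example: a Latin-1 encoded name, as an old collection might contain.
raw_name = "Café.mp3".encode("latin-1")  # b'Caf\xe9.mp3', not valid UTF-8

# Python maps each undecodable byte to a lone surrogate (surrogateescape).
name = raw_name.decode("utf-8", "surrogateescape")

# Encoding strictly as UTF-8 is what print() does on a UTF-8 stdout.
try:
    name.encode("utf-8")
    failed = False
    message = ""
except UnicodeEncodeError as exc:
    failed = True
    message = str(exc)

print(repr(name), failed, message)
```

The byte 0xE9 becomes the surrogate U+DCE9, which is the same kind of character the traceback complains about.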

OTOH, you could say 'go fix your filenames' and we will understand. Cheers!

StyXman avatar Jul 29 '20 17:07 StyXman

Thanks for reporting.

Non-UTF-8 file names are definitely something the script should be able to deal with. You're probably right that your friend won't be the only one.

This problem should be solvable by making use of PEP 383.
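A minimal sketch of what that could look like for the print call in the traceback. The helper name display_name is hypothetical and not part of rgain3, and the sketch assumes a UTF-8 filesystem encoding (os.fsencode does the same encoding step with whatever encoding the system actually uses).

```python
def display_name(path: str) -> str:
    """Return a printable form of a path that may contain surrogate escapes.

    Hypothetical helper, not part of rgain3.
    """
    # Recover the original bytes; surrogateescape reverses what Python did
    # when it decoded the file name (UTF-8 assumed here).
    raw = path.encode("utf-8", "surrogateescape")
    # Decode again, replacing undecodable bytes with U+FFFD for display only.
    return raw.decode("utf-8", "replace")


print("  [%i] %s |" % (1, display_name("Caf\udce9.mp3")), end="")
```

The original string (with surrogates) would still be the one passed to os functions; only the text shown to the user changes.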

chaudum avatar Sep 03 '20 20:09 chaudum

This regression was probably introduced by commit 6de774076d76ded856c03968495b90001d293035

@StyXman could you try a Python3 compatible version prior to this commit?

git clone https://github.com/chaudum/rgain.git
cd rgain
git checkout aef5bde971c204d46e11a5f808aa4152cefa9687
python3 -m venv env
env/bin/python -m pip install -Ue .

chaudum avatar Oct 27 '20 20:10 chaudum

@StyXman Unfortunately I could not reproduce your issue yet. I tried creating files with random bytes as filenames, but that did not trigger it either; I ran into a different issue instead:

$ python
Python 3.8.6 (default, Sep 25 2020, 09:36:53) 
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir()
['album-tag.mp3']
>>> os.rename('album-tag.mp3', os.urandom(4)+b'.mp3')
>>> os.listdir()
['\udcdb\udcc3\udcc0L.mp3']
$ env/bin/collectiongain /tmp/tmp.iEg1y395Tw
Collecting files ...
  [1] ���L.mp3 |Test Album
Dispatching jobs ...
Now waiting for results ...
Unfortunately, there were some errors:
Test Album:Checking for Replay Gain information ...
  /tmp/tmp.iEg1y395Tw/���L.mp3:none
Calculating Replay Gain information ...
Traceback (most recent call last):
  File "/home/christian/sandbox/chaudum/rgain/rgain3/replaygain.py", line 112, in do_gain
    tracks_data, albumdata = calculate_gain(files, ref_level)
  File "/home/christian/sandbox/chaudum/rgain/rgain3/replaygain.py", line 53, in calculate_gain
    rg.start()
  File "/home/christian/sandbox/chaudum/rgain/rgain3/lib/rgcalc.py", line 93, in start
    if not self._next_file():
  File "/home/christian/sandbox/chaudum/rgain/rgain3/lib/rgcalc.py", line 184, in _next_file
    self.src.set_property("location", fname)
TypeError: could not convert '/tmp/tmp.iEg1y395Tw/\udcdb\udcc3\udcc0L.mp3' to type 'gchararray' when setting property 'GstFileSrc.location'


0 successful, 1 failed.
All finished.
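The TypeError above comes from handing a string with lone surrogates to a property that expects valid UTF-8. One possible direction, sketched here in plain Python and untested against rgain3 or GStreamer, is to percent-encode the original bytes into a file URI. The function name path_to_file_uri is hypothetical, and whether a URI-based source element would accept the result is an assumption.

```python
import os
from urllib.parse import quote


def path_to_file_uri(path: str) -> str:
    """Build a file:// URI from a path, keeping non-UTF-8 bytes intact.

    Hypothetical helper; assumes a POSIX-style absolute path.
    """
    # os.fsencode turns surrogate escapes back into the original bytes.
    raw = os.fsencode(os.path.abspath(path))
    # quote() accepts bytes and percent-encodes everything outside the safe set.
    return "file://" + quote(raw)


print(path_to_file_uri("/tmp/tmp.iEg1y395Tw/\udcdb\udcc3\udcc0L.mp3"))
```

The resulting URI is pure ASCII, so it contains no surrogates regardless of the encoding the file name was originally written in.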

chaudum avatar Nov 10 '20 21:11 chaudum

Could you provide information about your Python version and encoding?

python --version

python -c "import sys; print(sys.getfilesystemencoding(), sys.getdefaultencoding())"

locale
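If it helps to narrow things down, a small script along these lines could list the affected paths in a collection. This is only a sketch; the function names are made up for illustration, and it assumes the names were decoded by Python with surrogate escapes as in the traceback.

```python
import os
import sys


def is_undecodable(path: str) -> bool:
    """True if the path contains surrogate escapes, i.e. bytes that were not valid text."""
    try:
        path.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def find_undecodable(root: str):
    """Yield every directory or file path under root that cannot be encoded as UTF-8."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            if is_undecodable(full):
                yield full


if __name__ == "__main__" and len(sys.argv) > 1:
    for path in find_undecodable(sys.argv[1]):
        # ascii() escapes the surrogates, so this print cannot fail.
        print(ascii(path))
```

Run it with the music directory as the only argument; each line of output is one path that would trip the print call in collect_files.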

chaudum avatar Nov 10 '20 21:11 chaudum

Could you provide information about your Python version and encoding?

@StyXman :arrow_up:

chaudum avatar Jan 26 '21 20:01 chaudum

Sorry, busy with life :(

mdione@diablo:~$ python3
Python 3.9.1+ (default, Jan 10 2021, 15:42:50)
[GCC 10.2.1 20201224] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ
environ({'LANGUAGE': 'en_US:es:fr:it', 'LANG': 'en_US.UTF-8', 'LC_TIME': 'es_AR.UTF-8'})

I was pretty sure at least LC_ALL would be en_US.UTF-8. I guess LANG is picked up instead?

StyXman avatar Jan 27 '21 07:01 StyXman

Ah:

mdione@diablo:~$ python3 -c "import sys; print(sys.getfilesystemencoding(), sys.getdefaultencoding())"
utf-8 utf-8
mdione@diablo:~$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:es:fr:it
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME=es_AR.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

StyXman avatar Jan 27 '21 07:01 StyXman

Thanks, I'll try again to reproduce the issue on my machine.

chaudum avatar Jan 27 '21 07:01 chaudum

I am also having this problem. My OS is Ubuntu 22.04.4. I installed rgain via apt install replaygain

My failing output:

Collecting files ...
Traceback (most recent call last):
  File "/usr/bin/collectiongain", line 6, in <module>
    collectiongain()
  File "/usr/lib/python3/dist-packages/rgain3/script/collectiongain.py", line 341, in collectiongain
    do_collectiongain(args[0], opts.ref_level, opts.force, opts.dry_run,
  File "/usr/lib/python3/dist-packages/rgain3/script/collectiongain.py", line 274, in do_collectiongain
    collect_files(music_dir, files, visited_cache,
  File "/usr/lib/python3/dist-packages/rgain3/script/collectiongain.py", line 117, in collect_files
    print("  [%i] %s |" % (i, filepath), end='')
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcea' in position 53: surrogates not allowed

python3 --version:

Python 3.10.12

python3 -c "import sys; print(sys.getfilesystemencoding(), sys.getdefaultencoding())":

utf-8 utf-8

locale:

LANG=en_CA.UTF-8
LANGUAGE=en_CA:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

I am happy to report any other information that can help diagnose this problem.

brettpim avatar Mar 06 '24 00:03 brettpim

Thanks for reporting.

Non-UTF-8 file names are definitely something the script should be able to deal with. You're probably right that your friend won't be the only one.

This problem should be solvable by making use of PEP 383.

How can I use PEP 383 to try to solve this issue?
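My understanding so far, as a sketch rather than a confirmed fix: PEP 383 is the surrogateescape error handler, and an output stream that uses it writes the original bytes back instead of raising. The example below uses an in-memory stream to stand in for stdout; the file name is made up.

```python
import io

# In-memory stand-in for a UTF-8 stdout that uses the PEP 383 error handler.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="utf-8", errors="surrogateescape")

# Same format string as the failing print in collect_files.
print("  [%i] %s |" % (1, "Caf\udce9.mp3"), end="", file=stream)
stream.flush()

written = buffer.getvalue()
print(written)
```

For the real stdout, the same handler can be selected without editing the script by setting PYTHONIOENCODING=utf-8:surrogateescape in the environment before running collectiongain, or in code via sys.stdout.reconfigure(errors="surrogateescape") on Python 3.7+. That should only affect the print step; later stages may still reject these names.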

brettpim avatar Mar 06 '24 00:03 brettpim