mid3v2 crashes with "UnicodeEncodeError: surrogates not allowed" on files with accented characters in the filename

Open martinwguy opened this issue 1 month ago • 1 comments

I was trying to check whether ISRC tags are present in a large audio collection using mid3v2 -l 00*/*3 | grep -a TSRC, but it dies halfway through with:

IDv2 tag info for 00-225167/mina - volami nel cuore.mp3
TIT2=Volami nel cuore
TPE1=MINA
TRCK=1
IDv2 tag info for Traceback (most recent call last):
  File "/usr/bin/mid3v2", line 33, in <module>
    sys.exit(load_entry_point('mutagen==1.46.0', 'console_scripts', 'mid3v2')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/mutagen/_tools/mid3v2.py", line 484, in entry_point
    return main(sys.argv)
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/mutagen/_tools/mid3v2.py", line 469, in main
    list_tags(args)
  File "/usr/lib/python3/dist-packages/mutagen/_tools/mid3v2.py", line 335, in list_tags
    print("IDv2 tag info for", filename)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc85' in position 13: surrogates not allowed

This isn't Mina's fault. The culprit is the next file's name, "modà - la notte.mp3", which is ANSI or CP437 encoded rather than UTF-8, so à is stored as the single byte 0x85. The same goes for other files whose names contain 0x8A for è, 0xB4 for é, 0x95 for ò, 0x97 for ù, 0xA2 for ó, and so on.
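
For anyone wanting to reproduce this without the files, here is a minimal sketch of what I believe is happening. It assumes Python decodes non-UTF-8 filename bytes with the surrogateescape error handler (which is how 0x85 becomes '\udc85'). The printable() helper is hypothetical, my own illustration of one possible workaround, and not anything taken from mutagen:

```python
# The raw bytes of the offending filename, with 0x85 standing in for "à".
raw = b"mod\x85 - la notte.mp3"

# On a UTF-8 locale, Python turns undecodable filename bytes into lone
# surrogates (surrogateescape), so 0x85 becomes U+DC85.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding the name strictly, as print() does on a UTF-8 stdout, fails.
try:
    name.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)


def printable(filename: str) -> str:
    """Hypothetical helper: return a string that is always safe to print.

    Recovers the original bytes via surrogateescape, then decodes them
    again, replacing anything that is not valid UTF-8 with U+FFFD.
    """
    original_bytes = filename.encode("utf-8", errors="surrogateescape")
    return original_bytes.decode("utf-8", errors="replace")


print("IDv2 tag info for", printable(name))

# If the legacy encoding is known (CP437 here), the real name is recoverable.
print(raw.decode("cp437"))
```

Replacing the bad byte loses information, so a tool might prefer some other escaping; the point is only that the filename has to be sanitised before it reaches print().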

This is on Debian GNU/Linux with LANG=en_GB.UTF-8 and mutagen 1.46.0 (the version shown in the traceback).

martinwguy, May 20 '24 08:05