
Tag problem and bad character errors when dealing with Japanese cue file

Open vymague opened this issue 2 years ago • 9 comments

Version: cuetools 1.4.1-4 from Arch's repo

Similar issue with flacon. https://github.com/flacon/flacon/issues/176

I tried to tag my split .flac files with cuetag.sh problematic.cue *.flac, but every field (title, artist, etc.) ends up filled with #### (number sign, hash, pound sign).

The tags are originally in Japanese, and the .flac files were split using shnsplit.
I haven't tried reproducing the problem with other Asian languages like Chinese or Korean.

I've also tried cueprint problematic.cue. It prints the track information correctly, Japanese titles included, but it starts with these errors:

bad character '�'
bad character '�'
bad character '�'
bad character 'R'
bad character 'E'
bad character 'M'

My default locale is en_US.UTF-8.

List of enabled locales (locale -a):

C
en_US.utf8
ja_JP.utf8
POSIX
zh_CN.utf8
zh_TW.utf8

The problematic .cue file: problematic.cue.zip

I haven't seen anyone else report the problem, so it could be just on my end.

edit: Removing the BOM fixes only the bad character errors, not the tag problem.

vymague avatar Mar 13 '22 07:03 vymague

Someone figured out that, in flacon's case, it's an encoding issue with the flac library. It could be specific to Arch. https://github.com/flacon/flacon/issues/176#issuecomment-1078955451

edit: He also suggested workarounds that work for flacon. They will probably work for cuetools too.

vymague avatar Mar 25 '22 12:03 vymague

I ran into the same error. In my case, it was caused by a BOM at the beginning of the CUE file (so the parser fails on three unrecognizable characters, followed by the first REM directive).

Stripping the BOM fixed the problem for me.

CircuitCoder avatar Mar 29 '22 19:03 CircuitCoder

Thanks, @CircuitCoder.

I checked with file that the cue I posted has a byte order mark (BOM), and removed it with

sed '1s/^\xEF\xBB\xBF//' < problematic.cue > new.cue

as per https://unix.stackexchange.com/questions/381230/how-can-i-remove-the-bom-from-a-utf-8-file/381263#381263

Afterwards cueprint no longer reports the bad character errors. I'll check whether cuetag.sh works this weekend.
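
For anyone who wants to check for the BOM before rewriting the file, here is a minimal sketch. The helper names has_utf8_bom and strip_utf8_bom are made up for this example, and it assumes head -c and tail -c are available (they are on GNU and BSD systems). It only handles the UTF-8 BOM (bytes EF BB BF), which is what the sed command above removes.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if file $1 starts with the UTF-8 BOM (EF BB BF).
has_utf8_bom() {
    # Dump the first three bytes as hex and strip the whitespace od adds.
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Hypothetical helper: write a BOM-free copy of $1 to $2.
strip_utf8_bom() {
    if has_utf8_bom "$1"; then
        # tail -c +4 starts output at byte 4, i.e. it skips the 3 BOM bytes.
        tail -c +4 "$1" > "$2"
    else
        # No BOM found: copy the file unchanged.
        cp "$1" "$2"
    fi
}

# Example use (file names taken from the commands above):
#   strip_utf8_bom problematic.cue new.cue
```

Unlike the sed one-liner, this does not depend on sed understanding \xHH escapes, which is a GNU extension.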

vymague avatar Mar 30 '22 08:03 vymague

Ok, that didn't work for the tag issue.

Basically I split a .wav file into flac this way:

shnsplit -f problematic.cue -o flac -t "%n %t" original.wav

This produces the flacs with the correct Japanese title.

Then I tag them with:

cuetag.sh problematic.cue *.flac

This produces flacs with #### (number sign, hash, pound sign) in their tags.

I have removed the BOM from problematic.cue as suggested, and cueprint no longer reports any bad character errors.

vymague avatar Mar 30 '22 09:03 vymague

Maybe metaflac needs --no-utf8-convert.

With metaflac, adding that option while importing tags from a file worked for me when the tags contain CJK characters.

For batch tagging with cuetag.sh, I had to edit cuetag.sh and add that option to METAFLAC.

This behavior of metaflac actually seems really odd to me. I suppose one would not need to perform utf8 conversion on any locale already using utf-8. But I guess this is how metaflac works.
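
A rough sketch of that edit, done on a copy rather than in place. It assumes the installed cuetag.sh has an assignment of the form METAFLAC="metaflac ..." (check your copy first, the exact line may differ between versions); patch_cuetag is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: write a patched copy of cuetag.sh ($1) to $2,
# inserting --no-utf8-convert right after "metaflac" in the METAFLAC line.
# Assumption: the script contains a line like  METAFLAC="metaflac ..."
patch_cuetag() {
    sed 's/METAFLAC="metaflac /METAFLAC="metaflac --no-utf8-convert /' "$1" > "$2"
}

# Example use (paths are placeholders, adjust to your system):
#   patch_cuetag "$(command -v cuetag.sh)" ./cuetag-utf8.sh
#   sh ./cuetag-utf8.sh problematic.cue *.flac
```

If the sed pattern does not match, the copy is identical to the original, so it is worth grepping the output for --no-utf8-convert before relying on it.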

CircuitCoder avatar Mar 30 '22 16:03 CircuitCoder

Sounds related to the sleuthing done on the flacon thread. https://github.com/flacon/flacon/issues/176#issuecomment-1078955451

It's probably a bug with UTF-8 conversion of the flac library. But dunno.

vymague avatar Mar 30 '22 18:03 vymague

Yes. It seems that when building with CMake (which is what Arch Linux does), flac never uses LANGINFO. (Compare https://github.com/xiph/flac/blob/master/CMakeLists.txt with https://github.com/xiph/flac/blob/master/build/config.mk#L157)

I'm going to open an issue there. Thanks for your pointers!

CircuitCoder avatar Mar 30 '22 19:03 CircuitCoder

I just manually added -DHAVE_LANGINFO_CODESET to flac's CMakeLists.txt. This is the test result (the newly built version sits in /usr/local):

➜  tmp metaflac test.flac --set-tag=ARTIST=喵
➜  tmp metaflac test.flac --set-tag=ARTIST=喵 --no-utf8-convert
➜  tmp CHARSET=UTF-8 metaflac test.flac --set-tag=ARTIST=喵
➜  tmp /usr/local/bin/metaflac test.flac --set-tag=ARTIST=喵
➜  tmp metaflac test.flac --export-tags-to=- --no-utf8-convert
ARTIST=###
ARTIST=喵
ARTIST=喵
ARTIST=喵

This is the expected result: when not compiled with langinfo, flac falls back to the CHARSET environment variable to determine the charset. When compiled with langinfo, flac can read the charset from the locale, which solves the issue.
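
Until the packaged flac is rebuilt, the CHARSET route from the third command above can be applied to a whole tagging run, since child processes inherit the environment. A minimal sketch, where with_utf8_charset is a made-up wrapper name and the assumption is that your metaflac build is one of those that consults CHARSET:

```shell
#!/bin/sh
# Hypothetical wrapper: run the given command with CHARSET=UTF-8 set
# in its environment (and therefore in any metaflac it spawns).
with_utf8_charset() {
    env CHARSET=UTF-8 "$@"
}

# Example use (file names taken from the earlier comments):
#   with_utf8_charset cuetag.sh problematic.cue *.flac
```

This leaves cuetag.sh untouched and does not change CHARSET for the rest of the shell session.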

CircuitCoder avatar Mar 30 '22 20:03 CircuitCoder

Thanks a lot for the explanation, @CircuitCoder.

vymague avatar Mar 30 '22 22:03 vymague