[BUG] A mix of 8-bit/16-bit chars sent to iconv

Open erankor opened this issue 1 year ago • 8 comments

Necessary information

  • Is this a regression (i.e. did it work before)? NO
  • What platform did you use? Linux
  • What were the used arguments? ./ccextractor test.ts -svc all[UTF-16BE] -nofc -12

Video links

http://cdnapi.kaltura.com/p/2035982/playManifest/entryId/1_frxnu0yr/flavorId/1_tr3kiz6l/format/download/a.ts

Additional information

Hi all,

I have a TS file with 708 subtitles in Japanese & Chinese that fail to decode properly. After some debugging, I found that if I patch the function write_utf16_char (https://github.com/CCExtractor/ccextractor/blob/master/src/lib_ccx/ccx_decoders_708_output.c#L113) to always output 2-byte chars (I changed the if to if (1)) and specify an encoding of UTF-16BE, it decodes properly.

This code looks off to me, as it creates a mix of 8-bit & 16-bit chars with no clear encoding (it's neither UTF-8 nor UTF-16). Maybe when iconv is used, the function should always output 2-byte chars? Alternatively, it could use 2 bytes for ALL chars whenever ANY char doesn't fit in 1 byte, which would also be ok (but sounds more complex to do).
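To make the proposal concrete, here is a minimal sketch in C of the two behaviours described above. The function names (emit_mixed, emit_utf16be, encode_units) are hypothetical and are not taken from the CCExtractor source; emit_mixed only approximates the behaviour as reported in this issue (1 byte when the value fits in 8 bits, 2 bytes otherwise), while emit_utf16be always writes 2 big-endian bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the reported behaviour (not the real source):
 * writes 1 byte when the code unit fits in 8 bits, 2 bytes otherwise.
 * Returns the number of bytes written. */
size_t emit_mixed(uint16_t unit, unsigned char *out)
{
    assert(out != NULL);
    if (unit >> 8) {
        out[0] = (unsigned char)(unit >> 8);
        out[1] = (unsigned char)(unit & 0xFF);
        return 2;
    }
    out[0] = (unsigned char)unit;
    return 1;
}

/* Proposed behaviour: always write 2 bytes, high byte first (UTF-16BE). */
size_t emit_utf16be(uint16_t unit, unsigned char *out)
{
    assert(out != NULL);
    out[0] = (unsigned char)(unit >> 8);
    out[1] = (unsigned char)(unit & 0xFF);
    return 2;
}

/* Encodes n code units with the given emitter; returns total bytes written.
 * The caller must provide at least 2 * n bytes in out. */
size_t encode_units(const uint16_t *units, size_t n, unsigned char *out,
                    size_t (*emit)(uint16_t, unsigned char *))
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++)
        len += emit(units[i], out + len);
    return len;
}
```

With the input U+77E5, U+0020, U+3063, the mixed emitter produces 5 bytes and the space occupies a single byte, so every later character starts on an odd offset. The fixed-width emitter produces 6 bytes that a UTF-16BE consumer such as iconv can read pair by pair.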

Btw, VLC decodes the Japanese & Chinese properly, after changing the 'preferred closed captions decoder' setting from 608 to 708.

Thanks!

Eran

erankor avatar Aug 24 '22 06:08 erankor

Could you share the output of ccextractor --version?

PunitLodha avatar Aug 24 '22 13:08 PunitLodha

./ccextractor --version
CCExtractor 0.89, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
CCExtractor detailed version info
        Version: 0.89
        Git commit: b793f16343dc442bcb977387fcef08195e71dd7c
        Compilation date: 2022-08-23
        File SHA256: 259ccd18d508a3aed03149080853f98d1bce57672ce20c9b715953227621c9d9
Libraries used by CCExtractor
        Tesseract Version: 3.03
        Leptonica Version: leptonica-1.70
        libGPAC Version: 1.0.1
        zlib: 1.2.11
        utf8proc Version: 2.4.0
        protobuf-c Version: 1.3.1
        libpng Version: 1.6.37
        FreeType
        libhash
        nuklear
        libzvbi

erankor avatar Aug 24 '22 13:08 erankor

You are using version 0.89. Could you try the latest version (0.94)?

PunitLodha avatar Aug 24 '22 13:08 PunitLodha

I reverted my change and pulled the latest master. It now decodes some of the text (which is better than the previous version, IIRC), but every space in the text still breaks the decoding, and I get some non-printable chars in the output.

Output without any code changes -

1
00:00:01,068 --> 00:00:03,770
人々が私を知‰挰弰栰䴰Ź섰漠時間管理につい‰晦<U+F830>䐰昰䐰縰

Output after forcing write_utf16_char to always use 2 bytes -

1
00:00:01,068 --> 00:00:03,770
人々が私を知 ったとき、私は 時間管理につい て書いています
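The garbled output is consistent with a one-byte space shifting the 2-byte alignment. The following C sketch illustrates this under stated assumptions: decode_utf16be is a hypothetical helper, and the byte arrays are hand-built for the characters U+77E5, space, U+3063, U+305F rather than captured from the actual stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: reads big-endian 16-bit units from buf.
 * A trailing odd byte is ignored. Returns the number of units read.
 * The caller must provide room for len / 2 units. */
size_t decode_utf16be(const unsigned char *buf, size_t len, uint16_t *units)
{
    assert(buf != NULL && units != NULL);
    size_t n = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        units[n++] = (uint16_t)((buf[i] << 8) | buf[i + 1]);
    return n;
}

/* U+77E5, then a space written as ONE byte, then U+3063, U+305F. */
static const unsigned char mixed_bytes[] = {
    0x77, 0xE5, 0x20, 0x30, 0x63, 0x30, 0x5F
};

/* Same text with the space written as TWO bytes (0x00 0x20). */
static const unsigned char fixed_bytes[] = {
    0x77, 0xE5, 0x00, 0x20, 0x30, 0x63, 0x30, 0x5F
};
```

Decoding mixed_bytes as UTF-16BE gives 0x77E5, 0x2030, 0x6330: the space byte is paired with the high byte of the next character, and every following pair is shifted by one byte. U+2030 is the per-mille sign seen right after 知 in the unpatched output. Decoding fixed_bytes gives back 0x77E5, 0x0020, 0x3063, 0x305F.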

I don't speak Japanese myself :) but Google Translate confirms the fixed version is better.

Current version -

./ccextractor --version
CCExtractor 0.94, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
CCExtractor detailed version info
        Version: 0.94
        Git commit: 4cb474c5a36b61bafec4a2379c4d0b240e44359b
        Compilation date: 2022-08-24
        CEA-708 decoder: C
        File SHA256: 8fd4f5625eb6aadb30532a2ff9f29adaec4b60a77916e3f001d5f4e59d4d08e9
Libraries used by CCExtractor
        Tesseract Version: 3.03
        Leptonica Version: leptonica-1.70
        libGPAC Version: 1.0.1
        zlib: 1.2.11
        utf8proc Version: 2.4.0
        protobuf-c Version: 1.3.1
        libpng Version: 1.6.37
        FreeType
        libhash
        nuklear
        libzvbi

erankor avatar Aug 24 '22 14:08 erankor

You could send a PR. If it doesn't cause any issues with the other tests, we can merge it.

PunitLodha avatar Aug 24 '22 17:08 PunitLodha