Corrupt or empty subtitles (OCR, ts, DVB)

Open claunia opened this issue 8 years ago • 12 comments

On some files, subtitles come out empty even though the program was subtitled, or they come out corrupt, containing garbage characters.

I tried recordings from Imagenio and from DVB-T in Spain; it happens in all tested broadcasts.

Test files have been put on /repository/Natalia

Regards

claunia avatar Nov 01 '15 17:11 claunia

@claunia we're going to spend a bit of time on this. What's the current status with the latest CCExtractor?

cfsmp3 avatar Nov 07 '16 20:11 cfsmp3

Hello, I can't seem to find the test files repository for this one!

ghost avatar Nov 28 '16 22:11 ghost

Here are the two files:
https://drive.google.com/open?id=0B_61ywKPmI0TLWRwY3Myc0pTMEE
https://drive.google.com/open?id=0B_61ywKPmI0TUGctV1hZSkFwalE

cfsmp3 avatar Nov 28 '16 23:11 cfsmp3

GSOC qualification: This issue gives 2 points.

cfsmp3 avatar Jan 20 '17 00:01 cfsmp3

The zip files contain a total of 4 video files:

Star Wars Rebels_Disney Channel_2014-12-12_22-24.ts: The video contains teletext subtitles.

The output is generally good, except that 2 lines are missing.

It is caused by fuzzy_memcmp in telxcc.c:809, which seems to discard the previous line if the current line has similar content to it.

EDIT: with -nolevdist, the missing lines are now output.
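
To illustrate the mechanism, here is a minimal sketch of an edit-distance comparison with an exact-match switch. It is not the code in telxcc.c: `levenshtein`, `lines_match`, `use_levdist` and `max_dist` are hypothetical names, and the threshold is an assumption chosen only for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Two-row Levenshtein distance between two NUL-terminated strings.
 * Returns -1 if memory allocation fails. */
int levenshtein(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    size_t la = strlen(a), lb = strlen(b);
    int *prev = malloc((lb + 1) * sizeof *prev);
    int *curr = malloc((lb + 1) * sizeof *curr);
    if (!prev || !curr) {
        free(prev);
        free(curr);
        return -1;
    }

    /* Distance from the empty prefix of a to each prefix of b. */
    for (size_t j = 0; j <= lb; j++)
        prev[j] = (int)j;

    for (size_t i = 1; i <= la; i++) {
        curr[0] = (int)i;
        for (size_t j = 1; j <= lb; j++) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            int del = prev[j] + 1;
            int ins = curr[j - 1] + 1;
            int sub = prev[j - 1] + cost;
            int best = del < ins ? del : ins;
            curr[j] = best < sub ? best : sub;
        }
        int *tmp = prev;
        prev = curr;
        curr = tmp;
    }

    int result = prev[lb];
    free(prev);
    free(curr);
    return result;
}

/* Hypothetical helper: returns 1 when two lines count as "the same line".
 * With use_levdist == 0 only exact matches count (assumed to mirror what
 * the thread describes for -nolevdist). */
int lines_match(const char *a, const char *b, int use_levdist, int max_dist)
{
    if (!use_levdist)
        return strcmp(a, b) == 0;
    int d = levenshtein(a, b);
    return d >= 0 && d <= max_dist;
}
```

With `use_levdist` set to 0, two lines that differ by a single character are no longer treated as the same line, which is the kind of change that would let near-duplicate lines through instead of dropping one of them.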

In addition, I found that -out=spupng doesn't work with teletext. I don't know whether that is expected. It crashes because of a bug in ccx_encoders_spupng.c:14. After fixing that (Patch #864), it generates .png files of 0 bytes (i.e. empty).

Star Wars Rebels_Disney Channel_2014-12-12_22-24_cortado.ts:

It has a teletext subtitle stream, but neither VLC nor Potplayer can display any subtitles. CCExtractor can't extract anything from it.

I think this may be because the video itself doesn't actually have any subtitles.

Cine Clan TVE Perez, el ratoncito de tus sueños 2.ts

It contains DVB subtitles, but CCExtractor isn't able to extract anything from it. -out=spupng doesn't work either.

The cause is that the stream never sends DVBSUB_DISPLAY_SEGMENT. The code accounts for that case, but handles it poorly. Patch: #866
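
As a rough illustration of one way to cope with a stream like this, the sketch below emits the pending display set when the next page composition segment arrives, instead of waiting for an end marker that never comes. It assumes DVBSUB_DISPLAY_SEGMENT corresponds to the end-of-display-set segment (type 0x80 in ETSI EN 300 743). `struct display_state` and `feed_segment` are hypothetical names, and this is not a description of what patch #866 does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Segment type values as defined in ETSI EN 300 743. */
enum {
    SEG_PAGE_COMPOSITION   = 0x10,
    SEG_REGION_COMPOSITION = 0x11,
    SEG_CLUT_DEFINITION    = 0x12,
    SEG_OBJECT_DATA        = 0x13,
    SEG_END_OF_DISPLAY_SET = 0x80
};

/* Hypothetical decoder state: only tracks whether a display set is pending. */
struct display_state {
    int pending;  /* 1 if segments were received but not yet emitted */
    int rendered; /* number of display sets emitted so far */
};

/* Feed one segment type. Returns 1 if a display set was emitted as a
 * result of this segment, 0 otherwise. */
int feed_segment(struct display_state *st, uint8_t segment_type)
{
    assert(st != NULL);
    int emitted = 0;

    switch (segment_type) {
    case SEG_PAGE_COMPOSITION:
        /* A new page starts. If the previous set never received its end
         * marker, emit it now rather than dropping it. */
        if (st->pending) {
            st->rendered++;
            emitted = 1;
        }
        st->pending = 1;
        break;
    case SEG_END_OF_DISPLAY_SET:
        /* Normal path: the stream explicitly closes the display set. */
        if (st->pending) {
            st->rendered++;
            st->pending = 0;
            emitted = 1;
        }
        break;
    case SEG_REGION_COMPOSITION:
    case SEG_CLUT_DEFINITION:
    case SEG_OBJECT_DATA:
        /* These segments belong to the display set being built. */
        st->pending = 1;
        break;
    default:
        /* Other segment types are ignored in this sketch. */
        break;
    }
    return emitted;
}
```

The trade-off of this approach is latency: a display set without an end marker is only emitted once the following page composition segment shows up, so the last set in a file would still need a flush at end of stream.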

Cine Clan TVE Perez, el ratoncito de tus sueños 2_cortado.ts

Same problem as "Cine Clan TVE Perez, el ratoncito de tus sueños 2.ts"

While debugging, I also discovered a heap corruption problem caused by add_ocrtext2str (Patch: #865).

harrynull avatar Dec 31 '17 05:12 harrynull

@harrynull use -nolevdist if you want fuzzy_memcmp to behave like memcmp

cfsmp3 avatar Dec 31 '17 08:12 cfsmp3

First one (teletext) works fine. However the 2nd one shows a bunch of messages:

In ocr_bitmap: Failed to perform OCR. Skipped.
In ocr_bitmap: Failed to perform OCR. Skipped.
In ocr_bitmap: Failed to perform OCR. Skipped.

Takes forever, too.

cfsmp3 avatar Jan 11 '18 21:01 cfsmp3

It happens because some of the images are completely empty and invalid for some reason. But it should not affect the output file.
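
One way to avoid handing such images to the OCR engine at all is to test for them first. The sketch below assumes an 8-bit palette-indexed bitmap with one byte per pixel. `bitmap_is_blank` is a hypothetical helper, not a function from CCExtractor, and "every pixel has the same value" is only one possible definition of an empty image.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the bitmap has nothing worth OCR-ing: NULL data, a
 * non-positive dimension, or every pixel equal to the first one.
 * Assumes one byte per pixel (palette index). */
int bitmap_is_blank(const uint8_t *pixels, int width, int height)
{
    if (pixels == NULL || width <= 0 || height <= 0)
        return 1;

    size_t n = (size_t)width * (size_t)height;
    for (size_t i = 1; i < n; i++) {
        if (pixels[i] != pixels[0])
            return 0; /* at least two distinct values: not blank */
    }
    return 1;
}
```

A caller could skip the OCR step whenever this returns 1, which would also avoid the time spent on images that cannot produce text.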

harrynull avatar Jan 12 '18 00:01 harrynull

@harrynull It does, check this out:

670
01:00:15,877 --> 01:00:20,676
Enos oi onimnro nno dnonio otnpnnio pnno oroannthonio.

671
01:00:20,677 --> 01:00:23,116
TI‘QMG. Monono. oi no“n

672
01:00:23,117 --> 01:00:27,756
sono wondido o onion nnos olrozoo non on.

That's total gibberish :-) There's definitely a correlation between those errors and the incorrect lines. It's definitely better than before, and there's lots of good output - but still not perfect.

cfsmp3 avatar Jan 12 '18 20:01 cfsmp3

@cfsmp3 It works well here:

670
01:00:15,877 --> 01:00:20,676
<font color="#00c8c6">Erao ol prlmuro ono onorla</font>
<font color="#00c8c6">atoporlo pora prognnflarlo.</font>

671
01:00:20,677 --> 01:00:23,116
<font color="#00c8c6">Tardo.</font>
<font color="#00c8c6">Mahana. ol ratOn</font>

672
01:00:23,117 --> 01:00:27,756
<font color="#00c8c6">oora vondldo a onlon</font>
<font color="#00c8c6">mao ofruzoa por or.</font>

Did you forget to put spa.traineddata in the right place?

But I did find that the <font> tag sometimes isn't closed:

24
00:03:54,997 --> 00:03:57,836
<font color="#c7c800">¢Como fue Ia fiesta?</font>
<font color="#c7c800"></font><font color="#d6d6d6">-Estuvimos esperandole.
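
A quick way to spot cues like this one, where a font tag is opened but never closed, is to count opening and closing tags. This is only a sketch: `count_occurrences` and `missing_font_closers` are hypothetical helpers, and plain substring counting does not validate nesting order.

```c
#include <stddef.h>
#include <string.h>

/* Count non-overlapping occurrences of needle in haystack. */
int count_occurrences(const char *haystack, const char *needle)
{
    size_t len = strlen(needle);
    int count = 0;

    if (len == 0)
        return 0;
    for (const char *p = strstr(haystack, needle); p != NULL;
         p = strstr(p + len, needle))
        count++;
    return count;
}

/* Returns how many closing </font> tags are missing from the cue text
 * (0 if the tags are balanced or there are surplus closers). */
int missing_font_closers(const char *text)
{
    int opened = count_occurrences(text, "<font");
    int closed = count_occurrences(text, "</font>");
    return opened > closed ? opened - closed : 0;
}
```

Applied to the second line of cue 24, this reports one missing closer, so a writer could append that many `</font>` tags before emitting the cue.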

In addition, some subtitles are skipped and missing. I am not sure whether that is a limitation of Tesseract, but I will check them later.

harrynull avatar Jan 13 '18 02:01 harrynull

That stuff in 670, 671 and 672 is not Spanish, believe me :-) (or I suspect, any other language)

cfsmp3 avatar Jan 13 '18 03:01 cfsmp3

Status update: Still broken. Possibly differently. The file that matters is Cine Clan TVE *.ts (ignore the Disney one).

We get lots of these messages:

Error in pixConvertRGBToGray: pixs not defined
Error in boxClipToRectangle: box outside rectangle
Warning in pixClipRectangle: box doesn't overlap pix
Error in pixConvertRGBToGray: pixs not defined
Error in boxClipToRectangle: box outside rectangle
Warning in pixClipRectangle: box doesn't overlap pix
Error in pixConvertRGBToGray: pixs not defined
Error in boxClipToRectangle: box outside rectangle
Warning in pixClipRectangle: box doesn't overlap pix
Error in pixConvertRGBToGray: pixs not defined
Error in boxClipToRectangle: box outside rectangle
Warning in pixClipRectangle: box doesn't overlap pix
Error in pixConvertRGBToGray: pixs not defined

and a bonus:

Direct leak of 216 byte(s) in 3 object(s) allocated from:
    #0 0x7f77522bf90f in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:69
    #1 0x556c85248761 in dvbsub_init_decoder ../src/lib_ccx/dvb_subtitle_decoder.c:424
    #2 0x556c8529ee4d in parse_PMT ../src/lib_ccx/ts_tables.c:346
    #3 0x556c85272f9e in ts_readstream ../src/lib_ccx/ts_functions.c:752
    #4 0x556c85275167 in ts_get_more_data ../src/lib_ccx/ts_functions.c:980
    #5 0x556c852a9a9f in general_loop ../src/lib_ccx/general_loop.c:1051
    #6 0x556c851a7986 in api_start ../src/ccextractor.c:205
    #7 0x556c851a9cdb in main ../src/ccextractor.c:463
    #8 0x7f775162350f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

cfsmp3 avatar Mar 22 '23 05:03 cfsmp3