ccextractor icon indicating copy to clipboard operation
ccextractor copied to clipboard

[BUG] ocr.c writing outside allocated memory

Open dcjm opened this issue 4 years ago • 9 comments

I am raising this as an issue but I do have a fix for the problem so it could be a pull request. However, this is a full explanation of the problem in case you want to try a different approach.

When running ccextractor on some ts files I found two cases where malloc was reporting memory corruption. This was tested first with the 0.88 release and then with git master. I ran valgrind and got the following:

==32409== Invalid write of size 1 ==32409== at 0x483CA14: memmove (vg_replace_strmem.c:1270) ==32409== by 0x13F423: ocr_bitmap (ocr.c:671) ==32409== by 0x13FEEC: ocr_rect (ocr.c:915) ==32409== by 0x17B828: write_dvb_sub (dvb_subtitle_decoder.c:1663) ==32409== by 0x17BBD9: dvbsub_handle_display_segment (dvb_subtitle_decoder.c:1726) ==32409== by 0x17C077: dvbsub_decode (dvb_subtitle_decoder.c:1832) ==32409== by 0x149BEA: process_data (general_loop.c:644) ==32409== by 0x14AB6E: general_loop (general_loop.c:1019) ==32409== by 0x137058: api_start (ccextractor.c:213) ==32409== by 0x137D7F: main (ccextractor.c:534) ==32409== Address 0x9b75967 is 0 bytes after a block of size 135 alloc'd ==32409== at 0x483577F: malloc (vg_replace_malloc.c:299) ==32409== by 0x13F07E: ocr_bitmap (ocr.c:595) ==32409== by 0x13FEEC: ocr_rect (ocr.c:915) ==32409== by 0x17B828: write_dvb_sub (dvb_subtitle_decoder.c:1663) ==32409== by 0x17BBD9: dvbsub_handle_display_segment (dvb_subtitle_decoder.c:1726) ==32409== by 0x17C077: dvbsub_decode (dvb_subtitle_decoder.c:1832) ==32409== by 0x149BEA: process_data (general_loop.c:644) ==32409== by 0x14AB6E: general_loop (general_loop.c:1019) ==32409== by 0x137058: api_start (ccextractor.c:213) ==32409== by 0x137D7F: main (ccextractor.c:534) ==32409== ==32409== Invalid write of size 1 ==32409== at 0x13F466: ocr_bitmap (ocr.c:679) ==32409== by 0x13FEEC: ocr_rect (ocr.c:915) ==32409== by 0x17B828: write_dvb_sub (dvb_subtitle_decoder.c:1663) ==32409== by 0x17BBD9: dvbsub_handle_display_segment (dvb_subtitle_decoder.c:1726) ==32409== by 0x17C077: dvbsub_decode (dvb_subtitle_decoder.c:1832) ==32409== by 0x149BEA: process_data (general_loop.c:644) ==32409== by 0x14AB6E: general_loop (general_loop.c:1019) ==32409== by 0x137058: api_start (ccextractor.c:213) ==32409== by 0x137D7F: main (ccextractor.c:534) ==32409== Address 0x9b7596a is 3 bytes after a block of size 135 alloc'd ==32409== at 0x483577F: malloc (vg_replace_malloc.c:299) ==32409== by 0x13F07E: ocr_bitmap (ocr.c:595) ==32409== by 0x13FEEC: ocr_rect (ocr.c:915) ==32409== by 0x17B828: write_dvb_sub (dvb_subtitle_decoder.c:1663) ==32409== by 0x17BBD9: dvbsub_handle_display_segment (dvb_subtitle_decoder.c:1726) ==32409== by 0x17C077: dvbsub_decode (dvb_subtitle_decoder.c:1832) ==32409== by 0x149BEA: process_data (general_loop.c:644) ==32409== by 0x14AB6E: general_loop (general_loop.c:1019) ==32409== by 0x137058: api_start (ccextractor.c:213) ==32409== by 0x137D7F: main (ccextractor.c:534)

Adding assertions and then looking at the values with gdb showed that the assumption in the comment on line 635 is incorrect:

(gdb) print text_out $1 = 0x5555570b91f0 "Of course. I'll do\nwhatever | can to h<font color="#ffff00">elp.\n"

(gdb) print text_out $1 = 0x555556caa9a0 "Ford resented him.\nSo he cooked up <font color="#ffff00">a pack of lies\n"

The overflow comes about because the loop at line 650 leaves last_font_tag pointing to the start of the line but the call to strstr on line 658 sets last_font_tag_end to the closing > on the subsequent line. The fix is to add if (last_font_tag_end > line_end) last_font_tag_end = NULL; after line 658.

If you want to investigate this yourselves I can put the subtitle streams on a public server. There may be a better way to fix this.

dcjm avatar Apr 06 '20 17:04 dcjm

@NilsIrl want to look into this?

cfsmp3 avatar Apr 06 '20 19:04 cfsmp3

If you want to investigate this yourselves I can put the subtitle streams on a public server.

That would be very helpful.

NilsIrl avatar Apr 07 '20 10:04 NilsIrl

The streams are here. These streams contain only the subtitles. The video and audio have been stripped with ffmpeg.

Ccextractor was built on a Debian Buster (amd64) machine. uname -a Linux athena 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26) x86_64 GNU/Linux

The current version, including my fix and the added assertions is: CCExtractor detailed version info Version: 0.88 Git commit: 6b6f548531b7b681d5bd99a563077e6fd80a8ba9 Compilation date: 2020-04-06 File SHA256: 51eef32d5888a585bc83ffcf8f9a7de914896feb87776be9d1cea5e18c74545d Libraries used by CCExtractor Tesseract Version: 4.0.0 Leptonica Version: leptonica-1.76.0 libGPAC Version: 0.7.2-DEV zlib: 1.2.11 utf8proc Version: 2.4.0 protobuf-c Version: 1.3.1 libpng Version: 1.6.35 FreeType libhash nuklear libzvbi

dcjm avatar Apr 07 '20 12:04 dcjm

I do have a fix for the problem

If you have already written the code needed, there's no reason to not send a pull request even if we find a better solution.

NilsIrl avatar Apr 08 '20 18:04 NilsIrl

I was not able to reproduce this on master or v0.88 with stream1.ts.

Output
$ TESSDATA_PREFIX=/nix/store/ig0wnzsbs9j7jaa0bh36ybzl99x6838p-tesseract-3.05.00/share/ valgrind ./ccextractor files/stream1.ts
==14241== Memcheck, a memory error detector
==14241== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==14241== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==14241== Command: ./ccextractor files/stream1.ts
==14241==
CCExtractor 0.88, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
Input: files/stream1.ts
[Extract: 1] [Stream mode: Autodetect]
[Program : Auto ] [Hauppage mode: No] [Use MythTV code: Auto]
[Timing mode: Auto] [Debug: No] [Buffer input: No]
[Use pic_order_cnt_lsb for H.264: No] [Print CC decoder traces: No]
[Target format: .srt] [Encoding: UTF-8] [Delay: 0] [Trim lines: No]
[Add font color data: Yes] [Add font typesetting: Yes]
[Convert case: No][Filter profanity: No] [Video-edit join: No]
[Extraction start time: not set (from start)]
[Extraction end time: not set (to end)]
[Live stream: No] [Clock frequency: 90000]
[Teletext page: Autodetect]
[Start credits text: None]
[Quantisation-mode: CCExtractor's internal function]

-----------------------------------------------------------------
Opening file: files/stream1.ts
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode
100%  |  117:35
Number of NAL_type_7: 0
Number of VCL_HRD: 0
Number of NAL HRD: 0
Number of jump-in-frames: 0
Number of num_unexpected_sei_length: 0

Min PTS:				00:00:01:400
Max PTS:				01:57:36:679
Length:				 01:57:35:279
Done, processing time = 1084 seconds
Issues? Open a ticket here
https://github.com/CCExtractor/ccextractor/issues
==14241==
==14241== HEAP SUMMARY:
==14241==     in use at exit: 866,857 bytes in 848 blocks
==14241==   total heap usage: 74,683,343 allocs, 74,682,495 frees, 316,115,451,165 bytes allocated
==14241==
==14241== LEAK SUMMARY:
==14241==    definitely lost: 866,305 bytes in 847 blocks
==14241==    indirectly lost: 0 bytes in 0 blocks
==14241==      possibly lost: 0 bytes in 0 blocks
==14241==    still reachable: 552 bytes in 1 blocks
==14241==         suppressed: 0 bytes in 0 blocks
==14241== Rerun with --leak-check=full to see details of leaked memory
==14241==
==14241== For lists of detected and suppressed errors, rerun with: -s
==14241== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

NilsIrl avatar Apr 08 '20 20:04 NilsIrl

Did you try stream2.ts? This seems to show the problem quite quickly. As a further check I installed Debian Buster in a new virtual machine with the minimum necessary to build ccextractor. It processed stream1.ts without a problem. I have a feeling that only failed when I added explicit bounds checking assertions. The output for stream2.ts was:

Opening file: stream2.ts
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode
malloc_consolidate(): invalid chunk size
Aborted

Stream3.ts produced

Opening file: stream3.ts
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode
double free or corruption (!prev)
Aborted

dcjm avatar Apr 09 '20 16:04 dcjm

Well, I've tried with stream2.ts and stream3.ts but to no avail, nothing wrong reported by valgrind on my side.

when I added explicit bounds checking assertions

Could you send a diff/patch/branch of the checks you added?

NilsIrl avatar Apr 12 '20 21:04 NilsIrl

@dcjm what exact command line did you use?

cfsmp3 avatar Apr 12 '20 21:04 cfsmp3

The commit with both the assertions and the fix is here. The fix is the addition of lines 660 and 661 so that needs to be removed or commented out to trigger the assertions.

The command line was simply: ccextractor/linux/ccextractor stream2.ts

Which Linux platform do you use for testing? I may be able to run some comparisons. The issue could come down to the version of tesseract. This is what is installed on my testing virtual machine:

dpkg -a -l | grep tesser
ii  libtesseract-dev:amd64        4.0.0-2                     amd64        Development files for the tesseract command line OCR tool
ii  libtesseract4:amd64           4.0.0-2                     amd64        Tesseract OCR library
ii  tesseract-ocr                 4.0.0-2                     amd64        Tesseract command line OCR tool
ii  tesseract-ocr-eng             1:4.00~git30-7274cfa-1      all          tesseract-ocr language files for English
ii  tesseract-ocr-osd             1:4.00~git30-7274cfa-1      all          tesseract-ocr language files for script and orientation

dcjm avatar Apr 13 '20 15:04 dcjm

I couldn't reproduce (and I spent quite a bit of time today with those samples because they showed two memory leaks that took a bit to fix). Anyway, I've added those two lines.

Closing.

cfsmp3 avatar Mar 22 '23 04:03 cfsmp3