[BUG] French DVB subtitles stopped working

Open Liontooth opened this issue 5 years ago • 10 comments

Please prefix your issue with one of the following: [BUG], [PROPOSAL], [QUESTION].

CCExtractor version (using the --version parameter preferably) : ccextractor-0.87

In raising this issue, I confirm the following (please check boxes, eg [X] - and delete unchecked ones):

  • [X] I have read and understood the contributors guide.
  • [X] I have checked that the issue I'm posting isn't already reported.
  • [X] I have checked that the issue I'm posting isn't already solved and no duplicates exist in closed issues and in opened issues
  • [X] I have checked the pull requests tab for existing solutions/implementations to my issue/suggestion.
  • [X] I have used the latest available version of CCExtractor to verify this issue exists.

My familiarity with the project is as follows (check one, eg [X] - and delete unchecked ones):

  • [X] I am an active contributor to CCExtractor.

Necessary information

  • Is this a regression (did it work before)? [ ] NO | [X] YES - last known working version 0.85
  • What platform did you use? [ ] Windows - [X] Linux - [ ] Mac
  • What were the used arguments? -pn 257 -tpage 888 -datets -ttxt -UCLA -noru -utf8 -parsepat -parsepmt

Video links

http://vrnewsscape.ucla.edu/dropbox/2017-07-24_1800_FR_FR2_Journal_20h00.mpg http://vrnewsscape.ucla.edu/dropbox/2017-07-24_1800_FR_FR2_Journal_20h00.txt

Additional information

CCExtractor-0.85 compiled 2017-07-29 with liblept4 succeeds in extracting DVB captions from the file above, as shown in the accompanying txt file.

CCExtractor-0.86 and CCExtractor-0.87 fail to find any subtitles.

Liontooth avatar Nov 18 '18 07:11 Liontooth

Hey,

Can you please provide the output of ./ccextractor --version? I want to check which Tesseract and Leptonica versions you are using.

Also, can you let me know whether the spupng output format works for you?

I was able to get subtitles out of your file, though they are not very accurate:

19700101000000.000|19700101000002.080|fra|DVB|RTL R ^M
19700101000002.080|19700101000002.320|fra|DVB|LOTS OUTRE CITE ETC ^M
19700101000019.560|19700101000022.440|fra|DVB|- Nous préservons l'eau
DETTE RU CR ^M
19700101000022.440|19700101000024.480|fra|DVB|STE CEE ^M
19700101000024.480|19700101000026.560|fra|DVB|DOTE ET Le
RTE RTE ^M
19700101000026.560|19700101000028.520|fra|DVB|- Pour stopper La progression ^M
19700101000028.520|19700101000030.560|fra|DVB|DURD'ERECUNEREN ^M
19700101000030.560|19700101000032.640|fra|DVB|Ru CEE Ce ES
EC ^M
19700101000032.640|19700101000034.680|fra|DVB|ER ROIS [UT ER ^M
19700101000034.680|19700101000037.320|fra|DVB|Quand c'est LUCE
d'avancer plus Loin, 6 Canadair ^M
19700101000037.320|19700101000042.600|fra|DVB|LATTLu Tu UE tL re [NS ^M
19700101000042.600|19700101000045.080|fra|DVB|Aujourd'hui, c'est tout Le Sud-Est
qui a été frappé ^M

Following is my version information:

anshul@anshul-desktop:~/Project/Multimedia/ccextractor/build$ ./ccextractor --version
CCExtractor 0.87, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
CCExtractor detailed version info
        Version: 0.87
        Git commit: 38fc6e56234f9b7bac9945bd7080d549726a2031
        Compilation date: 2018-11-18
        File SHA256: 49efa2d1b0288cdb7ec4face40a106386e8c988c408fc3e51d7d986266f66908
Libraries used by CCExtractor
        Tesseract Version: 4.0.0
        Leptonica Version: leptonica-1.75.3
        libGPAC Version: 0.7.2-DEV
        zlib: 1.2.11
        utf8proc Version: 2.2.0
        protobuf-c Version: 1.1.1
        libpng Version: 1.6.35
        FreeType 
        libhash
        nuklear
        libzvbi

anshul1912 avatar Nov 18 '18 13:11 anshul1912

I did not understand why you cross-referenced this issue. Do you mean that the only problem you have is duplication?

anshul1912 avatar Nov 19 '18 01:11 anshul1912

@anshul1912 No, the duplication problem was reported using 0.85, since @Liontooth was not able to extract French DVB subtitles in 0.86 and 0.87. 🙂 He mentioned this in the additional information and hence cross-referenced this issue.

From #1040 :

CCExtractor-0.85 compiled 2017-07-29 with liblept4 succeeds in extracting DVB captions from the file above, as shown in the accompanying txt file. (CCExtractor-0.86 and CCExtractor-0.87 fail to find any subtitles, see issue #1039.)

saurabhshri avatar Nov 22 '18 14:11 saurabhshri

Hi -- sorry to be slow. Anshul, your attempt to extract the text demonstrates the regression. Version 0.85 does a great job -- close to perfect (Chrome no longer lets you set the character set and gets this one wrong; in fact it's UTF-8):

20170714180000.000|20170714180002.080|fra|DVB|cette route. 
20170714180002.080|20170714180002.320|fra|DVB|Tout se joue à la seconde près. 
20170714180019.560|20170714180022.440|fra|DVB|- Nous préservons l'eau parce que nous en avons très peu 
20170714180022.440|20170714180024.480|fra|DVB|sur le dispositif. 
20170714180024.480|20170714180026.560|fra|DVB|Nous arrosons au moment le plus opportun. 
20170714180026.560|20170714180028.520|fra|DVB|- Pour stopper la progression du feu... 
20170714180028.520|20170714180030.560|fra|DVB|- Il n'y a pas plus, là? 
20170714180030.560|20170714180032.640|fra|DVB|- Les pompiers sont obligés de s'enfoncer 
20170714180032.640|20170714180034.680|fra|DVB|sur des terrains accidentés. 
20170714180034.680|20170714180037.320|fra|DVB|Quand c‘est impossible d‘avancer plus loin, 6 Canadair 
20170714180037.320|20170714180042.600|fra|DVB|Viennent en renfort. 
20170714180042.600|20170714180045.080|fra|DVB|Aujourd'hui, c‘est tout le Sud—Est qui a été frappé

In comparison, your attempt shows 0.87 gets almost nothing right. So this is a clear regression.
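A rough way to put a number on the difference is sketched below. It is only an illustration, not part of CCExtractor: the names `parse_cues` and `similarity` are hypothetical, and it assumes both files hold the same cues in the same order, because the timestamps differ between the two runs (2017... versus 1970...) and cannot be used for matching. Lines without a timestamp prefix are treated as continuations of the previous cue, and the literal `^M` markers are dropped.

```python
import difflib
import re

# Start of a record in the -ttxt/-UCLA output: start|end|lang|DVB|text
RECORD = re.compile(r"^(\d{14}\.\d{3})\|(\d{14}\.\d{3})\|(\w+)\|DVB\|(.*)$")


def parse_cues(text):
    """Return the cue texts; lines without a timestamp prefix extend the previous cue."""
    cues = []
    for raw in text.splitlines():
        line = raw.replace("^M", "").strip()
        if not line:
            continue
        match = RECORD.match(line)
        if match:
            cues.append(match.group(4).strip())
        elif cues:
            cues[-1] = (cues[-1] + " " + line).strip()
    return cues


def similarity(reference, candidate):
    """Mean per-cue similarity ratio (0.0 to 1.0), pairing cues by position."""
    pairs = list(zip(parse_cues(reference), parse_cues(candidate)))
    if not pairs:
        return 0.0
    ratios = [
        difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
        for a, b in pairs
    ]
    return sum(ratios) / len(ratios)
```

Calling `similarity(open("v085.txt").read(), open("v087.txt").read())` (file names are placeholders) gives one score per pair of runs, which makes it easier to compare builds or trained-data files than reading the output by eye.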

The version I run doesn't show a lot of information:

./ccextractor-0.85e --version
CCExtractor 0.85, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
CCExtractor detailed version info
        Version: 0.85
        Git commit: Unknown
        Compilation date: 2017-07-29

Back then, the --version flag was still not fully supported. The downloaded file is dated 19 Jan 2017. I no longer have the version information for tesseract and leptonica (other than that it's liblept4); let me know if you'd like the binary. Strace might tell you what it's using.

It would really be a pity to lose this excellent functionality! This issue was more or less completely solved, so let's try to get back to 0.85.

Cheers, David

Liontooth avatar Nov 29 '18 00:11 Liontooth

Hi David,

I see there is a problem with quantization: the output is fine if quantization is disabled.

19700101000000.000|19700101000002.080|fra|DVB|cette route. ^M
19700101000002.080|19700101000002.320|fra|DVB|Tout se joue à la seconde près. ^M
19700101000019.560|19700101000022.440|fra|DVB|- Nous préservons l'eau
parce que nous en avons très peu ^M
19700101000022.440|19700101000024.480|fra|DVB|SR EE ^M
19700101000024.480|19700101000026.560|fra|DVB|NOTÉE TEE EL
le plus opportun. ^M
19700101000026.560|19700101000028.520|fra|DVB|- Pour stopper la progression ^M
19700101000028.520|19700101000030.560|fra|DVB|- ILn'y a pas plus, Là? ^M
19700101000030.560|19700101000032.640|fra|DVB|- Les pompiers sont obligés
de s'enfoncer ^M
19700101000032.640|19700101000034.680|fra|DVB|ST RQEIROE Ce (UEER ^M
19700101000034.680|19700101000037.320|fra|DVB|Quand c'est [LES
d'avancer plus loin, 6 Canadair ^M
19700101000037.320|19700101000042.600|fra|DVB|viennent en renfort. ^M
19700101000042.600|19700101000045.080|fra|DVB|Aujourd'hui, c'est tout Le Sud-Est
qui a été frappé ^M

I ran ccextractor as follows:

./ccextractor ~/Videos/Samples/DVB/2017-07-24_1800_FR_FR2_Journal_20h00.mpg -quant 0 -pn 257 -tpage 888 -datets -ttxt -UCLA -noru -utf8 -parsepat -parsepmt -o a.txt

Can you confirm whether -quant 0 works perfectly for you in 0.87?
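To compare the default run against the `-quant 0` run on the same sample, something like the following sketch could be used. The flags are the ones quoted in this thread; the helper names `build_command` and `run_both`, the binary path, and the output file names are placeholders chosen for illustration.

```python
import subprocess

# Arguments from the original report, kept identical for both runs.
BASE_ARGS = [
    "-pn", "257", "-tpage", "888", "-datets", "-ttxt", "-UCLA",
    "-noru", "-utf8", "-parsepat", "-parsepmt",
]


def build_command(binary, sample, output, quant=None):
    """Build the argument list; add -quant only when a value is given."""
    cmd = [binary, sample]
    if quant is not None:
        cmd += ["-quant", str(quant)]
    return cmd + BASE_ARGS + ["-o", output]


def run_both(binary, sample):
    """Run once with default quantization and once with -quant 0."""
    for quant, output in ((None, "default.txt"), (0, "quant0.txt")):
        # check=False: a failed run should not stop the second one.
        subprocess.run(build_command(binary, sample, output, quant), check=False)
```

For example, `run_both("./ccextractor", "2017-07-24_1800_FR_FR2_Journal_20h00.mpg")` would leave `default.txt` and `quant0.txt` side by side for comparison.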

anshul1912 avatar Dec 09 '18 11:12 anshul1912

Only the beginning of the output is fine; the complete output is still bad. It looks like the latest fra trained data is worse than the older one.

anshul1912 avatar Dec 09 '18 13:12 anshul1912

When I tried to compare with 0.85, my output file was completely empty.

CCExtractor 0.85, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
Input: /home/anshul/Videos/Samples/DVB/2017-07-24_1800_FR_FR2_Journal_20h00.mpg
[Extract: 1] [Stream mode: Autodetect]
[Program : 257 ] [Hauppage mode: No] [Use MythTV code: Auto]
[Timing mode: Auto] [Debug: No] [Buffer input: No]
[Use pic_order_cnt_lsb for H.264: No] [Print CC decoder traces: No]
[Target format: .txt] [Encoding: UTF-8] [Delay: 0] [Trim lines: No]
[Add font color data: Yes] [Add font typesetting: Yes]
[Convert case: No] [Video-edit join: No]
[Extraction start time: not set (from start)]
[Extraction end time: not set (to end)]
[Live stream: No] [Clock frequency: 90000]
[Teletext page: 888]
[Start credits text: None]

-----------------------------------------------------------------
Opening file: /home/anshul/Videos/Samples/DVB/2017-07-24_1800_FR_FR2_Journal_20h00.mpg
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode
Read PAT packet (id: 0) ts-id: 0x0001
  section length: 13  number: 0  last: 0
  version_number: 0  current_next_indicator: 1

Program association section (PAT)
  Program number: 257  -> PMTPID: 110
This PID (110) is a PMT for program 257.
   120 |  1B ( 27) | H.264 video
   130 |   6 (  6) | MPEG-2 private data
   131 |   6 (  6) | MPEG-2 private data
   132 |   6 (  6) | MPEG-2 private data
   140 |   6 (  6) | MPEG-2 private data
   142 |   6 (  6) | MPEG-2 private data
---
Creating a.txt
100%  |  335:42
Number of NAL_type_7: 0
Number of VCL_HRD: 0
Number of NAL HRD: 0
Number of jump-in-frames: 0
Number of num_unexpected_sei_length: 0

Min PTS:                                00:00:00:000
Max PTS:                                05:35:42:551
Length:                          05:35:42:551
Done, processing time = 46 seconds
Issues? Open a ticket here
https://github.com/CCExtractor/ccextractor/issues
**No captions were found in input.**

anshul1912 avatar Dec 09 '18 14:12 anshul1912

I've experienced a very similar issue to this with DVB subtitles from British TV. Using v0.87 newly built on Ubuntu with tesseract 4.0.0 I get the No captions were found in input. error. Previously using v0.84 built against an older version of tesseract the subtitles were converted to srt almost perfectly.

What I discovered through trial and error is that this seems to be an issue with the newer version of tesseract. Tesseract has various data files containing trained data for different languages here: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files. As that page says, the tessdata_fast file listed under the "September 15 2017" section is what is installed by default. If I instead install a language file from the plain "tessdata" folder under that section, or a file from the section "November 29, 2016", then ccextractor works as expected.

I'm uncertain why this works. Tesseract 4.0.0 has a newer "LSTM" engine which could be part of the problem, but testing with different combinations of data files and forcing different engines gave conflicting results. Some combinations when using LSTM also gave extremely bad detection for some sentences, e.g. the second line should say "you never actually came here":

13
00:01:40,652 --> 00:01:43,511
<font color="#ffff00">You know, throughout everything,</font>
<font color="#ffff00">IEEE EOE CECE ECE</font>

Ultimately using the plain tessdata file from https://github.com/tesseract-ocr/tessdata seems to work.
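Since the result depends on which trained-data file is actually picked up, it can help to check what is installed before swapping files. The sketch below is an assumption-laden illustration: `find_traineddata` is a hypothetical helper, and it assumes the data directory is pointed to by the `TESSDATA_PREFIX` environment variable, checking both the directory itself and a `tessdata` subdirectory because the expected layout may vary between versions. The file size is reported because the data-file variants typically differ in size, which can help tell them apart.

```python
import os
from pathlib import Path


def find_traineddata(lang, prefix=None):
    """Return (path, size in bytes) for each <lang>.traineddata found under prefix."""
    if prefix is None:
        prefix = os.environ.get("TESSDATA_PREFIX", "")
    if not prefix:
        return []
    base = Path(prefix)
    # Check both layouts: prefix is the data directory, or its parent.
    candidates = [
        base / f"{lang}.traineddata",
        base / "tessdata" / f"{lang}.traineddata",
    ]
    return [(str(p), p.stat().st_size) for p in candidates if p.is_file()]
```

For instance, `find_traineddata("fra")` lists the French data files visible under the current `TESSDATA_PREFIX`, so the size can be compared against the file downloaded from the plain tessdata repository.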

Having said all that, the changelog for ccextractor v0.88 says "New: Add support for tesseract 4.0", so maybe we shouldn't expect it to work properly in 0.87. I do get other issues using ccextractor from the latest git though, so for now using the alternative tessdata in 0.87 seems to be the solution.

thunderbolt-tom avatar Jan 24 '19 00:01 thunderbolt-tom

Actually, ccextractor v0.87 can be compiled with tesseract 4.0 and leptonica 1.77.0. Just use libpng 1.6.34 when compiling.

ggnull35 avatar Feb 26 '19 09:02 ggnull35

@Liontooth Can you provide updated samples? Can't download that one. We're cleaning up issues now (overdue, I know).

cfsmp3 avatar Nov 21 '21 18:11 cfsmp3

Closing due to no samples

cfsmp3 avatar Mar 22 '23 06:03 cfsmp3