[BUG] French DVB subtitles need deduplication

Open Liontooth opened this issue 5 years ago • 0 comments

CCExtractor version: 0.85

In raising this issue, I confirm the following:

  • [X] I have read and understood the contributors guide.
  • [X] I have checked that the bug-fix I am reporting can be replicated, or that the feature I am suggesting isn't already present.
  • [X] I have checked that the issue I'm posting isn't already reported.
  • [X] I have checked that the issue I'm reporting isn't already solved and no duplicates exist in closed issues and in opened issues
  • [X] I have checked the pull requests tab for existing solutions/implementations to my issue/suggestion.

My familiarity with the project is as follows:

  • [X] I am an active contributor to CCExtractor.

Necessary information

  • Is this a regression (did it work before)? [X] NO
  • What platform did you use? [ ] Windows - [X] Linux - [ ] Mac
  • What were the used arguments? -datets -ttxt -UCLA -noru -utf8

Video links

  • http://vrnewsscape.ucla.edu/dropbox/2017-07-14_1100_FR_TF1_Journal.mpg
  • http://vrnewsscape.ucla.edu/dropbox/2017-07-14_1100_FR_TF1_Journal.txt

Additional information

CCExtractor-0.85, compiled 2017-07-29 with liblept4, succeeds in extracting the DVB captions from the file above, as shown in the accompanying txt file. (The txt file is UTF-8; Chrome gets the encoding wrong and no longer has a way to correct it.) CCExtractor-0.86 and CCExtractor-0.87 fail to find any subtitles; see issue #1039.

However, each line is emitted several times in partial form before it completes, and the start of the next line sometimes repeats the end of the previous one:

20170714110001.000|20170714110001.360|CC1|distribués gratuitement pour petits,
20170714110001.360|20170714110001.480|CC1|distribués, gratuitement pour petits et
20170714110001.480|20170714110001.880|CC1|distribués, gratuitement pour petits et grands,
20170714110001.880|20170714110002.280|CC1|distribués, gratuitement pour …
20170714110002.280|20170714110002.440|CC1|distribués, gratuitement pour petits et grands,, histoire que
20170714110002.440|20170714110002.840|CC1|petits et grands,, histoire que pe rd u re,
20170714110002.840|20170714110003.120|CC1|petits et grands,, histoire que pe rd u re, cette
20170714110003.120|20170714110003.400|CC1|petits et grands,, histoire que pe rd u re, cette a n n ée
20170714110003.400|20170714110003.800|CC1|petits et grands,, histoire que perdure, cette année encore,
20170714110003.800|20170714110003.880|CC1|petits et grands,, histoire que perdure, cette année encore, la

CCExtractor has already solved this duplication problem for teletext; it is clearly also present in some DVB subtitles, notably those from the French network TF1.
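To illustrate the kind of deduplication being asked for, here is a minimal post-processing sketch in Python that works on the ttxt output shown above. It is not CCExtractor's teletext logic and not a proposed patch; the function names `key` and `collapse` are hypothetical. It assumes that a caption is "still building up" whenever its text and the previous text are prefixes of one another once spacing and punctuation are ignored, which also absorbs OCR noise such as "pe rd u re" and the stray commas.

```python
import re


def key(text):
    # Comparison key: keep only letters and digits, so differences in
    # spacing and punctuation (OCR noise) do not prevent a match.
    return re.sub(r"[\W_]+", "", text.lower())


def collapse(lines):
    # Each input line is "start|end|channel|text", as in the ttxt output.
    merged = []
    for line in lines:
        start, end, channel, text = line.split("|", 3)
        if merged:
            prev = merged[-1]
            a, b = key(prev[3]), key(text)
            same_caption = a and b and (a.startswith(b) or b.startswith(a))
            if prev[2] == channel and same_caption:
                # Same caption at a different stage: extend the time span
                # and keep whichever version carries more text.
                prev[1] = end
                if len(b) >= len(a):
                    prev[3] = text
                continue
        merged.append([start, end, channel, text])
    return ["|".join(m) for m in merged]


if __name__ == "__main__":
    sample = [
        "20170714110001.000|20170714110001.360|CC1|distribués gratuitement pour petits,",
        "20170714110001.360|20170714110001.480|CC1|distribués, gratuitement pour petits et",
        "20170714110001.480|20170714110001.880|CC1|distribués, gratuitement pour petits et grands,",
    ]
    for row in collapse(sample):
        print(row)
```

Applied to the ten lines quoted above, this rule reduces them to two: one ending in "histoire que" and one ending in "cette année encore, la". It does not remove the second kind of repetition, where the new line starts with the tail of the previous one ("petits et grands,, histoire que"); that would need a separate overlap-trimming step.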

Liontooth, Nov 18 '18 19:11