ccextractor
[BUG] French DVB subtitles need deduplication
CCExtractor version: 0.85
In raising this issue, I confirm the following:
- [X] I have read and understood the contributors guide.
- [X] I have checked that the bug-fix I am reporting can be replicated, or that the feature I am suggesting isn't already present.
- [X] I have checked that the issue I'm posting isn't already reported.
- [X] I have checked that the issue I'm reporting isn't already solved and no duplicates exist in closed issues and in opened issues
- [X] I have checked the pull requests tab for existing solutions/implementations to my issue/suggestion.
My familiarity with the project is as follows:
- [X] I am an active contributor to CCExtractor.
Necessary information
- Is this a regression (did it work before)? [X] NO
- What platform did you use? [ ] Windows - [X] Linux - [ ] Mac
- What were the used arguments?
-datets -ttxt -UCLA -noru -utf8
**Video links**
http://vrnewsscape.ucla.edu/dropbox/2017-07-14_1100_FR_TF1_Journal.mpg
http://vrnewsscape.ucla.edu/dropbox/2017-07-14_1100_FR_TF1_Journal.txt
Additional information
CCExtractor-0.85, compiled 2017-07-29 with liblept4, succeeds in extracting DVB captions from the file above, as shown in the accompanying txt file. (Chrome gets the encoding wrong and no longer has a way to correct it; the file is in fact UTF-8.) CCExtractor-0.86 and CCExtractor-0.87 fail to find any subtitles; see issue #1039.
However, each line is emitted several times in partial form before it is complete, and part of it is sometimes repeated at the start of the following line:
20170714110001.000|20170714110001.360|CC1|distribués gratuitement pour petits,
20170714110001.360|20170714110001.480|CC1|distribués, gratuitement pour petits et
20170714110001.480|20170714110001.880|CC1|distribués, gratuitement pour petits et grands,
20170714110001.880|20170714110002.280|CC1|distribués, gratuitement pour …
20170714110002.280|20170714110002.440|CC1|distribués, gratuitement pour petits et grands,, histoire que
20170714110002.440|20170714110002.840|CC1|petits et grands,, histoire que pe rd u re,
20170714110002.840|20170714110003.120|CC1|petits et grands,, histoire que pe rd u re, cette
20170714110003.120|20170714110003.400|CC1|petits et grands,, histoire que pe rd u re, cette a n n ée
20170714110003.400|20170714110003.800|CC1|petits et grands,, histoire que perdure, cette année encore,
20170714110003.800|20170714110003.880|CC1|petits et grands,, histoire que perdure, cette année encore, la
CCExtractor has solved this duplication problem in teletext; it's clearly also present in some DVB subtitles, notably those of the French network TF1.
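
To make the requested behaviour concrete, here is a rough Python sketch of one possible post-processing pass over the `-ttxt` output shown above. It is not how CCExtractor deduplicates teletext, and the names `normalize` and `dedupe` are made up for this illustration. It assumes that two consecutive cues belong together when one text is a prefix of the other once everything except letters and digits is ignored (so OCR noise such as `pe rd u re` or a stray comma does not matter). It only collapses the incremental repeats; the text carried over at the start of the following line is left as-is.

```python
def normalize(text):
    """Keep only letters and digits, lowercased, to ignore OCR spacing and punctuation noise."""
    return "".join(ch.lower() for ch in text if ch.isalnum())


def dedupe(lines):
    """Collapse consecutive cues whose normalized texts are prefixes of one another.

    Each input line is assumed to look like 'start|end|channel|text'.
    Returns a list of dicts with the first start, the last end, and the longest text.
    """
    merged = []
    for line in lines:
        if not line.strip():
            continue
        start, end, channel, text = line.rstrip("\n").split("|", 3)
        key = normalize(text)
        if merged:
            prev = merged[-1]
            prev_key = normalize(prev["text"])
            same_cue = key.startswith(prev_key) or prev_key.startswith(key)
            if prev["channel"] == channel and same_cue:
                prev["end"] = end            # extend the cue to the latest end time
                if len(key) > len(prev_key):
                    prev["text"] = text      # keep the most complete version
                continue
        merged.append({"start": start, "end": end, "channel": channel, "text": text})
    return merged


if __name__ == "__main__":
    sample = [
        "20170714110001.000|20170714110001.360|CC1|distribués gratuitement pour petits,",
        "20170714110001.360|20170714110001.480|CC1|distribués, gratuitement pour petits et",
        "20170714110001.880|20170714110002.280|CC1|distribués, gratuitement pour …",
        "20170714110002.440|20170714110002.840|CC1|petits et grands,, histoire que pe rd u re,",
        "20170714110003.400|20170714110003.800|CC1|petits et grands,, histoire que perdure, cette année encore,",
    ]
    for cue in dedupe(sample):
        print("|".join([cue["start"], cue["end"], cue["channel"], cue["text"]]))
```

On the lines quoted above this would leave two cues instead of ten. A real fix inside CCExtractor would presumably also need to trim the overlap between consecutive lines, which this sketch does not attempt.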