DVB subtitles from China

Open cfsmp3 opened this issue 8 years ago • 8 comments

We have a nice sample in the video-samples repository from a Chinese station that comes with DVB subtitles in Chinese.

Chinese DVB/Yan Oi Tong Charity Show 2014 [Live] - High Definition Jade - 2014-10-18.ts

Anshul, can you take a preliminary look?

cfsmp3 avatar Sep 11 '15 07:09 cfsmp3

Link to the file: https://drive.google.com/open?id=0B_61ywKPmI0TZFNPWmVrbGk2WDA

Alternative URL: https://sampleplatform.ccextractor.org/sample/176

cfsmp3 avatar Nov 28 '16 19:11 cfsmp3

For students working on the Code-in task: you will have to make sure that the DVB subtitle extraction system is working, and that you have the necessary Chinese recognition data files ('chi_sim' for simplified and 'chi_tra' for traditional, available at https://github.com/tesseract-ocr/tessdata). You need to place these files at the correct location (along with the English .traineddata files), and then call ccextractor with the parameters '-dvblang' and '-ocrlang'. (Read the help screen for details on those options.) Let us know how it goes!
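As a rough illustration of the steps above, the Python sketch below checks that the .traineddata files are present and then assembles the command line. The `tessdata` folder location, the sample file name, and the values `chi` and `chi_tra` passed to the two options are assumptions, so check the help screen for what your build expects. The helper names are made up for this example and are not part of CCExtractor.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, languages):
    """Return the language names whose .traineddata file is absent from tessdata_dir."""
    folder = Path(tessdata_dir)
    return [lang for lang in languages
            if not (folder / f"{lang}.traineddata").is_file()]


def build_command(sample, dvb_lang, ocr_lang, executable="ccextractor"):
    """Assemble the argument list using the -dvblang and -ocrlang options named above."""
    return [executable, str(sample), "-dvblang", dvb_lang, "-ocrlang", ocr_lang]


if __name__ == "__main__":
    # Assumed layout: a 'tessdata' folder in the directory ccextractor is run from.
    missing = missing_traineddata("tessdata", ["eng", "chi_tra", "chi_sim"])
    if missing:
        print("Missing traineddata for:", ", ".join(missing))

    # 'sample.ts', 'chi' and 'chi_tra' are placeholder values (assumptions).
    print(" ".join(build_command("sample.ts", "chi", "chi_tra")))
```

The printed command can be copied into a shell, or the list can be passed to `subprocess.run` once the files are in place.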

Abhinav95 avatar Nov 30 '16 04:11 Abhinav95

Hey, I can read Chinese. I'll give this one a look. Can't claim the task, though, because I'm currently doing another.

e: oh god 11gb i need to free up space on my drive

ghost avatar Dec 01 '16 10:12 ghost

The video is in Cantonese; the subtitles are in Traditional Chinese.

The result is not satisfactory. Although some (~20%) of the subtitles are extracted correctly, most of them are just random characters. The timing is not good either. The video itself also seems to be damaged somehow. CCExtractor crashes halfway through with -ocrlang (it says "something messy"), but it works if I use the parameter -out=spupng. This could be a bug in CCExtractor: if -out=spupng works with the video file, OCR should work too.

Example of bad OCR (Completely irrelevant. It seems that the only thing that matches is the number of the characters):

Generated: 跩鯉頤鮨嵐噩圉胸囍武蓿儡蘑意凰
Correct: 以罐頭作為主題的菜式有什麼意見
[image: correct]

Example of good OCR (although it's not completely correct):

Generated: 煎爛了嗎?那我屹掉不要浪賣
Correct: 煎爛了嗎?那我掉不要浪賣
[image: inaccurate]

Generated by --out=spupng (The result is squeezed. It could be the reason why OCR works incorrectly): [image: sub0005]
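The mismatch in the bad example can be put into numbers with a quick position-by-position comparison. This is only a minimal Python sketch using the two strings quoted above; `positional_match_rate` is a hypothetical helper, not something CCExtractor provides.

```python
def positional_match_rate(generated, reference):
    """Fraction of positions at which both strings hold the same character."""
    longest = max(len(generated), len(reference))
    if longest == 0:
        return 0.0
    # zip stops at the shorter string; unmatched tail positions count as misses.
    matches = sum(g == r for g, r in zip(generated, reference))
    return matches / longest


bad_generated = "跩鯉頤鮨嵐噩圉胸囍武蓿儡蘑意凰"
bad_reference = "以罐頭作為主題的菜式有什麼意見"

print("same length:", len(bad_generated) == len(bad_reference))
print("match rate:", positional_match_rate(bad_generated, bad_reference))
```

A positional comparison only makes sense when both strings have the same length. For output with an extra or missing character, as in the good example, an edit distance would be the fairer measure.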

Example of bad timing:

16
00:00:32,720 --> 00:00:34,079
來看看評判對第二組

This subtitle should start at 00:34, not 00:32.
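To measure how early the cue above is, the SRT timing line can be converted to milliseconds. A minimal Python sketch follows. The expected start of `00:00:34,000` is an assumption, since "00:34" is only given to the second, and the helper names are made up for this example.

```python
import re

# SRT timestamps look like HH:MM:SS,mmm
TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def srt_to_ms(stamp):
    """Convert one SRT timestamp to milliseconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(group) for group in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def cue_start_ms(timing_line):
    """Return the start time of a 'start --> end' timing line in milliseconds."""
    start, _, _end = timing_line.partition("-->")
    return srt_to_ms(start)


extracted_start = cue_start_ms("00:00:32,720 --> 00:00:34,079")
expected_start = srt_to_ms("00:00:34,000")  # assumed: "00:34" rounded to the second

print("offset in ms:", expected_start - extracted_start)
```

Running the same comparison over several cues would show whether the offset is constant or drifts over the file.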

harrynull avatar Dec 10 '17 05:12 harrynull

@Abhinav95 What is the correct location for the traineddata files? I am running Tesseract v4.1.0 (5.0.0 was still in alpha, so I didn't know whether it was a good idea to use it). I am on Windows 10 x64, and the Tesseract installation folder is in my PATH variable. I have downloaded the required traineddata files to the tessdata folder inside the installation path.

I'm not sure what I am doing wrong, but ccextractor just tells me that it can't find the traineddata files. I was able to get eng.traineddata detected after creating a tessdata folder inside the CCExtractor folder, but apparently it doesn't detect the other files.

RaXorX avatar Nov 19 '19 18:11 RaXorX

I'm going to merge all Chinese tasks here.

These two are closely related, so I'll be closing them: https://github.com/CCExtractor/ccextractor/issues/1379 https://github.com/CCExtractor/ccextractor/issues/918

cfsmp3 avatar Mar 22 '23 05:03 cfsmp3

I recently explored the GSoC 2024 projects and came across this issue regarding DVB subtitles from China. I noticed that there have been challenges with the accuracy of Tesseract for Chinese character recognition. I'd like to suggest considering PaddleOCR (https://github.com/PaddlePaddle/PaddleOCR) as a potential alternative. PaddleOCR is a multi-language OCR toolkit that leverages deep learning, and it's actively maintained by Baidu. It has been shown to be particularly effective for Chinese text recognition.

I have practical experience with PaddleOCR; I've successfully used it to extract text from Chinese and Japanese textbooks with high accuracy. The toolkit is user-friendly and straightforward to implement.

Would the integration of PaddleOCR be something the team is willing to consider? Please let me know your thoughts on this proposal.

esp0r avatar Feb 20 '24 15:02 esp0r

Sure. I've never heard of it personally, but that sounds like precisely the problem: we don't have any knowledge of the Chinese ecosystem.

cfsmp3 avatar Feb 20 '24 15:02 cfsmp3