ccextractor
[FEAT] Add flag for Page Segmentation Modes control
In raising this pull request, I confirm the following (please check boxes):
- [X] I have read and understood the contributors guide.
- [X] I have checked that another pull request for this purpose does not exist.
- [X] I have considered, and confirmed that this submission will be valuable to others.
- [X] I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
- [X] I give this submission freely, and claim no ownership to its content.
- [X] I have mentioned this change in the changelog.
My familiarity with the project is as follows (check one):
- [ ] I have never used CCExtractor.
- [ ] I have used CCExtractor just a couple of times.
- [X] I absolutely love CCExtractor, but have not contributed previously.
- [ ] I am an active contributor to CCExtractor.
I added a -psm flag for controlling the PSM (Page Segmentation Mode) in Tesseract. The default mode (3) gives me quite bad results. When I use 6, 11, or 12 for Bulgarian, I get much better OCR results. I haven't tested other languages yet, but I expect they would improve as well with a different mode.
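For reviewers, here is a minimal sketch of how a -psm value could be validated before it reaches Tesseract. It is not the code from this PR: `parse_psm` and `psm_from_args` are hypothetical helper names, and the sketch assumes the flag takes its value as the next argument. The range 0 to 13 matches the page segmentation modes Tesseract defines (3 is fully automatic, 6 is a single uniform block of text, 11 is sparse text, 12 is sparse text with OSD). With the Tesseract C API, the accepted value would then typically be passed to `TessBaseAPISetPageSegMode`.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Tesseract defines page segmentation modes 0..13. */
#define PSM_MIN 0
#define PSM_MAX 13
#define PSM_DEFAULT 3 /* fully automatic page segmentation, no OSD */

/* Hypothetical helper: parse the value given to -psm.
 * Returns the mode (0..13), or -1 if the value is not a valid mode. */
int parse_psm(const char *value)
{
    char *end = NULL;
    long mode;

    if (value == NULL || *value == '\0')
        return -1;

    errno = 0;
    mode = strtol(value, &end, 10);

    /* Reject overflow, trailing garbage, and out-of-range modes. */
    if (errno != 0 || *end != '\0' || mode < PSM_MIN || mode > PSM_MAX)
        return -1;

    return (int)mode;
}

/* Hypothetical helper: look for "-psm N" in the argument list.
 * Returns PSM_DEFAULT when the flag is absent, -1 when it is present
 * but its value is missing or invalid. */
int psm_from_args(int argc, char **argv)
{
    int i;

    assert(argv != NULL);

    for (i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-psm") == 0) {
            if (i + 1 >= argc)
                return -1; /* flag given without a value */
            return parse_psm(argv[i + 1]);
        }
    }
    return PSM_DEFAULT;
}
```

Falling back to 3 when the flag is absent keeps existing invocations unchanged, which matters for the regression tests since none of them pass -psm.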
CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results:
| Report Name | Tests Passed |
| --- | --- |
| Broken | 13/13 |
| CEA-708 | 2/14 |
| DVB | 4/7 |
| DVD | 3/3 |
| DVR-MS | 2/2 |
| General | 24/27 |
| Hauppage | 3/3 |
| MP4 | 3/3 |
| NoCC | 10/10 |
| Options | 85/87 |
| Teletext | 21/21 |
| WTV | 13/13 |
| XDS | 30/34 |
It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).
Your PR breaks these cases:
- ccextractor -autoprogram -out=srt -latin1 85271be4d2...
- ccextractor -autoprogram -out=ttxt -latin1 1974a299f0...
- ccextractor -autoprogram -out=ttxt -latin1 132d7df7e9...
- ccextractor -autoprogram -out=ttxt -latin1 99e5eaafdc...
- ccextractor -autoprogram -out=smptett -latin1 -ucla e274a73653...
- ccextractor -autoprogram -out=ttxt -xds -latin1 -ucla e274a73653...
- ccextractor -autoprogram -out=ttxt -latin1 -ucla -xds b22260d065...
- ccextractor -autoprogram -out=ttxt -latin1 -ucla -xds 88cd42b89a...
- ccextractor -svc 1 -out=txt -nobom -noru ea83ff7bcb...
- ccextractor -svc 1 -out=txt f17524b53f...
- ccextractor -svc 1 -out=txt 80848c45f8...
- ccextractor -svc 1 -out=txt -nobom -noru b5d6aad89f...
- ccextractor -svc 1[EUC-KR] -out=txt -noru b5d6aad89f...
- ccextractor -svc 1 -out=srt da904de35d...
- ccextractor -svc 1 -out=sami da904de35d...
- ccextractor -svc 1[EUC-KR] b5d6aad89f...
- ccextractor -svc 1[EUC-KR] -noru b5d6aad89f...
- ccextractor -svc all da904de35d...
- ccextractor -svc all[EUC-KR] b5d6aad89f...
- ccextractor -svc 1,2[UTF-8],3[EUC-KR],54 -out=txt da904de35d...
- ccextractor -svc 1 c83f765c66...
- ccextractor --capfile /repository/Dictionary/MattS_dictionary.txt c83f765c66...
- ccextractor -stdout -quiet -nofc 79a51f3500...
- ccextractor -stdout -quiet -nofc 767b546f96...
Check the result page for more info.
CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results:
| Report Name | Tests Passed |
| --- | --- |
| Broken | 13/13 |
| CEA-708 | 2/14 |
| DVB | 4/7 |
| DVD | 3/3 |
| DVR-MS | 2/2 |
| General | 24/27 |
| Hauppage | 3/3 |
| MP4 | 2/3 |
| NoCC | 10/10 |
| Options | 77/87 |
| Teletext | 21/21 |
| WTV | 1/13 |
| XDS | 26/34 |
It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).
Your PR breaks these cases:
- ccextractor -autoprogram -out=srt -latin1 85271be4d2...
- ccextractor -autoprogram -out=ttxt -latin1 1974a299f0...
- ccextractor -autoprogram -out=ttxt -latin1 132d7df7e9...
- ccextractor -autoprogram -out=ttxt -latin1 99e5eaafdc...
- ccextractor -out=srt -latin1 f23a544ba8...
- ccextractor -out=srt -latin1 97cc394d87...
- ccextractor -out=srt -latin1 10f0f77cf4...
- ccextractor -out=srt -latin1 df3b4d62d3...
- ccextractor -out=srt -latin1 d7e7dbdf68...
- ccextractor -out=srt -latin1 76734ac4a7...
- ccextractor -out=srt -latin1 c791382c94...
- ccextractor -out=srt -latin1 f673b2f916...
- ccextractor -out=srt -latin1 da75bdee47...
- ccextractor -out=srt -latin1 bd6f33a669...
- ccextractor -out=srt -latin1 0e5e6b26be...
- ccextractor -out=srt -latin1 a226cc302d...
- ccextractor -autoprogram -out=smptett -latin1 -ucla e274a73653...
- ccextractor -autoprogram -out=ttxt -xds -latin1 -ucla e274a73653...
- ccextractor -autoprogram -out=ttxt -latin1 -ucla -xds b22260d065...
- ccextractor -autoprogram -out=srt -latin1 -ucla b22260d065...
- ccextractor -autoprogram -out=ttxt -latin1 -xds -ucla c813e713a0...
- ccextractor -autoprogram -out=srt -latin1 -ucla c813e713a0...
- ccextractor -autoprogram -out=srt -latin1 -ucla c8dc039a88...
- ccextractor -autoprogram -out=ttxt -latin1 -ucla -xds 88cd42b89a...
- ccextractor -svc 1 -out=txt -nobom -noru ea83ff7bcb...
- ccextractor -svc 1 -out=txt f17524b53f...
- ccextractor -svc 1 -out=txt 80848c45f8...
- ccextractor -svc 1 -out=txt -nobom -noru b5d6aad89f...
- ccextractor -svc 1[EUC-KR] -out=txt -noru b5d6aad89f...
- ccextractor -svc 1 -out=srt da904de35d...
- ccextractor -svc 1 -out=sami da904de35d...
- ccextractor -svc 1[EUC-KR] b5d6aad89f...
- ccextractor -svc 1[EUC-KR] -noru b5d6aad89f...
- ccextractor -svc all da904de35d...
- ccextractor -svc all[EUC-KR] b5d6aad89f...
- ccextractor -svc 1,2[UTF-8],3[EUC-KR],54 -out=txt da904de35d...
- ccextractor -autoprogram -out=srt -latin1 -1 a65d39ccb3...
- ccextractor -svc 1 c83f765c66...
- ccextractor -out=txt c83f765c66...
- ccextractor -out=spupng c83f765c66...
- ccextractor -nogt c83f765c66...
- ccextractor --fixpadding c83f765c66...
- ccextractor -datastreamtype 2 c83f765c66...
- ccextractor -datastreamtype 2 -streamtype 2 c83f765c66...
- ccextractor --capfile /repository/Dictionary/MattS_dictionary.txt c83f765c66...
- ccextractor -in=es dc7169d7c4...
- ccextractor -autoprogram -out=srt -bom -latin1 8849331dda...
- ccextractor -stdout -quiet -nofc 79a51f3500...
- ccextractor -stdout -quiet -nofc 767b546f96...
Check the result page for more info.
@Neo2SHYAlien Could you please rebase? I'd like to merge this. Also @prateekmedia this could affect you since you were working on the rust param parsing.
@cfsmp3 I'll be more than happy to. In the next few days I'll update the code and push it 😊
@cfsmp3 I will handle this flag in my Rust PR.