
[FEAT] Add flag for Page Segmentation Modes control

Open Neo2SHYAlien opened this issue 1 year ago • 2 comments

In raising this pull request, I confirm the following (please check boxes):

  • [X] I have read and understood the contributors guide.
  • [X] I have checked that another pull request for this purpose does not exist.
  • [X] I have considered, and confirmed that this submission will be valuable to others.
  • [X] I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
  • [X] I give this submission freely, and claim no ownership to its content.
  • [X] I have mentioned this change in the changelog.

My familiarity with the project is as follows (check one):

  • [ ] I have never used CCExtractor.
  • [ ] I have used CCExtractor just a couple of times.
  • [X] I absolutely love CCExtractor, but have not contributed previously.
  • [ ] I am an active contributor to CCExtractor.

I added a -psm flag for controlling the PSM (Page Segmentation Mode) in Tesseract. The default mode (3) gives me quite bad results. When I use 6, 11, or 12 for Bulgarian, I get much better OCR results. I haven't tested other languages yet, but I expect improvements there as well if another mode is used.
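For readers unfamiliar with the mode numbers, here is a minimal sketch of how such a flag could be validated. It is not the code from this pull request: `parse_psm`, `PSM_MODES`, and `DEFAULT_PSM` are hypothetical names used only for illustration, and the mode descriptions paraphrase the list printed by `tesseract --help-psm`.

```python
# Tesseract page segmentation modes, paraphrased from `tesseract --help-psm`.
PSM_MODES = {
    0: "Orientation and script detection (OSD) only",
    1: "Automatic page segmentation with OSD",
    2: "Automatic page segmentation, no OSD or OCR",
    3: "Fully automatic page segmentation, no OSD (default)",
    4: "Single column of text of variable sizes",
    5: "Single uniform block of vertically aligned text",
    6: "Single uniform block of text",
    7: "Single text line",
    8: "Single word",
    9: "Single word in a circle",
    10: "Single character",
    11: "Sparse text, in no particular order",
    12: "Sparse text with OSD",
    13: "Raw line, bypassing Tesseract-specific hacks",
}

DEFAULT_PSM = 3  # the default mode mentioned in the description above


def parse_psm(args):
    """Hypothetical helper: return the mode given after -psm, else the default."""
    if "-psm" not in args:
        return DEFAULT_PSM
    idx = args.index("-psm")
    if idx + 1 >= len(args):
        raise ValueError("-psm requires a value")
    raw = args[idx + 1]
    try:
        mode = int(raw)
    except ValueError:
        raise ValueError(f"-psm expects an integer, got {raw!r}") from None
    if mode not in PSM_MODES:
        raise ValueError(f"-psm must be between 0 and 13, got {mode}")
    return mode


if __name__ == "__main__":
    # Modes reported above as working better for Bulgarian subtitles.
    for value in ("6", "11", "12"):
        mode = parse_psm(["-psm", value])
        print(mode, PSM_MODES[mode])
```

Modes 6, 11, and 12 all drop the assumption of a full page layout. That may be why they suit short subtitle images better than mode 3, though this is a guess and not something tested here.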

Neo2SHYAlien avatar Jun 29 '23 10:06 Neo2SHYAlien

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results:

| Report Name | Tests Passed |
| --- | --- |
| Broken | 13/13 |
| CEA-708 | 2/14 |
| DVB | 4/7 |
| DVD | 3/3 |
| DVR-MS | 2/2 |
| General | 24/27 |
| Hauppage | 3/3 |
| MP4 | 3/3 |
| NoCC | 10/10 |
| Options | 85/87 |
| Teletext | 21/21 |
| WTV | 13/13 |
| XDS | 30/34 |

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Your PR breaks these cases:


Check the result page for more info.

ccextractor-bot avatar Aug 27 '23 07:08 ccextractor-bot

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results:

| Report Name | Tests Passed |
| --- | --- |
| Broken | 13/13 |
| CEA-708 | 2/14 |
| DVB | 4/7 |
| DVD | 3/3 |
| DVR-MS | 2/2 |
| General | 24/27 |
| Hauppage | 3/3 |
| MP4 | 2/3 |
| NoCC | 10/10 |
| Options | 77/87 |
| Teletext | 21/21 |
| WTV | 1/13 |
| XDS | 26/34 |

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Your PR breaks these cases:


Check the result page for more info.

ccextractor-bot avatar Aug 27 '23 09:08 ccextractor-bot

@Neo2SHYAlien Could you please rebase? I'd like to merge this. Also @prateekmedia this could affect you since you were working on the rust param parsing.

cfsmp3 avatar Mar 03 '24 17:03 cfsmp3

@cfsmp3 I'll be more than happy to. In the next few days I'll update the code and push it 😊

Neo2SHYAlien avatar Mar 03 '24 18:03 Neo2SHYAlien

@cfsmp3 I will handle this flag in my Rust PR.

prateekmedia avatar Mar 04 '24 18:03 prateekmedia