Add flag for Page Segmentation Modes control

Open Neo2SHYAlien opened this issue 11 months ago • 9 comments

In raising this pull request, I confirm the following (please check boxes):

  • [X] I have read and understood the contributors guide.
  • [X] I have checked that another pull request for this purpose does not exist.
  • [X] I have considered, and confirmed that this submission will be valuable to others.
  • [X] I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
  • [X] I give this submission freely, and claim no ownership to its content.
  • [X] I have mentioned this change in the changelog.

My familiarity with the project is as follows (check one):

  • [ ] I have never used CCExtractor.
  • [ ] I have used CCExtractor just a couple of times.
  • [X] I absolutely love CCExtractor, but have not contributed previously.
  • [ ] I am an active contributor to CCExtractor.

I added a -psm flag for controlling Tesseract's PSM (Page Segmentation Mode). The default mode (3) gives me quite bad results. When I use 6, 11, or 12 for Bulgarian, I get much better OCR results. I haven't tested other languages yet, but I expect they would also improve with a different mode.
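For illustration, here is a minimal sketch of how a value passed to -psm could be validated before it reaches Tesseract. It is not the code from this PR: the helper name `parse_psm` and the macro names are hypothetical. The only facts it relies on are that Tesseract's page segmentation modes are numbered 0 through 13 and that 3 is the default.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Tesseract page segmentation modes run from 0 (OSD only) to 13 (raw line). */
#define PSM_MIN 0
#define PSM_MAX 13
#define PSM_DEFAULT 3 /* fully automatic page segmentation, no OSD */

/*
 * Hypothetical helper, not the PR's actual code: parse the value given
 * to -psm. Returns 0 and stores the mode in *out on success; returns -1
 * and leaves *out untouched if the text is not an integer in range.
 */
int parse_psm(const char *arg, int *out)
{
	char *end = NULL;
	long value;

	assert(arg != NULL && out != NULL);

	errno = 0;
	value = strtol(arg, &end, 10);
	if (errno != 0 || end == arg || *end != '\0')
		return -1; /* empty, overflowing, or trailing junk such as "6x" */
	if (value < PSM_MIN || value > PSM_MAX)
		return -1; /* outside the range Tesseract defines */

	*out = (int)value;
	return 0;
}
```

A caller would start from `PSM_DEFAULT` and overwrite it only when `parse_psm` succeeds. With the Tesseract C API, the resulting mode would presumably then be handed to `TessBaseAPISetPageSegMode`; where exactly that call belongs in CCExtractor's OCR setup is an assumption here.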

P.S. This PR continues #1544, which was closed after the rebase 🥲

Neo2SHYAlien, Mar 05 '24 12:03

@cfsmp3 After the resync of the main branch, the previous PR #1544 was closed automatically. I hope the code change is good enough; I'm not a daily dev 😊

Neo2SHYAlien, Mar 05 '24 12:03

@prateekmedia have you added this flag already?

PunitLodha, Aug 12 '24 14:08

@PunitLodha Not added yet, will add once this merges.

prateekmedia, Aug 12 '24 14:08

@prateekmedia could you add it to this PR itself?

PunitLodha, Aug 13 '24 08:08

@PunitLodha I have opened a PR against his repo: https://github.com/Neo2SHYAlien/ccextractor/pull/1

prateekmedia, Aug 23 '24 10:08

@prateekmedia merged

Neo2SHYAlien, Aug 23 '24 13:08

The failing tests will be resolved in #1635. cc @PunitLodha

prateekmedia, Aug 23 '24 13:08

@prateekmedia the tests aren't passing yet

PunitLodha, Sep 02 '24 20:09

@PunitLodha This PR needs a rebase again.

prateekmedia, Sep 02 '24 20:09

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit 1a13bbb...:

Report Name    Tests Passed
Broken         12/13
CEA-708        9/14
DVB            4/7
DVD            3/3
DVR-MS         2/2
General        15/27
Hauppage       2/3
MP4            3/3
NoCC           10/10
Options        83/86
Teletext       21/21
WTV            9/13
XDS            22/34

All tests passing on the master branch were passed completely.

NOTE: The following tests have been failing on the master branch as well as the PR:

Congratulations: Merging this PR would fix the following tests:


Check the result page for more info.

ccextractor-bot, Sep 02 '24 21:09

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit 1a13bbb...:

Report Name    Tests Passed
Broken         13/13
CEA-708        14/14
DVB            7/7
DVD            3/3
DVR-MS         2/2
General        27/27
Hauppage       3/3
MP4            3/3
NoCC           10/10
Options        85/86
Teletext       21/21
WTV            13/13
XDS            34/34

All tests passing on the master branch were passed completely.

NOTE: The following tests have been failing on the master branch as well as the PR:

Congratulations: Merging this PR would fix the following tests:


Check the result page for more info.

ccextractor-bot, Sep 02 '24 22:09