Sad situation with Windows + OCR

Open cfsmp3 opened this issue 4 years ago • 13 comments

While testing a previous ticket regarding hardsubx on Windows, on master, I ran this exact version, just compiled:

CCExtractor 0.88, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
CCExtractor detailed version info
        Version: 0.88
        Git commit: Unknown
        Compilation date: Unknown
        File SHA256: 0a40241ddd609f5272f063d25e0f2c29c2192187aabd2592da98909463b88541
Libraries used by CCExtractor
        Tesseract Version: 4.00.00dev
        Leptonica Version: leptonica-1.74 (Dec 31 2016, 12:28:35) [MSC v.1900 LIB Debug x86]
        libGPAC Version: 0.7.2-DEV
        zlib: 1.2.11
        utf8proc Version: 2.4.0
        protobuf-c Version: 1.3.1
        libpng Version: 1.6.35
        FreeType
        libhash
        nuklear
        libzvbi

First, the error reporting about eng.traineddata, as usual, couldn't suck more.

CCExtractor 0.88, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
HardsubX (Hard Subtitle Extractor) - Burned-in subtitle extraction subsystem
eng.traineddata not found! No Switching Possible

Seriously, would it kill us to tell the user WHERE we expect that file to be present?

OK. Since I didn't remember how this worked at all, I started looking into the code a bit. We do look at TESSDATA_PREFIX, among other places such as /usr/share. Wait, what? This is Windows! Why are we looking there? Also, I see lots of / as the path separator, but Windows uses \. Is this portable at all?

OK, so I set the env variable:

set TESSDATA_PREFIX=C:\Downloads

C:\Users\Carlos\source\repos\CCExtractor\ccextractor\windows\Debug-Full>dir c:\Downloads\tessdata
 Volume in drive C has no label.
 Volume Serial Number is 3A55-62AE

 Directory of c:\Downloads\tessdata

12-Apr-20  14:47    <DIR>          .
12-Apr-20  14:47    <DIR>          ..
12-Apr-20  14:46        23,466,654 eng.traineddata
               1 File(s)     23,466,654 bytes
               2 Dir(s)  92,672,598,016 bytes free

C:\Users\Carlos\source\repos\CCExtractor\ccextractor\windows\Debug-Full>ccextractorwinfull.exe -hardsubx c:\Downloads\ITV1.mp4
CCExtractor 0.88, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
HardsubX (Hard Subtitle Extractor) - Burned-in subtitle extraction subsystem
eng.traineddata not found! No Switching Possible

Still not working. The problem now is that I'm missing a \ at the end of the env variable.
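
One way a missing trailing separator can break a lookup like this is plain string concatenation. Whether that is what actually happens inside CCExtractor or Tesseract is an assumption; the sketch below is in Python for brevity (CCExtractor itself is C), and `naive_path` and `joined_path` are hypothetical names. `ntpath` is used so that Windows path rules apply on any platform:

```python
import ntpath  # Windows path rules, usable on any OS


def naive_path(prefix, lang):
    """Plain concatenation: only correct if prefix already ends with a separator."""
    return prefix + "tessdata\\" + lang + ".traineddata"


def joined_path(prefix, lang):
    """ntpath.join inserts a separator only where one is missing."""
    return ntpath.join(prefix, "tessdata", lang + ".traineddata")


# Compare both approaches with and without the trailing backslash.
for prefix in ("C:\\Downloads", "C:\\Downloads\\"):
    print(naive_path(prefix, "eng"), "|", joined_path(prefix, "eng"))
```

With the join-based version, both spellings of the prefix resolve to the same file, so the user would not need to know about the trailing backslash.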

OK, so let's set it correctly:

C:\Users\Carlos\source\repos\CCExtractor\ccextractor\windows\Debug-Full>set TESSDATA_PREFIX=C:\Downloads\

C:\Users\Carlos\source\repos\CCExtractor\ccextractor\windows\Debug-Full>ccextractorwinfull.exe -hardsubx c:\Downloads\ITV1.mp4
CCExtractor 0.88, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
HardsubX (Hard Subtitle Extractor) - Burned-in subtitle extraction subsystem
lstm_recognizer_->DeSerialize(tessdata_manager.swap(), &fp):Error:Assert failed:in file C:\Users\HOME\.cppan\storage\src\42\9e\ba91\ccmain\tessedit.cpp, line 202

So now apparently it starts at least, but then it crashes.

We just need to work on OCR + Windows.

In my opinion, at the very least:

  1. Proper information to the user, including which paths are being searched. And where do the errors come from? Is it tesseract, or us? Are we bailing out before even giving tesseract a try?
  2. Update tesseract 4 to the latest version OR downgrade to 3. But using 4.00.00 is ridiculous! It's buggy.
  3. Check if there's an officially compiled binary we can use. I remember we did our own thing a long time ago. Still needed?

Labelling HARD because we seem to be unable to fix it once and for all.

cc: @ShraxO1

cfsmp3 avatar Apr 12 '20 21:04 cfsmp3

#1170

NilsIrl avatar Apr 12 '20 22:04 NilsIrl

Jokes aside, try PR #1170 on Windows; it might solve the problem. Also, play around with deleting the data directory. You might come across a message from tesseract that says where it looked for the data and didn't find it. If this message is not to your liking, it can be modified or suppressed using the dup2 syscall.

EDIT: tesseract may have an API which wouldn't require the use of dup2.
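
For reference, the dup2 idea can be sketched as follows. This is only an illustration of the technique in Python (the real change would be in C, inside CCExtractor); `captured_stderr_fd` and `run_quietly` are hypothetical names, and the sketch assumes the library writes its message to file descriptor 2:

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def captured_stderr_fd():
    """Temporarily point file descriptor 2 at a temp file and yield that file."""
    saved_fd = os.dup(2)                  # keep a copy of the real stderr
    with tempfile.TemporaryFile(mode="w+b") as tmp:
        os.dup2(tmp.fileno(), 2)          # fd 2 now writes into tmp
        try:
            yield tmp
        finally:
            os.dup2(saved_fd, 2)          # restore the real stderr
            os.close(saved_fd)


def run_quietly(noisy_call):
    """Run noisy_call() and return whatever it wrote to fd 2 as text."""
    with captured_stderr_fd() as tmp:
        noisy_call()
        tmp.seek(0)
        return tmp.read().decode(errors="replace")
```

The caller can then decide whether to drop the captured text, rewrite it, or print it together with extra context such as the directories that were searched.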

NilsIrl avatar Apr 12 '20 22:04 NilsIrl

The issue might be the result of an improperly built solution. It should not appear if the solution is built properly. I checked with VS2015 and VS2019 (using the default SDKs) and did not encounter this issue.

apovalyaev avatar Apr 23 '20 19:04 apovalyaev

Now, after "Update VS project build settings", we can use the following steps (which automatically take the latest version of tesseract; for now, that is tesseract-4.1.1).

Build steps which use last version of Tesseract:

  1. Clone repository https://github.com/CCExtractor/ccextractor
  2. Setting up vcpkg:
     2.1) git clone https://github.com/Microsoft/vcpkg.git
          > cd vcpkg
          PS> .\bootstrap-vcpkg.bat
     2.2) Modify vcpkg/triplets/x86-windows.cmake:
          set(VCPKG_CRT_LINKAGE static)
          set(VCPKG_LIBRARY_LINKAGE static)
  3. Installing the last verified version of tesseract (NOTE: Now it is tesseract-4.1.1):
          vcpkg install tesseract:x86-windows
          vcpkg integrate install
  4. Building the solution

So, further steps:

  1. It makes sense to update the auto-build scripts, so that the auto-build also takes the last verified version of tesseract;
  2. Something else?

apovalyaev avatar Apr 26 '20 07:04 apovalyaev

I'd say there's something missing here. I followed your instructions, no errors (good), but the binary is still using tesseract-4.00dev, which makes sense - why would it pick any other version if that's the one we have inside the project?

cfsmp3 avatar Apr 26 '20 17:04 cfsmp3

Let's check if we are on the same page:

  1. Per my understanding, the only precompiled libraries we have inside the project are the ffmpeg ones (directory windows/libs/lib/). So a freshly cloned ccextractor should not build unless there is already some other copy of the tesseract library that was not installed through vcpkg. If the project compiled fine before you issued the "vcpkg install tesseract:x86-windows" command, it means you already have some other copy of tesseract installed. It makes sense to remove it;
  2. The other possible reason is what particular version of tesseract vcpkg has installed. You can use command "vcpkg list" to check what version of tesseract you have installed on your PC;

apovalyaev avatar Apr 26 '20 18:04 apovalyaev

@apovalyaev Tesseract is included in those "cppan" dependencies; refer to https://github.com/CCExtractor/ccextractor/tree/master/windows/libs/lib/release-lib

Refer to https://github.com/CCExtractor/ccextractor/pull/592 for the PR, and maybe @Izaron could explain a bit if needed?

canihavesomecoffee avatar Apr 26 '20 18:04 canihavesomecoffee

I've taken a look at #592 and can see that tesseract was manually compiled.

  1. #592 was closed at the beginning of 2017, so it might be a little outdated;
  2. On the other hand, it looks like there is a bug in the cppan dependencies (which was discovered while running with the "-hardsubx" option);

I can see two ways:

  1. Remove the "cppan" dependencies to see how it works with vcpkg;
  2. Rebuild the "cppan" libraries with a newer version of tesseract;

Any others? What would be the best fit?

apovalyaev avatar Apr 26 '20 19:04 apovalyaev

Both solutions are OK. Personally I favor "the least required steps when starting from scratch".

As a developer, I prefer not having to install a lot of things to build something for the first time. That makes me more likely to contribute to a project than if I have to install a whole toolchain to get to a binary.

For the end-user, we should strive to provide a self-contained .msi that includes any library we use. Possibly including tesseract DLLs (so users can replace them with newer versions if they want) would be better than statically linking tesseract.

cppan might have been the most convenient thing when it was added 3 years ago; it might not be the best solution today.

Since you are doing it, I'd say do whatever you prefer that works. If you get CCExtractor to report 4.1.1 (or whatever the current version is) and actually work, that's a better situation than what we have now.

@canihavesomecoffee is doing the GH Actions integration (so we can get a full binary from GH, instead of me manually building releases). It would be great to have this working again.

cfsmp3 avatar Apr 26 '20 23:04 cfsmp3

To make things work automatically, the project should provide both the tesseract-ocr libraries and compatible tessdata (this is what this issue is about). Hence, when building the solution/package, we need to:
(A) Replace the outdated "cppan" libraries with newly rebuilt versions;
(B) Add a tessdata directory to the https://github.com/ccextractor repository;

As for Step (A)... Below are the steps to make the project use vcpkg-supplied packages instead of the precompiled "cppan" ones (in other words, all the libraries in the directories windows\libs\lib\release-lib and windows\libs\lib\debug-lib).

It is all about "Debug-Full" and "Release-Full" build modes:

  1. Remove all files from the directories:
       windows\libs\lib\release-lib
       windows\libs\lib\debug-lib
     and update the project's "additional libraries" settings to remove the corresponding library dependencies (those "cppan" libraries).
  2. Issue the following command to stop VS from automatically linking libraries supplied by vcpkg:
       vcpkg integrate remove
     Then:
     2.1) vcpkg export --zip tesseract:x86-windows
          NOTE: of course, this assumes that the appropriate packages are already installed (see the vcpkg commands mentioned previously). This command automatically creates a .zip archive including all the appropriate .lib files. The archive will be named something like "vcpkg-export-....zip" (the name can be taken from the vcpkg command output).
     2.2) Extract the archive to some appropriate location:
          vcpkg-export-20200427-142748\installed\x86-windows\lib
     2.3) Copy all libraries from the "installed\x86-windows\lib" subdirectory to ccextractor's windows\libs\lib\release-lib (for release). Do the same for debug:
          ${vcpkg-export-directory}\installed\x86-windows\debug\lib -> ccextractor\windows\libs\lib\debug-lib
     2.4) Update the project's "additional libraries" settings accordingly.
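
The copy in step 2.3 could be scripted roughly as below. This is a sketch in Python, not an existing CCExtractor script; `copy_libs` and `sync_vcpkg_export` are hypothetical names, and the directory layout is assumed to be exactly the one described in the steps above:

```python
import shutil
from pathlib import Path


def copy_libs(src_dir, dst_dir):
    """Copy every *.lib file from src_dir into dst_dir; return the copied names."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for lib in sorted(src_dir.glob("*.lib")):
        shutil.copy2(lib, dst_dir / lib.name)
        copied.append(lib.name)
    return copied


def sync_vcpkg_export(export_root, ccx_root):
    """Mirror step 2.3: release and debug libraries go to their own directories."""
    installed = Path(export_root) / "installed" / "x86-windows"
    libs = Path(ccx_root) / "windows" / "libs" / "lib"
    return {
        "release": copy_libs(installed / "lib", libs / "release-lib"),
        "debug": copy_libs(installed / "debug" / "lib", libs / "debug-lib"),
    }
```

Calling `sync_vcpkg_export` with the extracted export directory and the ccextractor checkout returns the names of the libraries it copied, which is also the list needed for step 2.4.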

I will prepare a pull request with: (1) newly rebuilt libraries (replacing the old "cppan" ones); (2) a tessdata subdirectory added to the ccextractor project.

apovalyaev avatar Apr 27 '20 12:04 apovalyaev

If TESSDATA_PREFIX isn't set, the program will just look into its root folder. And once you throw in the age appropriate models you are good. Not a big deal really.

The problem if any is that you crash badly and without explanations after "FFMpeg Media Information".

mirh avatar Mar 28 '22 21:03 mirh

@cfsmp3 What do you think of this, now that we already have the Windows build system and CI fixed?

prateekmedia avatar Mar 16 '23 16:03 prateekmedia

I think we're still missing the issue with the trained data file. If it's not found, rather than "Not found!" it should say:

"Not found. I looked in these directories: [ xxxx, xxxx, xxxx ]"

cfsmp3 avatar Mar 17 '23 01:03 cfsmp3
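
A minimal sketch of the message requested above, in Python for brevity (the real fix would live in CCExtractor's C code). `candidate_dirs` and `find_traineddata` are hypothetical names, and the search order (TESSDATA_PREFIX first, then the program's own folder) is an assumption pieced together from this thread, not a description of the actual lookup:

```python
import os
from pathlib import Path


def candidate_dirs(exe_dir, env=None):
    """Directories to search, in order (assumed order, see the note above)."""
    env = os.environ if env is None else env
    dirs = []
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:
        dirs.append(Path(prefix) / "tessdata")
    dirs.append(Path(exe_dir) / "tessdata")
    return dirs


def find_traineddata(lang, dirs):
    """Return the first existing <lang>.traineddata, or raise listing every directory tried."""
    name = lang + ".traineddata"
    for d in dirs:
        candidate = Path(d) / name
        if candidate.is_file():
            return candidate
    searched = ", ".join(str(d) for d in dirs)
    raise FileNotFoundError(
        f"{name} not found. I looked in these directories: [ {searched} ]"
    )
```

Collecting the candidate list first and searching it afterwards means the error message reports exactly the directories that were tried, so the two cannot drift apart.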