
Marker skips chapter during text extraction

Open nekiee13 opened this issue 1 year ago • 3 comments

  1. Installed Marker from the dev branch under Windows 11. For some reason it always skips Chapter 5 ("V. Instructions, Procedures, and Drawings") entirely. Document attached: 10CFR50AppB_LibOff.pdf

  2. When executing the script, I also get this warning message, but it does not seem to cause any issues:

D:\PDF\vMarker2a\lib\site-packages\threadpoolctl.py:1214: RuntimeWarning: 
Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at the same time. Both libraries are known to be incompatible and this can cause random crashes or deadlocks on Linux when loaded in the same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

  warnings.warn(msg, RuntimeWarning)

nekiee13 avatar Jun 02 '24 23:06 nekiee13

The text for that chapter probably isn't in the PDF's text layer. If you set OCR_ALL_PAGES=true, does it behave any differently?
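One way to test whether the chapter text is present in the PDF at all is to read the embedded text layer directly, without OCR, and search it for the heading. The following is a minimal sketch, not part of Marker: it assumes the third-party `pypdf` package is installed, and the helper names `extract_page_texts` and `find_heading_pages` are hypothetical.

```python
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the embedded text layer of each page (no OCR involved)."""
    # Assumption: pypdf is installed. It is imported lazily so that
    # find_heading_pages below can be used without it.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # Image-only pages typically yield an empty string here.
    return [page.extract_text() or "" for page in reader.pages]


def _normalize(text: str) -> str:
    # Collapse runs of whitespace (including line breaks) and ignore case,
    # since headings are often split across lines in extracted text.
    return " ".join(text.split()).lower()


def find_heading_pages(page_texts: List[str], heading: str) -> List[int]:
    """Return the zero-based indices of pages whose text contains `heading`."""
    needle = _normalize(heading)
    return [i for i, text in enumerate(page_texts) if needle in _normalize(text)]


# Example usage with the document attached to this issue:
# pages = extract_page_texts("10CFR50AppB_LibOff.pdf")
# print(find_heading_pages(pages, "V. Instructions, Procedures, and Drawings"))
```

An empty result would suggest the heading is missing from the text layer (or stored in a form this simple match does not catch). A non-empty result would suggest the text is there and is being dropped at a later stage.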

VikParuchuri avatar Jun 03 '24 14:06 VikParuchuri

Set OCR to True and reran. No change. The JSON confirms that OCR succeeded. JSON log attached:

https://gist.github.com/nekiee13/43169c47126fd6f6d9f3de2438ead2dd

nekiee13 avatar Jun 04 '24 21:06 nekiee13

Looking at the position of the Chapter 5 heading, it sits close to the bottom of the page, which might have caused the model to misidentify it as a footnote or footer and remove it from the final output.
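To illustrate the kind of positional heuristic described here, the sketch below flags a text block that starts in the bottom portion of a page. It is purely hypothetical and is not Marker's actual classification logic: the function name `in_bottom_band`, the 15% threshold, and the top-left coordinate origin are all assumptions made for the example.

```python
from typing import Tuple

# Assumed layout: (x0, y0, x1, y1) in points, origin at the top-left corner,
# so larger y values are further down the page.
BBox = Tuple[float, float, float, float]


def in_bottom_band(bbox: BBox, page_height: float, band: float = 0.15) -> bool:
    """Return True if the block starts inside the bottom `band` fraction of the page."""
    if page_height <= 0:
        raise ValueError("page_height must be positive")
    _, y0, _, _ = bbox
    return y0 >= page_height * (1.0 - band)


# A heading near the bottom of a US Letter page (792 pt tall) falls in the band,
# so a position-only rule could mistake it for a footer.
print(in_bottom_band((72.0, 700.0, 540.0, 720.0), 792.0))
# The same heading near the top of the page does not.
print(in_bottom_band((72.0, 100.0, 540.0, 120.0), 792.0))
```

A rule based only on position cannot tell a late-page chapter heading from a real footer, which would be consistent with the behaviour reported in this issue.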

myhloli avatar Jul 11 '24 16:07 myhloli