marker
Extraction from 2-column text: marker mixes paragraphs from the left and right columns.
I've installed marker using WSL under Win11 and tested it in English, Croatian, and Slovenian. It does a perfect job of removing headers, footers, sidelines, etc.
It struggles with:
- text bullets (I can fix that easily by hand or with regex if required), and
- 2-column text: it mixes paragraphs from the left and right columns, which is not easily resolvable (read: unsolvable, for me at least).
Do you have any idea why this is happening or how to fix it?
I attached a test PDF file in case you want to check the output yourself: pdfTestEN.pdf. The test PDF file was created with LibreOffice & Microsoft Print to PDF.
Thanks for the test case. I'll look into this. The ordering model should figure out the column count, but it may be misclassifying these pages.
Just to add: in some instances (not always), some paragraphs are also omitted from the output md.
Update: I did some more testing, and it seems that the issue with LayoutLMv3 is related to the PDF format version. With PDF 2.0 it seems to work much better (fine/correct). "MS Print to pdf" uses PDF 1.7. I'll do some more testing, but it seems that the PDF format is causing the issue. Is that possible?
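For anyone who wants to compare their own files against this observation, here is a minimal sketch for reading the version a PDF declares in its `%PDF-x.y` header. The function names are hypothetical and this is not part of marker. Note also that a PDF's document catalog may carry a `/Version` entry that overrides the header, which this sketch does not inspect.

```python
import re
from typing import Optional

# Matches the header comment that a PDF file starts with, e.g. b"%PDF-1.7".
_HEADER_RE = re.compile(rb"%PDF-(\d+\.\d+)")


def pdf_header_version(data: bytes) -> Optional[str]:
    """Return the version from the %PDF-x.y header, or None if no header is found."""
    # The header is expected near the start of the file, so only the
    # first 1 KiB is searched.
    match = _HEADER_RE.search(data[:1024])
    return match.group(1).decode("ascii") if match else None


def pdf_file_version(path: str) -> Optional[str]:
    """Convenience wrapper (hypothetical helper): read the header from a file on disk."""
    with open(path, "rb") as fh:
        return pdf_header_version(fh.read(1024))
```

Running `pdf_file_version` on the files that order correctly and on the ones that do not would show whether the declared version really differs between the two groups.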
I honestly haven't looked into PDF format versions, so I don't know if that is it. That's an interesting find, though.
It may also be related to whether the data in the PDF was scanned/OCRed or if it is a "digital" PDF created along with the text. Digital PDFs would have higher-quality bounding boxes and images.
I'm currently working on a better way to detect columns that might be useful for this.
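To illustrate what column detection has to achieve, here is a rough geometric sketch that orders text blocks for a one- or two-column page. It is not marker's ordering model and not the approach being worked on; it is a simple heuristic under stated assumptions: blocks come with bounding boxes in a top-left-origin coordinate system, there are at most two columns, and `Block`, `find_gutter`, `reading_order`, and the thresholds are all hypothetical names and values.

```python
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    # Bounding box with a top-left origin (assumption): y grows downward.
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def find_gutter(blocks: List[Block], page_width: float,
                min_gap: float = 10.0) -> Optional[float]:
    """Return the x-centre of the widest vertical strip not covered by any
    narrow block, or None if no strip is at least min_gap wide."""
    # Ignore blocks wider than 60% of the page (titles, full-width text).
    narrow = [b for b in blocks if (b.x1 - b.x0) < 0.6 * page_width]
    if len(narrow) < 2:
        return None
    intervals = sorted((b.x0, b.x1) for b in narrow)
    best_gap, best_mid = 0.0, None
    covered_to = intervals[0][1]
    for x0, x1 in intervals[1:]:
        gap = x0 - covered_to
        if gap > best_gap:
            best_gap, best_mid = gap, (covered_to + x0) / 2
        covered_to = max(covered_to, x1)
    return best_mid if best_gap >= min_gap else None


def reading_order(blocks: List[Block], page_width: float) -> List[Block]:
    """Order blocks as left column then right column, with full-width
    blocks acting as separators between column bands."""
    gutter = find_gutter(blocks, page_width)
    if gutter is None:
        # Single column: plain top-to-bottom order.
        return sorted(blocks, key=lambda b: (b.y0, b.x0))

    ordered: List[Block] = []
    left: List[Block] = []
    right: List[Block] = []

    def flush() -> None:
        # Emit the pending band: whole left column first, then the right.
        ordered.extend(sorted(left, key=lambda b: b.y0))
        ordered.extend(sorted(right, key=lambda b: b.y0))
        left.clear()
        right.clear()

    for b in sorted(blocks, key=lambda b: (b.y0, b.x0)):
        if b.x0 < gutter < b.x1:
            # Block straddles the gutter: treat it as full-width.
            flush()
            ordered.append(b)
        elif (b.x0 + b.x1) / 2 < gutter:
            left.append(b)
        else:
            right.append(b)
    flush()
    return ordered
```

A plain top-to-bottom sort interleaves left and right paragraphs that sit at similar heights, which matches the symptom reported here. Splitting on the gutter first avoids that, but a heuristic like this will break on pages with three or more columns, sidebars, or noisy bounding boxes.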
same issue here.
Same problem on a 2-column pdf:
- with 2 column text - it mixes paragraphs from left and right column
Hi, I also have the same problem. Here are the links to the PDF files:
- https://www.datenschutz.rlp.de/fileadmin/lfdi/Dokumente/Orientierungshilfen/DSK_KPNr_2_Sanktionen.pdf
- https://www.datenschutz.rlp.de/fileadmin/lfdi/Dokumente/Orientierungshilfen/DSK_KPNr_3_Werbung.pdf
The other 18 pdf files listed here https://www.datenschutz.rlp.de/de/themenfelder-themen/datenschutz-grundverordnung/kurzpapiere-zur-auslegung-der-ds-gvo/ might also have the same issues. Thanks.
any news about this issue?
This should be fixed in the new version (coming in the next couple of weeks).
Great 👍
On Fri, 3 May 2024, 07:07, Vik Paruchuri wrote:
> This should be fixed in the new version (coming in the next couple of weeks).
Try the dev branch if you're having issues. Better ordering is implemented there, but I still have to test it more before merging.