Extraction from 2-column text: marker mixes paragraphs from the left and right columns

Open nekiee13 opened this issue 1 year ago • 10 comments

I've installed marker using WSL under Win11. I tested it in English, Croatian and Slovenian. It does a perfect job removing headers, footers, sidelines, etc.

It struggles with:

  • text bullets (I can fix that easily by hand or with regex if required), and
  • 2-column text: it mixes paragraphs from the left and right columns, which is not easily fixable (the resulting reading order is unrecoverable, for me at least).

Do you have any idea why this is happening or how to fix it?

I attached a test PDF file in case you want to check the output yourself: pdfTestEN.pdf. The test file was created with LibreOffice and Microsoft Print to PDF.

nekiee13 avatar Dec 24 '23 19:12 nekiee13

Thanks for the test case. I'll look into this. The ordering model should figure out the column count, but it may be misclassifying these pages.

VikParuchuri avatar Dec 28 '23 02:12 VikParuchuri
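To illustrate the symptom reported above, here is a minimal sketch (not written by the thread participants and not marker's actual ordering model) of why a plain top-to-bottom sort interleaves the two columns, and how grouping blocks by horizontal overlap first avoids it. The `Block` type, the coordinates, and both function names are hypothetical.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """A text block with its bounding box (hypothetical structure)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def naive_order(blocks: List[Block]) -> List[Block]:
    """Sort top to bottom, then left to right. This interleaves columns."""
    return sorted(blocks, key=lambda b: (b.y0, b.x0))


def column_order(blocks: List[Block]) -> List[Block]:
    """Group blocks into columns by horizontal overlap, then read each
    column top to bottom, columns left to right."""
    columns: List[List[Block]] = []
    right_edge = float("-inf")
    for block in sorted(blocks, key=lambda b: b.x0):
        if block.x0 >= right_edge:  # no overlap with the current column
            columns.append([])
        columns[-1].append(block)
        right_edge = max(right_edge, block.x1)
    return [b for col in columns for b in sorted(col, key=lambda b: b.y0)]


# Made-up two-column page: two paragraphs per column.
blocks = [
    Block(320, 100, 550, 200, "R1"),
    Block(50, 100, 280, 200, "L1"),
    Block(50, 220, 280, 300, "L2"),
    Block(320, 220, 550, 300, "R2"),
]

print([b.text for b in naive_order(blocks)])   # ['L1', 'R1', 'L2', 'R2']
print([b.text for b in column_order(blocks)])  # ['L1', 'L2', 'R1', 'R2']
```

This simple grouping breaks down as soon as a full-width element (a title or a figure spanning both columns) overlaps everything, which is one reason real layouts need a more robust approach.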

Just to add: in some instances (not always), some paragraphs are also omitted from the output md.

nekiee13 avatar Dec 29 '23 14:12 nekiee13

Update: I did some more testing, and it seems that the issue with LayoutLMv3 is related to the PDF format version. With PDF 2.0 it seems to work much better (the output is correct). "MS Print to PDF" uses PDF 1.7. I'll do some more testing, but it seems that the PDF version is causing the issue. Is that possible?

nekiee13 avatar Jan 03 '24 01:01 nekiee13
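For anyone who wants to compare files the same way, the declared PDF version can be read from the file header, which begins with `%PDF-` followed by the version number. The following is a minimal sketch, not part of marker; the function names are hypothetical. Note that a `/Version` entry in the document catalog can override the header value, and this sketch does not check for that.

```python
import re
from typing import Optional


def pdf_header_version(data: bytes) -> Optional[str]:
    """Return the version declared in a PDF header (e.g. '1.7'), or None.

    Only the first 1024 bytes are searched, since some files carry
    leading bytes before the '%PDF-' marker.
    """
    match = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    return match.group(1).decode("ascii") if match else None


def pdf_file_version(path: str) -> Optional[str]:
    """Read the start of a file and return its declared PDF version."""
    with open(path, "rb") as f:
        return pdf_header_version(f.read(1024))


print(pdf_header_version(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"))  # 1.7
print(pdf_header_version(b"%PDF-2.0\n"))                      # 2.0
```

Calling `pdf_file_version("pdfTestEN.pdf")` on the attached test file would show which version the header declares for that document.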

I honestly haven't looked into PDF format versions, so I don't know if that is it. That's an interesting find, though.

It may also be related to whether the data in the PDF was scanned/OCRed or whether it is a "digital" PDF created along with the text. Digital PDFs would have higher-quality bounding boxes and images.

VikParuchuri avatar Jan 04 '24 00:01 VikParuchuri

I'm currently working on a better way to detect columns that might be useful for this

VikParuchuri avatar Jan 04 '24 00:01 VikParuchuri

same issue here.

paulcx avatar Jan 17 '24 08:01 paulcx

Same problem on a 2-column pdf:

  • with 2 column text - it mixes paragraphs from left and right column

tifa365 avatar Jan 24 '24 20:01 tifa365

Hi, I also have the same problem. Here are the links to the PDF files:

  1. https://www.datenschutz.rlp.de/fileadmin/lfdi/Dokumente/Orientierungshilfen/DSK_KPNr_2_Sanktionen.pdf
  2. https://www.datenschutz.rlp.de/fileadmin/lfdi/Dokumente/Orientierungshilfen/DSK_KPNr_3_Werbung.pdf

The other 18 pdf files listed here https://www.datenschutz.rlp.de/de/themenfelder-themen/datenschutz-grundverordnung/kurzpapiere-zur-auslegung-der-ds-gvo/ might also have the same issues. Thanks.

cahya-wirawan avatar Feb 11 '24 20:02 cahya-wirawan

Same problem on a 2-column pdf:

with 2 column text - it mixes paragraphs from left and right column

hannah-chy avatar Feb 16 '24 06:02 hannah-chy

any news about this issue?

cahya-wirawan avatar Mar 15 '24 22:03 cahya-wirawan

This should be fixed in the new version (coming in the next couple of weeks).

VikParuchuri avatar May 03 '24 05:05 VikParuchuri

Great 👍

nekiee13 avatar May 03 '24 08:05 nekiee13

Try the dev branch if you're having issues - there is better ordering implemented there, but I still have to test more before merging.

VikParuchuri avatar May 07 '24 18:05 VikParuchuri