Is there a way to restrict the areas of a page that are read?

Open tpanza opened this issue 1 year ago • 6 comments

I am having to process a large PDF document. It has some logos, boilerplate text, and other useless text in the top, bottom, left, and right margins of every page.

The model(s) seem to struggle to recognize these and turn them into Markdown text, which results in random gobbledygook at every PDF page boundary in the resulting Markdown file.

Might there be a way to pass in some settings so that these margin areas are ignored? I took a look in the settings.py (https://github.com/VikParuchuri/marker/blob/master/marker/settings.py) but didn't see anything about that.

I see on the README, "Removes headers/footers/other artifacts", but how do I control/tweak that?
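
To make the request concrete, here is a minimal sketch of the kind of margin filter I have in mind. It is not marker's API: the `Block` type, `content_area`, and `keep_inside` are hypothetical names, and it assumes text blocks come with bounding boxes in page points with the origin at the top-left corner.

```python
from dataclasses import dataclass
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass(frozen=True)
class Block:
    """Hypothetical text block with a bounding box (origin at top-left)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def content_area(page_width: float, page_height: float,
                 left: float, top: float,
                 right: float, bottom: float) -> Rect:
    """Return the rectangle that remains after trimming the four margins."""
    x0, y0 = left, top
    x1, y1 = page_width - right, page_height - bottom
    if x0 >= x1 or y0 >= y1:
        raise ValueError("margins leave no content area")
    return (x0, y0, x1, y1)


def keep_inside(blocks: List[Block], area: Rect) -> List[Block]:
    """Keep only blocks whose bounding box lies entirely inside the area."""
    ax0, ay0, ax1, ay1 = area
    return [
        b for b in blocks
        if b.x0 >= ax0 and b.y0 >= ay0 and b.x1 <= ax1 and b.y1 <= ay1
    ]


# Example: a US Letter page (612 x 792 pt) with 50 pt margins on every side.
area = content_area(612, 792, left=50, top=50, right=50, bottom=50)
page_blocks = [
    Block("ACME Corp logo", 60, 20, 200, 40),   # sits in the top margin
    Block("Body text", 72, 100, 540, 700),      # inside the content area
    Block("Page 3 of 90", 280, 760, 330, 780),  # sits in the bottom margin
]
print([b.text for b in keep_inside(page_blocks, area)])
```

Only blocks fully inside the trimmed rectangle survive, so a logo or footer that straddles the margin boundary is dropped too. Expanding the area to keep headers and footers would just mean passing smaller (or zero) margins.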

tpanza avatar Jun 14 '24 18:06 tpanza

yep, would be great to have that. in my case, I'd like to EXPAND THE AREA such that certain headers/footers are actually included because now they are omitted while containing important headings (e.g. of tables)

luc42ei avatar Jun 21 '24 16:06 luc42ei

actually, one can expand the area by changing the BAD_SPAN_TYPES parameter in the settings.py file. it seems like removing all elements there would imply expanding the area to 100%

luc42ei avatar Jun 22 '24 13:06 luc42ei

@luc42ei In version 1, there is no BAD_SPAN_TYPES anymore. What are you using now?

svenha avatar Dec 02 '24 15:12 svenha

Coming here with the same thought. I see "Removes headers/footers/other artifacts" on the README, but cannot find a way to control it.

Nevermetyou65 avatar Jan 01 '25 08:01 Nevermetyou65

@VikParuchuri How do I return headers and footers back into the text?

dibu28 avatar Jan 28 '25 22:01 dibu28

@VikParuchuri How do I return headers and footers back into the text?

same question here

gingemonster avatar Feb 25 '25 08:02 gingemonster