Segmenting Markdown-converted PDFs into pages
Hi @VikParuchuri,
Thank you very much for creating this invaluable package, which I have already found extremely useful in several projects. Could an option be added to indicate where pages start and end in the output Markdown? Even the ability to add a custom delimiter such as `<page>` would help.
For anyone else interested in preserving page boundaries, I managed to add a page delimiter by:
- Replacing the `merge_lines()` function in `markdown.py` with the following:

    def merge_lines(blocks, page_blocks: List[Page]):
        text_blocks = []
        prev_type = None
        prev_line = None
        block_text = ""
        block_type = ""
        common_line_heights = [p.get_line_height_stats() for p in page_blocks]
        for page_i, page in enumerate(blocks):
            for block in page:
                block_type = block.most_common_block_type()
                if block_type != prev_type and prev_type:
                    text_blocks.append(
                        FullyMergedBlock(
                            text=block_surround(block_text, prev_type),
                            block_type=prev_type
                        )
                    )
                    block_text = ""

                prev_type = block_type
                # Join lines in the block together properly
                for i, line in enumerate(block.lines):
                    line_height = line.bbox[3] - line.bbox[1]
                    prev_line_height = prev_line.bbox[3] - prev_line.bbox[1] if prev_line else 0
                    prev_line_x = prev_line.bbox[0] if prev_line else 0
                    prev_line = line
                    is_continuation = line_height == prev_line_height and line.bbox[0] == prev_line_x
                    if block_text:
                        block_text = line_separator(block_text, line.text, block_type, is_continuation)
                    else:
                        block_text = line.text

            # This is where the magic happens!
            # Append U+FFFC (object replacement character) after every page except the last.
            if page_i != len(blocks) - 1:
                block_text += '\ufffc'
            # This is where the magic ends!

        # Append the final block
        text_blocks.append(
            FullyMergedBlock(
                text=block_surround(block_text, prev_type),
                block_type=block_type
            )
        )
        return text_blocks
- Replacing `lowercase_letters = "a-zà-öø-ÿа-яşćăâđêôơưþðæøå"` in the `line_separator()` function of `markdown.py` with `lowercase_letters = "a-zà-öø-ÿа-яşćăâđêôơưþðæøå\ufffc"`. This ensures that delimiters do not cause newlines to be inserted in the middle of lines.
This uses U+FFFC (Unicode's object replacement character, written `'\ufffc'` in Python) instead of `<page>` because it is a single character and can therefore be added directly to the `lowercase_letters` regex character set without reworking the regex patterns. You may replace it with any other character of your choosing.
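To illustrate why adding the delimiter to the character set matters, here is a minimal sketch. The pattern below is a simplified, hypothetical stand-in for the kind of check `line_separator()` performs, not Marker's actual regex, and `ends_mid_sentence` is an invented helper name.

```python
import re

LOWERCASE = "a-zà-öø-ÿа-яşćăâđêôơưþðæøå"
PAGE_DELIMITER = "\ufffc"  # U+FFFC, object replacement character


def ends_mid_sentence(line: str, letters: str) -> bool:
    """Hypothetical check: does the line end in a 'lowercase' character?

    A joiner could use a test like this to decide that the next line
    continues the sentence rather than starting a new paragraph.
    """
    return re.search(rf"[{letters}]\s?$", line) is not None


# A line that ends a page in the middle of a sentence.
line = "the quick brown fox" + PAGE_DELIMITER

# Without the delimiter in the set, the line no longer looks like it
# ends in a lowercase letter, so a newline might be inserted.
unpatched = ends_mid_sentence(line, LOWERCASE)

# With the delimiter added to the set, the line is still treated as
# ending mid-sentence.
patched = ends_mid_sentence(line, LOWERCASE + PAGE_DELIMITER)
```

Because the delimiter is a single character, it slots into the existing `[...]` set; a multi-character marker such as `<page>` would need the patterns themselves to change.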
This is a bit of a hacky solution, so I'd still like to see page segmentation implemented officially in marker.
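Once the delimiter is in the output, splitting the Markdown back into pages could look something like the sketch below. `split_pages` and `to_visible_delimiter` are hypothetical helper names, and the sketch assumes the delimiter is U+FFFC as described above.

```python
from typing import List

PAGE_DELIMITER = "\ufffc"  # assumed delimiter: U+FFFC


def split_pages(markdown: str, delimiter: str = PAGE_DELIMITER) -> List[str]:
    """Return one Markdown string per page, with surrounding whitespace trimmed."""
    return [page.strip() for page in markdown.split(delimiter)]


def to_visible_delimiter(markdown: str, visible: str = "\n\n<page>\n\n") -> str:
    """Swap the invisible delimiter for a readable marker such as <page>."""
    return markdown.replace(PAGE_DELIMITER, visible)


sample = "# Title\n\nFirst page text\ufffcSecond page text\ufffcThird page"
pages = split_pages(sample)
readable = to_visible_delimiter("a\ufffcb")
```

Doing the swap to `<page>` after conversion keeps the single-character trick during line merging while still giving a readable marker in the final file.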
Yes, you need to edit `schema.py` and `markdown.py`:

def merge_lines(blocks, page_blocks: List[Page]):
    text_blocks = []
    prev_type = None
    prev_line = None
    block_text = ""
    block_type = ""
    block_pnum = 0
    common_line_heights = [p.get_line_height_stats() for p in page_blocks]
    for page in blocks:
        for block in page:
            block_pnum = block.pnum
            block_type = block.most_common_block_type()
            if block_type != prev_type and prev_type:
                text_blocks.append(
                    FullyMergedBlock(
                        text=block_surround(block_text, prev_type),
                        block_type=prev_type,
                        pnum=block_pnum
                    )
                )
                block_text = ""

            prev_type = block_type
            # Join lines in the block together properly
            for i, line in enumerate(block.lines):
                line_height = line.bbox[3] - line.bbox[1]
                prev_line_height = prev_line.bbox[3] - prev_line.bbox[1] if prev_line else 0
                prev_line_x = prev_line.bbox[0] if prev_line else 0
                prev_line = line
                is_continuation = line_height == prev_line_height and line.bbox[0] == prev_line_x
                if block_text:
                    block_text = line_separator(block_text, line.text, block_type, is_continuation)
                else:
                    block_text = line.text

    # Append the final block
    text_blocks.append(
        FullyMergedBlock(
            text=block_surround(block_text, prev_type),
            block_type=block_type,
            pnum=block_pnum
        )
    )
    return text_blocks
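With a `pnum` on each merged block, the text can be grouped per page afterwards. A rough sketch follows, using a hypothetical `Block` dataclass as a stand-in for `FullyMergedBlock` and assuming `pnum` is a zero-based page index (that is an assumption, not something confirmed in this thread).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Block:
    """Hypothetical stand-in for FullyMergedBlock with a page number."""
    text: str
    block_type: str
    pnum: int


def pages_from_blocks(blocks: List[Block]) -> Dict[int, str]:
    """Group block text by page number, preserving block order within a page."""
    grouped: Dict[int, List[str]] = {}
    for block in blocks:
        grouped.setdefault(block.pnum, []).append(block.text)
    # Join each page's blocks with a blank line between them.
    return {pnum: "\n\n".join(texts) for pnum, texts in grouped.items()}


blocks = [
    Block("Intro", "Text", 0),
    Block("More intro", "Text", 0),
    Block("Next page", "Text", 1),
]
pages = pages_from_blocks(blocks)
```

Compared with a delimiter character, this keeps the page number as structured data, at the cost of changing the schema.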
@nunamia How about getting this solution merged?
However, I'm observing issues with the page numbers. I have a document from the EU Parliament where every page has content, but the page numbers appear too often and jump.
@Terranic Try out my solution, I haven't found that issue with it.
Thanks for the script, @umarbutler. This is on my list of features to include, as a few people have asked for it.
Here's a script to monkeypatch Marker with @umarbutler's solution:
import ast
import inspect
import marker.postprocessors.markdown

class MarkdownTransformer(ast.NodeTransformer):
    def __init__(self):
        self.current_function = None

    def visit_FunctionDef(self, node):
        # Store the current function name
        self.current_function = node.name
        # Visit all the child nodes within the function
        self.generic_visit(node)
        # Reset current function name to None after leaving the function
        self.current_function = None
        return node

    def visit_Assign(self, node):
        if self.current_function == 'line_separator':
            if isinstance(node.targets[0], ast.Name) and node.targets[0].id == 'lowercase_letters':
                if isinstance(node.value, ast.Constant) and isinstance(node.value.value, str):
                    original_value = node.value.value  # might want node.value.s
                    # Add the page delimiter (U+FFFC) to the character set
                    new_value = original_value + '\ufffc'
                    node.value = ast.Constant(value=new_value)
        return node

    def visit_For(self, node):
        if self.current_function == 'merge_lines':
            # Check if the loop iterates over a variable named 'page'
            if isinstance(node.target, ast.Name) and node.target.id == 'page':
                # Change the loop to use enumerate
                node.iter = ast.Call(
                    func=ast.Name(id='enumerate', ctx=ast.Load()),
                    args=[node.iter],
                    keywords=[]
                )
                node.target = ast.Tuple(elts=[
                    ast.Name(id='page_i', ctx=ast.Store()),
                    ast.Name(id='page', ctx=ast.Store())
                ], ctx=ast.Store())
                # Create the additional check and append operation
                # (the snippet must start at column 0 so that ast.parse accepts it)
                page_check = ast.parse("""
if page_i != len(blocks) - 1:
    block_text += '\ufffc'
""").body[0]
                node.body.append(page_check)
        return node

# Get the source code and make the AST
markdown_source = inspect.getsource(marker.postprocessors.markdown)
markdown_ast = ast.parse(markdown_source)
# Create the AST transformer instance
markdown_transformer = MarkdownTransformer()
# Perform the transformation (explores the tree and applies defined transformation functions, returning the new tree)
markdown_ast = markdown_transformer.visit(markdown_ast)
# Fix missing locations in the modified AST
ast.fix_missing_locations(markdown_ast)
# Replace the functions in the actual module - e.g. internal module calls to
# marker.postprocessors.markdown.line_separator will use the updated version.
exec(compile(markdown_ast, filename='<ast>', mode='exec'), marker.postprocessors.markdown.__dict__)
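The same AST-rewriting technique can be tried on a toy function before pointing it at Marker. The sketch below is self-contained and does not import Marker; `SOURCE`, `join_pages`, and `EnumeratePages` are invented for illustration only.

```python
import ast

# Toy stand-in for a module's source code; not Marker's real code.
SOURCE = '''
def join_pages(pages):
    text = ""
    for page in pages:
        text += page
    return text
'''


class EnumeratePages(ast.NodeTransformer):
    """Rewrite `for page in X` as `for page_i, page in enumerate(X)`
    and append a delimiter after every page except the last."""

    def visit_For(self, node):
        if isinstance(node.target, ast.Name) and node.target.id == "page":
            node.iter = ast.Call(
                func=ast.Name(id="enumerate", ctx=ast.Load()),
                args=[node.iter],
                keywords=[],
            )
            node.target = ast.Tuple(
                elts=[
                    ast.Name(id="page_i", ctx=ast.Store()),
                    ast.Name(id="page", ctx=ast.Store()),
                ],
                ctx=ast.Store(),
            )
            # Parse the extra statement and attach it to the end of the loop body.
            check = ast.parse(
                "if page_i != len(pages) - 1:\n    text += '\\ufffc'"
            ).body[0]
            node.body.append(check)
        return node


tree = EnumeratePages().visit(ast.parse(SOURCE))
ast.fix_missing_locations(tree)

# Execute the patched source in a fresh namespace and fetch the new function.
namespace = {}
exec(compile(tree, filename="<ast>", mode="exec"), namespace)
patched_join_pages = namespace["join_pages"]
joined = patched_join_pages(["a", "b", "c"])
```

Running the patched toy function first is a cheap way to confirm that the transformer does what is intended before executing it against a real module's `__dict__`.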
To save others some debugging: using @umarbutler's method requires changing two files, `marker/schema/merged.py` and `marker/postprocessors/markdown.py`.
Note: tested on marker-pdf==0.2.5.
merged.py
from collections import Counter
from typing import List, Optional
from pydantic import BaseModel
from marker.schema.bbox import BboxElement


class MergedLine(BboxElement):
    text: str
    fonts: List[str]

    def most_common_font(self):
        counter = Counter(self.fonts)
        return counter.most_common(1)[0][0]


class MergedBlock(BboxElement):
    lines: List[MergedLine]
    pnum: int
    block_type: Optional[str]


class FullyMergedBlock(BaseModel):
    text: str
    block_type: str
    pnum: int

markdown.py: replace the `merge_lines` function.
def merge_lines(blocks: List[List[MergedBlock]]):
    text_blocks = []
    prev_type = None
    prev_line = None
    block_text = ""
    block_type = ""
    block_pnum = 0
    # common_line_heights = [p.get_line_height_stats() for p in page_blocks]
    for page_i, page in enumerate(blocks):
        for block in page:
            block_pnum = block.pnum
            block_type = block.block_type
            if block_type != prev_type and prev_type:
                text_blocks.append(
                    FullyMergedBlock(
                        text=block_surround(block_text, prev_type),
                        block_type=prev_type,
                        pnum=block_pnum
                    )
                )
                block_text = ""

            prev_type = block_type
            # Join lines in the block together properly
            for i, line in enumerate(block.lines):
                line_height = line.bbox[3] - line.bbox[1]
                prev_line_height = prev_line.bbox[3] - prev_line.bbox[1] if prev_line else 0
                prev_line_x = prev_line.bbox[0] if prev_line else 0
                prev_line = line
                is_continuation = line_height == prev_line_height and line.bbox[0] == prev_line_x
                if block_text:
                    block_text = line_separator(block_text, line.text, block_type, is_continuation)
                else:
                    block_text = line.text

        # This is where the magic happens!
        # Append U+FFFC (object replacement character) after every page except the last.
        if page_i != len(blocks) - 1:
            block_text += '\ufffc'
        # This is where the magic ends!

    # Append the final block
    text_blocks.append(
        FullyMergedBlock(
            text=block_surround(block_text, prev_type),
            block_type=block_type,
            pnum=block_pnum
        )
    )
    return text_blocks