PDFSegmenter
PDFSegmenter copied to clipboard
This library builds a graph-representation of the content of PDFs. The graph is then clustered, resulting page segments are classified and returned. Tables are retrieved formatted as a CSV.
PDF Segmenter
This library builds a graph-representation of the content of PDFs. The graph is then clustered, resulting page segments are classified and returned. Tables are retrieved formatted in a CSV-style.
How-to
- Pass the path of the PDF file (as a string) which is wanted to be converted to
PDFSegmenter
. - Call the function
segment_document()
. - The function
get_labeled_graphs()
returns page-wise document graph representations as a list ofnetworkx
graphs. The labels indicate a clustering assignment. -
segments2json()
returns a JSON representation of the segmented document. -
segments2text()
returns a textual representation of the segmented document. This can be either annotated (lists, text and tables are supported) or not and controlled via the boolean parameterannotate
.
Example call:
segmenter = PDFSegmenter(pdf)
segmenter.segment_document()
result = segmenter.segments2json()
text = segmenter.segments2text()
graphs = get_labeled_graphs()
A file is the only parameter mandatory for the page segmentation.
A more detailed example usage is also given in Tester.py
.
Example
JSON
tbd
Annotated text
tbd
Graph
The following image shows a resulting document graph representation when using the GraphConverter
.
data:image/s3,"s3://crabby-images/42062/420628f2a83645d5e2e4a3a597bc2fc2402a2df6" alt=""
Settings
Clustering
tbd
Merging
tbd
Classification
tbd
Graph
General parameters:
-
file
: file name -
merge_boxes
: indicating if PDF text boxes should be graph nodes, based on visual rectangles present in documents. -
regress_parameters
: indicating if graph parameters are regressed or used as a priori optimized default ones.
Edge restrictions:
-
use_font
: differing font size -
use_width
: differing width -
use_rect
: nodes contained in differing visual structures -
use_horizontal_overlap
: indicating if horizontal edges should be built on overlap. If not, default deltas are used. -
use_vertical_overlap
: indicating if vertical edges should be built on overlap. If not, default deltas are used.
Edge thresholds:
-
page_ratio_x
: maximal relative horizontal distance of two nodes where an edge can be created -
page_ratio_y
: maximal relative vertical distance of two nodes where an edge can be created -
x_eps
: alignment epsilon for vertical edges in points ifuse_horizontal_overlap
is not enabled -
y_eps
: alignment epsilon for horizontal edges in points ifuse_vertical_overlap
is not enabled -
font_eps_h
: indicates how much font sizes of nodes are allowed to differ as a constraint for building horizontal edges whenuse_font
is enabled -
font_eps_v
: indicates how much font sizes of nodes are allowed to differ as a constraint for building vertical edges whenuse_font
is enabled -
width_pct_eps
: relative width difference of nodes as a condition for vertical edges ifuse_width
is enabled -
width_page_eps
: indicating at which maximal width of a node the width should act as an edge condition ifuse_width
is enabled
Project Structure
tbd
Output Format
JSON
tbd
Text
tbd
Graph
As a result, a list of networkx
graphs is returned.
Each graph encapsulates a structured representation of a single page.
Edges are attributed with the following features:
-
direction
: shows the direction of an edge.-
v
: Vertical edge -
h
: Horizontal edge -
l
: Rectangular loop. This represents a novel concept encapsulating structural characteristics of document segments by observing if two different paths end up in the same node.
-
-
length
: Scaled length of an edge -
lengthx_phys
: Horizontal edge length -
lengthy_phys
: Vertical edge length -
weight
: Scaled total length
All nodes contain the following content attributes:
-
id
: unique identifier of the PDF element -
page
: page number, starting with 0 -
text
: text of the PDF element -
x_0
: left x coordinate -
x_1
: right x coordinate -
y_0
: top y coordinate -
y_1
: bottom y coordinate -
pos_x
: center x coordinate -
pos_y
: center y coordinate -
abs_pos
: tuple containing a page independent representation of(pos_x,pos_y)
coordinates -
original_font
: font as extracted by pdfminer -
font_name
: name of the font extracted fromoriginal_font
-
code
: font code as provided by pdfminer -
bold
: factor 1 indicating that a text is bold and 0 otherwise -
italic
: factor 1 indicating that a text is italic and 0 otherwise -
font_size
: size of the text in points -
masked
: text with numeric content substituted as # -
frequency_hist
: histogram of character type frequencies in a text, stored as a tuple containing percentages of textual, numerical, text symbolic and other symbols -
len_text
: number of characters -
n_tokens
: number of words -
tag
: tag for key-value pair extractions, indicating keys or values based on simple heuristics -
box
: box extracted by pdfminer Layout Analysis -
in_element_ids
: contains IDs of surrounding visual elements such as rectangles or lists. They are stored as a list [left, right, top, bottom]. -1 is indicating that there is no adjacent visual element. -
in_element
: indicates based on in_element_ids whether an element is stored in a visual rectangle representation (stored as "rectangle") or not (stored as "none"). -
is_loop
: indicates whether or not a node is connected via a rectangular loop
Acknowledgements
- Example PDFs are obtained from the ICDAR Table Recognition Challenge 2013 https://roundtrippdf.com/en/data-extraction/pdf-table-recognition-dataset/.
Authors
- Michael Benedikt Aigner
- Florian Preis