[Paragraph extraction] backend
The paragraph extraction feature should operate on top of the segmentation process. Segmentation is activated on a per-instance basis, processing all PDFs within that instance, including newly uploaded ones. Segmentation is controlled by the following feature toggle:
features.segmentation: {url: "http://10.0.11.196:5051/async_extraction"}
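A minimal sketch of how backend code might check this toggle before attempting paragraph extraction. Only the `features.segmentation.url` key comes from this issue; the `InstanceSettings` interface and the `segmentationUrl` helper are hypothetical names used for illustration, not existing Uwazi APIs.

```typescript
// Hypothetical shape of the instance settings.
// Only `features.segmentation.url` is taken from the issue text.
interface InstanceSettings {
  features?: {
    segmentation?: { url: string };
  };
}

// Returns the segmentation service URL when the toggle is configured,
// or undefined when segmentation is not active for the instance.
function segmentationUrl(settings: InstanceSettings): string | undefined {
  const url = settings.features?.segmentation?.url;
  return url && url.length > 0 ? url : undefined;
}

// Example settings objects (illustrative values).
const enabledSettings: InstanceSettings = {
  features: { segmentation: { url: "http://10.0.11.196:5051/async_extraction" } },
};
const disabledSettings: InstanceSettings = { features: {} };
```

Paragraph extraction could then be skipped whenever `segmentationUrl(settings)` returns `undefined`, since there would be no segmentation data to build on.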
For backend paragraph extraction, the system should retrieve segmentation data from the segmentation MongoDB collection and filter segment types to exclude unwanted content such as page headers, text on pictures, or formulas.
Each segmentation paragraph in Mongo looks like this:
left: number;
top: number;
width: number;
height: number;
page_number: number;
text: string;
type: string; // not in Uwazi yet
The segmentation service (document layout analysis async) returns the following segment types:
"Caption"
"Footnote"
"Formula"
"List item"
"Page footer"
"Page header"
"Picture"
"Section header"
"Table"
"Text"
"Title"
The segment types kept as paragraphs could be limited to this short list:
"List item"
"Section header"
"Text"
"Title"
Find more information about the segmentation in the following repositories:
https://github.com/huridocs/pdf-document-layout-analysis-async
https://github.com/huridocs/pdf-document-layout-analysis
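The type filter described above can be sketched as follows. This is an illustration under stated assumptions, not the Uwazi implementation: `SegmentationParagraph`, `PARAGRAPH_TYPES`, and `extractParagraphs` are hypothetical names, the sample data is invented, and `type` is treated as optional because the issue notes it is not in Uwazi yet. Reading the segments from the MongoDB collection is left out.

```typescript
// Segment shape as listed in this issue. `type` is optional in this sketch
// because the issue notes it is not stored in Uwazi yet (assumption).
interface SegmentationParagraph {
  left: number;
  top: number;
  width: number;
  height: number;
  page_number: number;
  text: string;
  type?: string;
}

// The short list of desired segment types from this issue.
const PARAGRAPH_TYPES: ReadonlySet<string> = new Set([
  "List item",
  "Section header",
  "Text",
  "Title",
]);

// Keeps only segments whose type is in the allow-list.
// Segments without a type are dropped (a choice made for this sketch).
function extractParagraphs(
  segments: SegmentationParagraph[]
): SegmentationParagraph[] {
  return segments.filter(
    (segment) => segment.type !== undefined && PARAGRAPH_TYPES.has(segment.type)
  );
}

// Invented sample data for illustration only.
const sampleSegments: SegmentationParagraph[] = [
  { left: 72, top: 30, width: 450, height: 12, page_number: 1, text: "Annual report", type: "Page header" },
  { left: 72, top: 90, width: 450, height: 18, page_number: 1, text: "Introduction", type: "Section header" },
  { left: 72, top: 120, width: 450, height: 60, page_number: 1, text: "This report covers the year.", type: "Text" },
  { left: 72, top: 200, width: 200, height: 20, page_number: 1, text: "E = mc^2", type: "Formula" },
];

const paragraphs = extractParagraphs(sampleSegments);
console.log(paragraphs.map((p) => p.text));
```

With the sample data, only the "Section header" and "Text" entries are kept. An allow-list is used here rather than a list of excluded types so that any new type the service starts returning is left out until it is explicitly added.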