
[Paragraph extraction] backend

Open gabriel-piles opened this issue 6 months ago • 0 comments

The paragraph extraction feature should operate on top of the segmentation process. Segmentation is activated on a per-instance basis, processing all PDFs within that instance, including newly uploaded ones. Segmentation is controlled by the following feature toggle:

features.segmentation: {url: "http://10.0.11.196:5051/async_extraction"}
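
As a rough illustration of how backend code might check this toggle before attempting paragraph extraction, here is a minimal sketch. The `InstanceSettings` and `SegmentationToggle` types and the `segmentationUrl` helper are hypothetical names used only for this example and are not taken from the Uwazi codebase; the sketch also assumes that a missing or empty `url` means segmentation is disabled for the instance.

```typescript
// Hypothetical shape of the relevant slice of the instance settings.
interface SegmentationToggle {
  url: string;
}

interface InstanceSettings {
  features?: { segmentation?: SegmentationToggle };
}

// Returns the configured segmentation service URL, or null when the
// toggle is absent or its url is empty (assumed to mean "disabled").
function segmentationUrl(settings: InstanceSettings): string | null {
  const url = settings.features?.segmentation?.url;
  return url !== undefined && url.trim() !== '' ? url : null;
}
```

Paragraph extraction could then be skipped for any instance where this helper returns null.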

For backend paragraph extraction, the system should retrieve segmentation data from the segmentation MongoDB collection and filter segment types to exclude unwanted content such as page headers, text on pictures, or formulas.

The segmentation paragraphs in Mongo look like this:

left: number;
top: number;
width: number;
height: number;
page_number: number;
text: string;
type: string; // not in Uwazi yet

The segmentation service (also known as document layout analysis async) returns the following segment types:

"Caption"
"Footnote"
"Formula"
"List item"
"Page footer"
"Page header"
"Picture"
"Section header"
"Table"
"Text"
"Title"

The segment types kept as paragraphs could be limited to this short list:

"List item"
"Section header"
"Text"
"Title"

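The filtering step described above could be sketched as follows. This is only an illustration under stated assumptions: the `Segment` interface mirrors the fields listed in this issue, while `PARAGRAPH_TYPES` and `filterParagraphSegments` are hypothetical names, and retrieving the segments from the MongoDB collection is left out.

```typescript
// Segment shape as listed in this issue. The `type` field is noted
// above as not yet present in Uwazi.
interface Segment {
  left: number;
  top: number;
  width: number;
  height: number;
  page_number: number;
  text: string;
  type: string;
}

// Hypothetical allow-list built from the short list proposed above.
const PARAGRAPH_TYPES: ReadonlySet<string> = new Set([
  'List item',
  'Section header',
  'Text',
  'Title',
]);

// Keeps only the segments whose type is in the allow-list,
// preserving their original order.
function filterParagraphSegments(segments: Segment[]): Segment[] {
  return segments.filter(segment => PARAGRAPH_TYPES.has(segment.type));
}
```

An allow-list is used here rather than a deny-list so that any new segment type the service might add later is excluded by default until it is reviewed.
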
Find more information about the segmentation in the following repositories:

https://github.com/huridocs/pdf-document-layout-analysis-async
https://github.com/huridocs/pdf-document-layout-analysis

gabriel-piles, Aug 16 '24 12:08