
Mapping PDF document TextPosition to XHTML span attributes, Lucene queries

Open kinjelom opened this issue 8 years ago • 4 comments

This PR maps PDF document TextPosition data to XHTML span attributes (e.g. coordinates to style), so that the XHTML rendered in a JFX WebView looks like the parsed PDF. I've also changed the Tika App GUI from a card layout to a JTabbedPane, which is easier to use.
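As a rough illustration of the "coordinates to style" idea, here is a minimal sketch of turning one positioned text run into an absolutely positioned XHTML span. It is not the code from this pull request: the class `SpanSketch`, the method `toSpan`, and the choice of CSS properties are all assumptions made for the example, and the inputs are plain floats such as the values a PDFBox `TextPosition` exposes through `getXDirAdj()`, `getYDirAdj()` and `getFontSizeInPt()`.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika or PDFBox.
final class SpanSketch {

    /** Escapes characters that are special in XHTML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /**
     * Builds a span whose inline style places the text at (x, y), in points,
     * measured from the top-left corner of the page container.
     */
    static String toSpan(String text, float x, float y, float fontSizePt) {
        // Locale.ROOT keeps '.' as the decimal separator regardless of the default locale.
        return String.format(Locale.ROOT,
                "<span style=\"position:absolute;left:%.1fpt;top:%.1fpt;font-size:%.1fpt\">%s</span>",
                x, y, fontSizePt, escape(text));
    }

    public static void main(String[] args) {
        System.out.println(toSpan("Invoice & Co", 72f, 100.5f, 12f));
    }
}
```

For the spans to line up, the enclosing page element would need `position:relative` and the page's dimensions; that part is left out of the sketch.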

Screenshot of the parsed sample invoice (source PDF: https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf):

[image]

Lucene query UI, with automatic in-RAM indexing of the tested documents:

[image]

kinjelom avatar Feb 19 '17 14:02 kinjelom

Thank you for opening this.

Would you be able to break this into 2 separate pull requests: one for the PDFParser modifications, and one for the mods to tika-app's GUI?

On the PDFParser mods, is there any way to make the syntax similar to what we get from Tesseract's hOCR setting?
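For readers unfamiliar with hOCR: it carries geometry in the `title` attribute as `bbox x0 y0 x1 y1` (integer pixel coordinates), with words typically marked by `class='ocrx_word'`. A minimal sketch of emitting that shape from a PDF-style box follows. The class `HocrSketch`, the method `toHocrWord`, and the `scale` parameter (points to pixels) are hypothetical and only illustrate the format; this is not what the PDFParser currently produces.

```java
// Hypothetical helper, not part of Tika or Tesseract.
final class HocrSketch {

    /**
     * Formats one word as an hOCR-style span. (x, y) is the top-left corner and
     * width/height the box size, all in points; scale converts points to pixels.
     */
    static String toHocrWord(String text, float x, float y,
                             float width, float height, float scale) {
        int x0 = Math.round(x * scale);
        int y0 = Math.round(y * scale);
        int x1 = Math.round((x + width) * scale);
        int y1 = Math.round((y + height) * scale);
        // Minimal escaping for element content.
        String escaped = text.replace("&", "&amp;")
                             .replace("<", "&lt;")
                             .replace(">", "&gt;");
        return "<span class='ocrx_word' title='bbox "
                + x0 + " " + y0 + " " + x1 + " " + y1 + "'>" + escaped + "</span>";
    }

    public static void main(String[] args) {
        System.out.println(toHocrWord("Total", 10f, 20f, 30f, 12f, 2f));
    }
}
```

Which scale to use depends on the resolution the consumer assumes, since a PDF has no inherent pixel grid.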

On an unrelated note, as I just discovered on tika-eval, the Lucene 6.x branch requires Java 8. We're trying to keep Tika at Java 7 for now, so please downgrade Lucene to 5.5.3.

Finally, can you also open issues on our JIRA for these two pull requests?

Thank you!

tballison avatar Feb 21 '17 13:02 tballison

On the PDFParser mods, is there any way to make the syntax similar to what we get from Tesseract's hocr setting

@epugh would this be of use to you? Would you need/want the same format as hOCR?

This is what hocr looks like: sliced_invoice.pdf.hocr.txt

tballison avatar Feb 21 '17 18:02 tballison

Sorry for the delay.

  1. TikaGUI on Java 7. I think there is no problem with Lucene 5.5.3 and javafx.scene.web.WebView (@since JavaFX 2.0). I'm going to downgrade it.

  2. PDFParser. I am looking for an abstraction that covers all text document parsers. As a result I need a collection of pages (as DOM, stream or events), each containing a collection of areas (lines, tables, ...), each of which in turn contains a collection of text elements. Each text element should have a text value, coordinates (x, y, top, left), size (height, width) and font properties. With this abstraction I could, for example, implement a preview of any parsed text document, or a mapper from text elements to business values. Do you have an abstraction like this? Is it a good plan to create one?
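To make the proposed hierarchy concrete, here is a minimal sketch of pages containing areas containing text elements. It is only an illustration of the idea described above, not an existing Tika API: the class names `Page`, `Area`, `TextElement`, the `kind` field, and the `textIn` lookup are all assumptions, and the single (x, y) pair is taken to be the element's top-left corner.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** One positioned run of text with its size and font properties. */
final class TextElement {
    final String text;
    final float x, y, width, height;   // assumed: (x, y) is the top-left corner
    final String fontName;
    final float fontSizePt;

    TextElement(String text, float x, float y, float width, float height,
                String fontName, float fontSizePt) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
        this.fontName = fontName;
        this.fontSizePt = fontSizePt;
    }
}

/** A group of text elements, e.g. a line or a table. */
final class Area {
    final String kind;                  // hypothetical label such as "line" or "table"
    final List<TextElement> elements;

    Area(String kind, List<TextElement> elements) {
        this.kind = kind;
        this.elements = Collections.unmodifiableList(new ArrayList<>(elements));
    }
}

/** A page holding its areas in reading order. */
final class Page {
    final int number;
    final List<Area> areas;

    Page(int number, List<Area> areas) {
        this.number = number;
        this.areas = Collections.unmodifiableList(new ArrayList<>(areas));
    }

    /** Returns the text of every element whose top-left corner lies inside the rectangle. */
    List<String> textIn(float left, float top, float right, float bottom) {
        List<String> hits = new ArrayList<>();
        for (Area area : areas) {
            for (TextElement e : area.elements) {
                if (e.x >= left && e.x <= right && e.y >= top && e.y <= bottom) {
                    hits.add(e.text);
                }
            }
        }
        return hits;
    }
}
```

A lookup like `textIn` is one way the "mapper to business values" could work: a template names a rectangle per field (say, the invoice total) and reads whatever text falls inside it.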

kinjelom avatar Apr 12 '17 14:04 kinjelom

  1. TikaGUI on Java 7 - done. https://github.com/apache/tika/pull/167

kinjelom avatar Apr 12 '17 15:04 kinjelom