tika
Mapping PDF document TextPosition to XHTML span attributes, Lucene queries
This maps PDF document TextPosition data to XHTML span attributes (e.g. coordinates to style). The XHTML is rendered in a JavaFX WebView, so the output looks like the parsed PDF. I've also changed the Tika App GUI from a card layout to a JTabbedPane, which is easier to use.
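To illustrate the coordinates-to-style idea, here is a minimal sketch. It is not the code from this pull request: `SpanMapper` and `toSpan` are hypothetical names, the position and font size are passed in as plain floats rather than read from a PDFBox `TextPosition`, and absolute positioning in points is only one possible choice of CSS.

```java
import java.util.Locale;

// Hypothetical helper (not part of Tika or PDFBox): turns one positioned
// text run into an absolutely positioned XHTML span.
final class SpanMapper {
    private SpanMapper() {}

    /** Escapes characters that are unsafe in XHTML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Maps x/y coordinates and font size (assumed to be in points) to a style attribute. */
    static String toSpan(String text, float x, float y, float fontSizePt) {
        // Locale.ROOT keeps the decimal separator a dot, as CSS requires.
        String style = String.format(Locale.ROOT,
                "position:absolute;left:%.1fpt;top:%.1fpt;font-size:%.1fpt",
                x, y, fontSizePt);
        return "<span style=\"" + style + "\">" + escape(text) + "</span>";
    }
}
```

A renderer such as WebView could then place each span at the position the text had on the PDF page, which is roughly why the rendered XHTML resembles the source document.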
Screenshot of a parsed sample invoice, available here: https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf
Lucene query UI, with automatic in-RAM indexing of the tested documents:
Thank you for opening this.
Would you be able to break this into 2 separate pull requests: one for the PDFParser modifications, and one for the mods to tika-app's GUI?
On the PDFParser mods, is there any way to make the syntax similar to what we get from Tesseract's hocr setting?
On an unrelated note, as I just discovered on tika-eval, the Lucene 6.x branch requires Java 8. We're trying to keep Tika at Java 7 for now, so please downgrade Lucene to 5.5.3.
Finally, can you also open issues on our jira for these two pull requests?
Thank you!
On the PDFParser mods, is there any way to make the syntax similar to what we get from Tesseract's hocr setting
@epugh would this be of use to you? Would you need/want same format as hocr?
This is what hocr looks like: sliced_invoice.pdf.hocr.txt
Sorry for the delay.
- TikaGUI on Java 7. I think there is no problem with Lucene 5.5.3 and javafx.scene.web.WebView (@since JavaFX 2.0). I'm going to downgrade it.
- PDFParser. I am looking for an abstraction over all text document parsers. As a result I need a collection of pages (DOM, stream or events), each containing a collection of areas (lines, tables, ...), which in turn contain a collection of text elements. Each text element should have a text value, coordinates (x, y, top, left), a size (height, width) and font properties. With this abstraction I could, e.g., implement a preview of each parsed text document or a mapper from text elements to business values. Do you have an abstraction like this? Is it a good plan to create one?
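A rough sketch of the pages / areas / text elements hierarchy described above might look like the following. All class and field names (`ParsedPage`, `PageArea`, `TextElement`, `kind`) are assumptions made for illustration and do not exist in Tika. It is written without lambdas or streams so that it would compile on Java 7.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** One positioned run of text. Illustrative only, not Tika API. */
final class TextElement {
    final String text;
    final float x, y;          // top-left corner of the bounding box
    final float width, height; // size of the bounding box
    final String fontName;
    final float fontSize;

    TextElement(String text, float x, float y, float width, float height,
                String fontName, float fontSize) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
        this.fontName = fontName;
        this.fontSize = fontSize;
    }
}

/** A region of a page, e.g. a line or a table, holding text elements in reading order. */
final class PageArea {
    final String kind; // assumed free-form label such as "line" or "table"
    private final List<TextElement> elements = new ArrayList<TextElement>();

    PageArea(String kind) { this.kind = kind; }

    void add(TextElement element) { elements.add(element); }

    List<TextElement> elements() { return Collections.unmodifiableList(elements); }
}

/** One page of a parsed document, holding its areas in reading order. */
final class ParsedPage {
    final int number;
    private final List<PageArea> areas = new ArrayList<PageArea>();

    ParsedPage(int number) { this.number = number; }

    void add(PageArea area) { areas.add(area); }

    List<PageArea> areas() { return Collections.unmodifiableList(areas); }

    /** Plain-text view: elements joined by spaces, areas joined by newlines. */
    String text() {
        StringBuilder sb = new StringBuilder();
        boolean firstArea = true;
        for (PageArea area : areas) {
            if (!firstArea) sb.append('\n');
            firstArea = false;
            boolean firstElement = true;
            for (TextElement element : area.elements()) {
                if (!firstElement) sb.append(' ');
                firstElement = false;
                sb.append(element.text);
            }
        }
        return sb.toString();
    }
}
```

A preview or a business-value mapper could walk this structure without knowing which parser produced it. Whether it should be exposed as a DOM, a stream or events is left open here.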
- TikaGUI on Java 7 - done. https://github.com/apache/tika/pull/167