[TIKA-4315] Fix XPS whitespace not being emitted

Open ruwi-next opened this issue 1 year ago • 1 comments

Fixes TIKA-4315 by emitting ignorable whitespace.

This is not a perfect solution, because it only checks whether a space is already included in the characters output. It might be better to emit whitespace based on the distance between pieces of text.

Any feedback is appreciated.

ruwi-next avatar Oct 01 '24 14:10 ruwi-next

Yes, I now remember this issue with XPS. This is the same challenge as with PDFs. Whether to interpolate whitespace has to be determined from character width and the distance between text blocks, which means that you have to sort text chunks so that they're on the same "line", and you have to get reading order correct, etc.
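A minimal sketch of the "same line" grouping step described above, not Tika's or PDFBox's actual code. The `Chunk` type, the `groupIntoLines` name, and the `yTolerance` parameter are hypothetical, and the sketch assumes each chunk carries the (x, y) origin of its baseline, with simple left-to-right reading order within a line.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class LineGroupingSketch {

    /** Hypothetical text chunk: baseline origin (x, y) plus its text. */
    static final class Chunk {
        final double x;
        final double y;
        final String text;

        Chunk(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Groups chunks into lines. A chunk joins the current line when its
     * baseline is within yTolerance of the first chunk placed on that line.
     * Lines come out top to bottom, and each line is sorted left to right.
     */
    static List<List<Chunk>> groupIntoLines(List<Chunk> chunks, double yTolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y));

        List<List<Chunk>> lines = new ArrayList<>();
        List<Chunk> current = null;
        double lineY = 0;
        for (Chunk c : sorted) {
            if (current == null || Math.abs(c.y - lineY) > yTolerance) {
                current = new ArrayList<>();
                lines.add(current);
                lineY = c.y; // baseline of the first chunk on this line
            }
            current.add(c);
        }
        for (List<Chunk> line : lines) {
            line.sort(Comparator.comparingDouble((Chunk c) -> c.x));
        }
        return lines;
    }
}
```

Once chunks are grouped and ordered this way, the gap between neighbouring chunks on a line can be compared against the character width to decide whether a space belongs between them.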

tballison avatar Oct 16 '24 17:10 tballison

Thank you for the feedback, that makes sense. I'll take a look at the PDF implementation and see if I can implement this properly.

ruwi-next avatar Oct 21 '24 09:10 ruwi-next

The PDF implementation can be found in PDFTextStripper.java in the PDFBox project.

THausherr avatar Oct 21 '24 10:10 THausherr

I've had a go at implementing this; the changes are in c42466f. I can squash them into one commit if that is preferred.

The implementation uses the Indices string that is provided in XPS. It is a list of information for each glyph in a run. The useful piece of information is the advance, which, according to https://learn.microsoft.com/en-us/windows/win32/api/xpsobjectmodel/ns-xpsobjectmodel-xps_glyph_index, is measured in 1/100 em. Using this we can calculate the distance between runs and decide, based on a threshold, whether whitespace should be inserted. I have added some test XPS files that I made to test this.
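The idea can be sketched roughly as follows. This is not the code from the commit: the class and method names are hypothetical, and the sketch assumes each semicolon-separated entry of the Indices string has the form `glyphIndex,advance,...` with the advance optional, and that an advance in 1/100 em converts to page units as `advance * emSize / 100`.

```java
import java.util.ArrayList;
import java.util.List;

final class XpsAdvanceSketch {

    /**
     * Returns one element per glyph entry: the advance in 1/100 em,
     * or null where the entry omits it. The entry layout is an assumption.
     */
    static List<Double> parseAdvances(String indices) {
        List<Double> advances = new ArrayList<>();
        if (indices == null || indices.isEmpty()) {
            return advances;
        }
        // limit -1 keeps trailing empty entries such as "41,60;;"
        for (String entry : indices.split(";", -1)) {
            String[] parts = entry.split(",", -1);
            Double advance = null;
            if (parts.length > 1 && !parts[1].trim().isEmpty()) {
                advance = Double.parseDouble(parts[1].trim());
            }
            advances.add(advance);
        }
        return advances;
    }

    /** Sums the advances that are present and converts them to page units. */
    static double knownWidth(List<Double> advances, double emSize) {
        double total = 0;
        for (Double a : advances) {
            if (a != null) {
                total += a;
            }
        }
        return total * emSize / 100.0;
    }

    /** True when the gap between two runs exceeds thresholdEm (a fraction of an em). */
    static boolean needsSpace(double prevStartX, double prevWidth,
                              double nextStartX, double emSize, double thresholdEm) {
        double gap = nextStartX - (prevStartX + prevWidth);
        return gap > thresholdEm * emSize;
    }
}
```

The threshold is a tuning parameter; the thread does not state the value used in the commit.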

This implementation makes some assumptions and has some limitations, mainly that we do not get the glyph advance value for the last glyph in a run. I have used the average advance, or 0.5, as a fallback in this case.
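One possible reading of that fallback, as a hedged sketch rather than the code in the commit: use the average of the advances that are present, and fall back to 0.5 only when none are. Treating the 0.5 as 0.5 em is an assumption, and the names below are hypothetical.

```java
final class LastGlyphAdvanceSketch {

    /** Assumed default of 0.5 em when a run carries no advance values at all. */
    static final double DEFAULT_ADVANCE_EM = 0.5;

    /**
     * Estimates a run's width in em. Input advances are in 1/100 em, with
     * null for glyphs whose advance is missing (typically the last one).
     * Missing advances are filled with the run's average, else the default.
     */
    static double estimateWidthEm(Double[] advancesHundredths) {
        double sum = 0;
        int known = 0;
        for (Double a : advancesHundredths) {
            if (a != null) {
                sum += a;
                known++;
            }
        }
        double fallbackEm = known > 0 ? (sum / known) / 100.0 : DEFAULT_ADVANCE_EM;
        int missing = advancesHundredths.length - known;
        return sum / 100.0 + missing * fallbackEm;
    }
}
```

For example, a run with advances 60, 40 and a missing last advance would be estimated at 1.0 em of known width plus an average of 0.5 em for the last glyph.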

It also sorts the runs in a row as LTR unless all runs in the row are RTL. This may be incorrect for cases where a row contains LTR text and multiple runs of RTL text, but I am not knowledgeable in this area.
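The ordering rule can be illustrated with a small sketch. It is not the code from the commit: the `Run` type and its `rtl` flag are hypothetical stand-ins for whatever direction information the parser has, and mixed rows simply fall through to left-to-right order, which is the limitation noted above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RowOrderSketch {

    /** Hypothetical run: start x position, direction flag, and text. */
    static final class Run {
        final double x;
        final boolean rtl;
        final String text;

        Run(double x, boolean rtl, String text) {
            this.x = x;
            this.rtl = rtl;
            this.text = text;
        }
    }

    /**
     * Orders the runs of one row by x: descending when every run is RTL,
     * ascending otherwise (including rows that mix LTR and RTL runs).
     */
    static List<Run> orderRow(List<Run> row) {
        boolean allRtl = !row.isEmpty();
        for (Run r : row) {
            if (!r.rtl) {
                allRtl = false;
                break;
            }
        }
        Comparator<Run> byX = Comparator.comparingDouble((Run r) -> r.x);
        List<Run> ordered = new ArrayList<>(row);
        ordered.sort(allRtl ? byX.reversed() : byX);
        return ordered;
    }
}
```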

Any feedback is appreciated :)

ruwi-next avatar Oct 24 '24 09:10 ruwi-next

Wow. That's fantastic. Thank you!

tballison avatar Oct 24 '24 11:10 tballison