[TIKA-4315] Fix XPS whitespace not being emitted
Fixes TIKA-4315 by emitting ignorable whitespace.
This is not a perfect solution because it only checks whether a space is already included in the characters being output. It might be better to emit whitespace based on the distance between the pieces of text.
Any feedback is appreciated.
Y, I now remember this issue with XPS. This is the same challenge with PDFs. Interpolating whitespace has to be determined based on character width and the distance between text blocks, which means that you have to sort text chunks so that they're on the same "line", and you have to get reading order correct, etc.
Thank you for the feedback, that makes sense. I'll take a look at the PDF implementation and see if I can implement this properly.
The PDF implementation can be found in PDFTextStripper.java in the PDFBox project.
I've had a go at implementing this; the changes are in c42466f. I can squash into one commit if that is preferred.
The implementation uses the Indices string that is provided in XPS. It is a list of information for each glyph in a run. The useful piece of information is the advance, which, according to https://learn.microsoft.com/en-us/windows/win32/api/xpsobjectmodel/ns-xpsobjectmodel-xps_glyph_index, is measured in 1/100 em. Using this we can calculate the distance between runs and decide, based on a threshold, whether whitespace should be inserted. I have added some test XPS files that I made to test this.
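For anyone following along, here is a minimal sketch of the idea, not the code in the commit. It assumes each semicolon-separated entry in the Indices string carries the advance as its second comma-separated field, and that run positions and the font size share the same units. The class name, method names, and the threshold value are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Tika.
final class IndicesAdvanceSketch {

    /**
     * Returns one advance per glyph entry, converted from 1/100 em to em.
     * Entries that carry no advance are reported as NaN.
     * Assumes the advance is the second comma-separated field of an entry.
     */
    static List<Double> parseAdvancesEm(String indices) {
        List<Double> advances = new ArrayList<>();
        if (indices == null || indices.isEmpty()) {
            return advances;
        }
        for (String entry : indices.split(";", -1)) {
            String[] fields = entry.split(",", -1);
            boolean hasAdvance = fields.length > 1 && !fields[1].trim().isEmpty();
            advances.add(hasAdvance
                    ? Double.parseDouble(fields[1].trim()) / 100.0
                    : Double.NaN);
        }
        return advances;
    }

    /**
     * Decides whether the horizontal gap between two runs is wide enough
     * to warrant a space. The gap is normalized by the font size so that
     * the threshold is expressed in em.
     */
    static boolean needsSpace(double prevRunEndX, double nextRunStartX,
                              double fontSize, double thresholdEm) {
        double gapEm = (nextRunStartX - prevRunEndX) / fontSize;
        return gapEm > thresholdEm;
    }
}
```

The threshold is a tuning knob: too low and kerning gaps become spaces, too high and real word breaks are missed.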
This implementation has some assumptions and limitations. The main one is that we do not get the glyph advance value for the last glyph in a run. In that case I have used the average advance, or 0.5, as a fallback.
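A rough sketch of that fallback, under the assumption that it means "use the average of the known advances in the run, or 0.5 when none are known", and that 0.5 is in em. Missing advances are represented as NaN here; the names are hypothetical and this is not the code in the commit.

```java
// Hypothetical helper, for illustration only; not part of Tika.
final class RunWidthSketch {

    // Assumed default advance (in em) when no advance in the run is known.
    static final double DEFAULT_ADVANCE_EM = 0.5;

    /**
     * Estimates the width of a run in em. Glyphs whose advance is unknown
     * (NaN) are given the average of the known advances, or the default
     * when the run has no known advances at all.
     */
    static double estimateRunWidthEm(double[] advancesEm) {
        double knownSum = 0.0;
        int knownCount = 0;
        for (double advance : advancesEm) {
            if (!Double.isNaN(advance)) {
                knownSum += advance;
                knownCount++;
            }
        }
        double fallback = knownCount > 0 ? knownSum / knownCount : DEFAULT_ADVANCE_EM;
        int missingCount = advancesEm.length - knownCount;
        return knownSum + missingCount * fallback;
    }
}
```

Multiplying the result by the font size gives an estimate of where the run ends, which is what the gap check between runs needs.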
It also sorts the runs in LTR order unless all runs in a row are RTL. This may be incorrect for rows that mix LTR runs with multiple RTL runs, but I am not knowledgeable in this area.
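The ordering rule can be sketched roughly as follows, assuming each run in a row exposes its x origin and a direction flag. The `Run` type and `sortRow` are hypothetical names, and mixed-direction rows simply fall through to LTR ordering, which is the case flagged above as possibly wrong.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Tika.
final class RowOrderSketch {

    /** Minimal stand-in for a text run on a row. */
    static final class Run {
        final String text;
        final double originX;
        final boolean rtl;

        Run(String text, double originX, boolean rtl) {
            this.text = text;
            this.originX = originX;
            this.rtl = rtl;
        }
    }

    /**
     * Returns the runs of one row in reading order: ascending x by default,
     * descending x only when every run in the row is RTL.
     */
    static List<Run> sortRow(List<Run> row) {
        boolean allRtl = !row.isEmpty() && row.stream().allMatch(r -> r.rtl);
        Comparator<Run> byX = Comparator.comparingDouble(r -> r.originX);
        List<Run> sorted = new ArrayList<>(row);
        sorted.sort(allRtl ? byX.reversed() : byX);
        return sorted;
    }
}
```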
Any feedback is appreciated :)
Wow. That's fantastic. Thank you!