tabulapdf extract_text without trimming

I am not sure this is an issue per se but I think it would be very useful to preserve the spacing of the text without trimming. For example, if something appeared on screen as

" [whitespace....................]hello world gfdaggfdagfda [whitespace....................]"

Right now, I believe tabulizer would yield

" hello world gfdaggfdagfda"

Another example would be

" hello world [whitespace....................] gfdaggfdagfda [whitespace....................]"

tabulizer might yield

" hello world gfdaggfdagfda "

Perhaps there is already a way to do this and I missed it. Even something like extract_tables(guess=FALSE,columns...) won't do the trick because of the aforementioned trimming issue. The only thing I can think of is literally creating coordinate-by-coordinate columns, like:

extract_tables(file=f,guess=FALSE,pages=1,columns=list(seq(1,900,by=1)))

Perhaps that is the recommended move? But it seems less than ideal, as it is incredibly computationally expensive for what it's doing.

Nov 05 '16 19:11 alanpaulkwan

I think this is an inherent limitation of PDF format. As I understand it, white space is not represented as actual "space" characters but rather as horizontal offsets for the represented text. So, the underlying tabula library has no way of knowing how much space there is because there's nothing there except the horizontal start position of the text. I could be wrong as I'm not a PDF expert, but my fear is your workaround might be the only way to achieve this.

Nov 12 '16 11:11 leeper

That makes sense, although what I'm suggesting would just be about representing the offsets with whitespace. It sounds like what you're saying is that, as far as you know, Tabula doesn't give options to do this. Since your goal is to create an R binding, I suppose it's a feature request to be sent over to the tabula guys?
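To make the idea concrete, here is a minimal base R sketch of representing offsets with whitespace. It assumes the text fragments of one line and their horizontal start positions (in points) are already available. `render_line()` and the fixed `char_width` of 6 points per character are hypothetical, made up for illustration; neither is part of tabulizer, Tabula, or PDFBox.

```r
# Hypothetical helper: rebuild one line of text, padding each fragment
# with spaces so that it starts at a column derived from its x offset.
# `x` is the horizontal start position of each fragment (in points),
# `text` is the fragment content, `char_width` is an assumed fixed
# character width in points (real fonts are rarely fixed width).
render_line <- function(x, text, char_width = 6) {
  ord <- order(x)          # process fragments left to right
  x <- x[ord]
  text <- text[ord]
  out <- ""
  for (i in seq_along(text)) {
    # Character column where this fragment should begin.
    col <- round(x[i] / char_width)
    # Pad up to that column; if fragments overlap, add no padding.
    pad <- max(col - nchar(out), 0)
    out <- paste0(out, strrep(" ", pad), text[i])
  }
  out
}

line <- render_line(x = c(60, 240), text = c("hello world", "gfdaggfdagfda"))
print(line)
```

This only recovers leading and interior spacing. Trailing whitespace would additionally need the page width, since nothing in the fragments marks where the line ends.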

RPoppler / pdftools seems to get close to what I want, but there are some problems there too: some of the text in adjacent lines gets mashed together.

Nov 12 '16 15:11 alanpaulkwan

Oh, actually extract_text() isn't a tabula feature. It just uses PDFBox. If it looks like it is possible directly with PDFBox, I can try to implement it, but I don't think it is possible.

Nov 12 '16 17:11 leeper

I can't figure out how to do it here, but I have a piece of Java code... can I send it to you?

Nov 12 '16 17:11 alanpaulkwan

Thanks. I will take a look as soon as I can.

Nov 12 '16 17:11 leeper

Awesome, thanks. Hoping it helps improve the package!

Nov 12 '16 22:11 alanpaulkwan
