getText() returns text without any spaces when using a pdf from google docs

Open veepdotai opened this issue 1 year ago • 19 comments

  • PHP Version: PHP 8.3.0 (cli) (built: Nov 24 2023 13:48:03) (NTS) Copyright (c) The PHP Group Zend Engine v4.3.0, Copyright (c) Zend Technologies with Zend OPcache v8.3.0, Copyright (c), by Zend Technologies

  • PDFParser Version: "smalot/pdfparser": "^2.8"

Description:

If I create a document in Google Docs and download it as a PDF from Google, I just get text without any spaces when parsing it with pdfparser.

PDF input

test-vgwg.pdf

Expected output & actual output

input in google docs : Very good work guys Thanks for everything.

download as pdf

I just run the code below

I get the following output:

Verygood workguys Thanksforeverything.

Code

    use Smalot\PdfParser\Parser;

    $file = "test-vgwg.pdf";
    $parser = new Parser();
    $pdf = $parser->parseFile($file);
    $output = $pdf->getText();
    var_dump($output);

veepdotai avatar Feb 09 '24 18:02 veepdotai

This seems to have something to do with how I've used the negative of the current position factor in PDFObject.php.

$factorX = -$current_font_size * $current_position_tm['a'] - $current_font_size * $current_position_tm['i'];

When I change this to positive:

$factorX = $current_font_size * $current_position_tm['a'] + $current_font_size * $current_position_tm['i'];

Then the OP's sample document prints with the proper spaces. However, I think changing this line also breaks a lot of the unit tests. Somehow Google Docs is playing around with negative values that I haven't accounted for here. I'll have to look into it more.

GreyWyvern avatar Feb 13 '24 16:02 GreyWyvern

I've been having the same issue. It looks like v2.9 does not address this bug. Any update?

LaRaye avatar Apr 02 '24 22:04 LaRaye

Can someone fix it please?

lopatin96 avatar Apr 09 '24 02:04 lopatin96

I'm looking at this and with the initial changes that read the OP's doc correctly, I can whittle it down to 3 unit test failures. I'm studying the failures to see if they are actually valid, or if more tweaking is needed.

GreyWyvern avatar Apr 09 '24 18:04 GreyWyvern

@GreyWyvern thank you very much for all your progress invested in the development of this project! I really appreciate it! Please tell me, are there any changes regarding this “bug”?

lopatin96 avatar May 08 '24 19:05 lopatin96

I'm still working on this. The fix involves using the matrix from cm commands as well as the Td and TD commands. Right now PdfParser only uses them from the Td and TD commands. However, while just inserting it gets me 98% of the way to a fix, there are two or three unit test PDFs where if I "fix" it for one, the other two break, and vice versa. 😩

Hopefully soon!
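For context on the matrices mentioned above: in the PDF imaging model, a cm operator concatenates a matrix onto the current transformation matrix (CTM), and text ends up positioned by the text matrix multiplied with the CTM. Below is a minimal sketch of that concatenation, assuming matrices are stored as six-element arrays [a, b, c, d, e, f]. The function name concatMatrix and the sample values are hypothetical; this is not the code in PdfParser or in any fork.

```php
<?php
/**
 * Multiply two PDF transformation matrices given as [a, b, c, d, e, f],
 * using the row-vector convention of the PDF specification ($m is applied first).
 * Hypothetical helper for illustration only; not PdfParser's internal API.
 */
function concatMatrix(array $m, array $n): array
{
    [$a1, $b1, $c1, $d1, $e1, $f1] = $m;
    [$a2, $b2, $c2, $d2, $e2, $f2] = $n;

    return [
        $a1 * $a2 + $b1 * $c2,
        $a1 * $b2 + $b1 * $d2,
        $c1 * $a2 + $d1 * $c2,
        $c1 * $b2 + $d1 * $d2,
        $e1 * $a2 + $f1 * $c2 + $e2,
        $e1 * $b2 + $f1 * $d2 + $f2,
    ];
}

// Made-up sample values: a text matrix and a cm matrix that each flip the y-axis.
$tm  = [1.0, 0.0, 0.0, -1.0, 72.0, 100.0];
$ctm = [0.75, 0.0, 0.0, -0.75, 0.0, 792.0];

$combined = concatMatrix($tm, $ctm);
```

In this made-up example the text matrix on its own has a negative d entry, while the combined matrix has a positive one. That illustrates why logic that depends on the sign of a matrix entry may need the cm matrix as well.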

GreyWyvern avatar May 14 '24 18:05 GreyWyvern

Got it. Will keep an eye out. Really appreciate all your efforts!!

LaRaye avatar May 14 '24 19:05 LaRaye

@GreyWyvern Got it. Thanks for the info and good luck solving this problem.

lopatin96 avatar May 16 '24 01:05 lopatin96

@GreyWyvern is there any progress in the meantime? Due to this problem, the PDFParser is currently useless for me! I have the feeling that not only spaces are missing, but also \t. Is there any workaround before a new version? Thanks a lot.

hgalt avatar Jun 18 '24 09:06 hgalt

Can you try this fork, @hgalt ? https://github.com/GreyWyvern/pdfparser/tree/google-docs

Does it solve your problem? I've boiled down a HUGE amount of changes to the small edits in the fork above. I like the fix (if it solves your issue!) because it's simple, but it has the consequence of adding unnecessary tabs in several other files. Extra tabs are definitely a smaller issue than a complete lack of spaces (extra whitespace can easily be stripped by the user), so it might be good to send this as-is as a PR.
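The whitespace stripping mentioned above could look roughly like the sketch below. It assumes the text returned by getText() is post-processed in user code; the helper name collapseTabs is hypothetical and not part of PdfParser.

```php
<?php
/**
 * Collapse runs of tabs and spaces into single spaces and trim each line.
 * Hypothetical post-processing helper; not part of PdfParser.
 */
function collapseTabs(string $text): string
{
    // Replace every run of tabs and/or spaces with one space.
    $text = preg_replace('/[ \t]+/', ' ', $text) ?? $text;

    // Remove spaces left at the start or end of each line.
    return preg_replace('/^ +| +$/m', '', $text) ?? $text;
}

// Example input with stray tabs, similar to what extra-tab output might contain.
$clean = collapseTabs("Very\tgood \t work\tguys\nThanks\t\tfor everything. ");
```

Note that this also flattens tabs that carry meaning, such as column separators in tables, so it only suits documents where tabs are not needed.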

GreyWyvern avatar Jun 18 '24 18:06 GreyWyvern

@GreyWyvern For me, additional tabs are not a problem (I can then remove them using my code), it is much worse if there are no spaces between words. Please send it as a PR. And thanks for the work done!

lopatin96 avatar Jun 18 '24 18:06 lopatin96

@GreyWyvern Same here! Difficult to parse text with no whitespace. Extra tabs can be removed. Appreciate your hard work!

LaRaye avatar Jun 18 '24 19:06 LaRaye

@GreyWyvern I must be doing something wrong, because I cannot load the branch via Composer. Under require I added "greywyvern/pdfparser": "google-docs" and I get the error "Could not parse version constraint".
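For reference, Composer expects branch names to be written with a dev- prefix, and a fork that is not published on Packagist has to be declared as a VCS repository. A composer.json fragment along these lines may work. It assumes the fork's own composer.json still declares the package name smalot/pdfparser, which has not been verified here.

```json
{
    "repositories": [
        { "type": "vcs", "url": "https://github.com/GreyWyvern/pdfparser" }
    ],
    "require": {
        "smalot/pdfparser": "dev-google-docs"
    }
}
```

If the fork declares a different package name, the key under require would need to match that name instead.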

hgalt avatar Jun 19 '24 09:06 hgalt

@GreyWyvern got it working without Composer. This fork works for me! Great job, thanks for your effort.

hgalt avatar Jun 20 '24 13:06 hgalt

Hi, any news about this?

I'm having the same problem with the following pdf, made using WPS Office:

small_pdf.pdf

I tried the google-docs branch and the output still comes without spaces ;/

gilney-canaltelecom avatar Sep 16 '24 20:09 gilney-canaltelecom