PHPDocumentParser

[Question] About end of file

Open JCarlosR opened this issue 7 years ago • 7 comments

Hi. Thank you for creating PHPDocumentParser.

I am using it in my project and the content is extracted correctly.

However, there are always some strange characters at the end of the output (for .doc files):

endoffile

Do you have any idea how to fix it?

JCarlosR avatar Sep 13 '17 15:09 JCarlosR

I will take a look as soon as I can. Could you send through the original doc, please?

LukeMadhanga avatar Sep 13 '17 15:09 LukeMadhanga

I only see those characters for .doc files.

I am attaching 2 files: Docs.zip

Thank you.

JCarlosR avatar Sep 13 '17 15:09 JCarlosR

@JCarlosR This is going to take a little bit longer to investigate than I anticipated! I'll try to get back to you tomorrow! Thanks for the kind words by the way!

LukeMadhanga avatar Sep 13 '17 21:09 LukeMadhanga

@JCarlosR btw, as these docs are public, are you sure that you still want them up here, or are you okay with that?

LukeMadhanga avatar Sep 13 '17 21:09 LukeMadhanga

Don't worry, I have no problem with that. And thank you for your time.

JCarlosR avatar Sep 13 '17 21:09 JCarlosR

@JCarlosR In your environment, do you have antiword installed? If you are able to install it, please do so whilst I try to fix this the vanilla way.

LukeMadhanga avatar Sep 14 '17 07:09 LukeMadhanga

@LukeMadhanga Yes. I typed antiword in cmd (I am using Windows) and it shows:

Name: antiword
Purpose: Display MS-Word files
Author: (C) 1998-2005 Adri van Os
Version: 0.37  (21 Oct 2005)
Status: GNU General Public License
Usage: antiword [switches] wordfile1 [wordfile2 ...]
Switches: [-f|-t|-a papersize|-p papersize|-x dtd][-m mapping][-w #][-i #][-Ls]
        -f formatted text output
        -t text output (default)
        -a <paper size name> Adobe PDF output
        -p <paper size name> PostScript output
           paper size like: a4, letter or legal
        -x <dtd> XML output
           like: db (DocBook)
        -m <mapping> character mapping file
        -w <width> in characters of text output
        -i <level> image level (PostScript only)
        -L use landscape mode (PostScript only)
        -r Show removed text
        -s Show hidden (by Word) text

I also have an Ubuntu VPS, and the extracted content there has the same strange characters at the end. Thank you.

JCarlosR avatar Sep 14 '17 15:09 JCarlosR
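
For readers following along on a Unix-like machine such as the Ubuntu VPS mentioned above, here is a minimal sketch of the antiword check discussed in this thread. It only tests whether antiword is on the PATH and, if so, runs it with the `-t` switch from the usage message quoted above. The function name `extract_doc_text` and the `ANTIWORD_BIN` variable are hypothetical, introduced purely for illustration; this is not how PHPDocumentParser itself calls antiword, and it makes no claim about what causes the trailing characters.

```shell
#!/bin/sh
# Sketch only: extract_doc_text and ANTIWORD_BIN are hypothetical names,
# not part of PHPDocumentParser or antiword.

extract_doc_text() {
    doc_path="$1"
    # Allow the binary name to be overridden; default to "antiword".
    bin="${ANTIWORD_BIN:-antiword}"

    # command -v succeeds only if the program can be found.
    if ! command -v "$bin" >/dev/null 2>&1; then
        echo "antiword not found on PATH" >&2
        return 127
    fi

    # -t requests plain text output (the default, per the usage message).
    "$bin" -t "$doc_path"
}
```

Running `extract_doc_text sample.doc | tail -n 5` (with `sample.doc` standing in for one of the affected files) would show whether the end of antiword's own output contains the same strange characters, which may help narrow down where they come from.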