
extracting data from headers

Open aster94 opened this issue 7 years ago • 3 comments

hello,

Since you use antiword to convert .doc -> .docx, you are losing the headers of these files. I was unable to find a GitHub page for antiword, so I thought I would write here. Do you think there is a way to extract these headers from .doc files?

thanks for this good package by the way!

aster94 avatar Mar 01 '18 16:03 aster94

Yes, I'm trying to get the header of a .doc file, which antiword is not able to output. I tried the shell version of antiword against the .doc file of interest. I'm digging into the documentation to find a way for textract to do the same, and I'll post if something good turns up.

deepakjoseph08 avatar Sep 05 '19 11:09 deepakjoseph08

Textract uses docx2txt to parse docx files and antiword for doc files. There's no conversion from doc to docx happening. Antiword cannot extract the header, and I'm not aware of any other tool that can. docx2txt should automatically include the header, as you can see here.

Are you aware of any Python package or command line tool that can parse doc files including their header? If so, I'll happily review your PR that adds support for it to textract.

jpweytjens avatar Sep 05 '19 14:09 jpweytjens

A quick search led me to wv, abiword, libreoffice and catdoc. I only had success extracting the header with catdoc, but all the other tools produced much better results for the body text.
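For anyone who wants to repeat this kind of comparison on their own files, here is a minimal sketch. It is not part of textract. It assumes the tools are installed on `PATH` and that each one can be invoked as `tool file.doc` and prints plain text to stdout, which is how I understand antiword and catdoc to behave by default. The helper names `run_tool` and `compare_tools` are made up for illustration.

```python
import shutil
import subprocess

# Candidate command-line tools named in this thread. Assumption: each one
# accepts a .doc path as its only argument and prints plain text to stdout.
CANDIDATE_TOOLS = ("antiword", "catdoc")


def run_tool(tool, path):
    """Run `tool path` and return whatever it wrote to stdout as text."""
    result = subprocess.run([tool, path], capture_output=True, check=True)
    # Decode leniently: the output encoding can differ between tools.
    return result.stdout.decode("utf-8", errors="replace")


def compare_tools(path, tools=CANDIDATE_TOOLS):
    """Return {tool: extracted_text} for every tool found on PATH."""
    outputs = {}
    for tool in tools:
        if shutil.which(tool) is None:
            continue  # tool is not installed, so skip it
        outputs[tool] = run_tool(tool, path)
    return outputs
```

Calling `compare_tools("example.doc")` (a placeholder file name) returns one entry per installed tool, so you can check by eye which output contains the header text and which has the cleaner body text.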

If anyone knows of any better tools, feel free to open a PR.

jpweytjens avatar Sep 05 '19 14:09 jpweytjens