extracting data from headers
hello,
Since you use antiword to convert .doc -> .docx, you are losing the headers of these files. I was unable to find a GitHub page for antiword, so I thought I'd write here. Do you think there is a way to extract these headers from .doc files?
thanks for this good package by the way!
Yes, I'm trying to get the header of a .doc file, which antiword is not able to output. I ran the shell version of antiword against the .doc file of interest and got the same result. I'm digging into the documentation to find a way for textract to do this, and will post if something good turns up.
Textract uses docx2txt to parse .docx files and antiword for .doc files. There's no conversion from .doc to .docx happening. Antiword cannot extract the header, and I'm not aware of any other tool that can. docx2txt should automatically include the header, as you can see here.
Are you aware of any Python package or command line tool that can parse .doc files including their header? If so, I'll happily review your PR that adds support for it to textract.
A quick search led me to wv, abiword, libreoffice and catdoc. I only had success extracting the header with catdoc, but all the other tools produced much better results for the body text.
If anyone knows of any better tools, feel free to open a PR.
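For anyone who wants to repeat the comparison on their own files, here is a rough sketch, not anything textract does today. It assumes antiword and catdoc are installed and on the PATH, and that both print the extracted text to stdout when given a file path. The helper names `run_doc_tool` and `compare_tools` are made up for this example.

```python
import shutil
import subprocess


def run_doc_tool(tool, doc_path, timeout=30):
    """Run a .doc-to-text command line tool and return its stdout as text.

    Returns None if the tool is not installed, fails, or times out.
    Assumption: the tool is invoked as `<tool> <file>` and writes to stdout.
    """
    if shutil.which(tool) is None:
        return None
    try:
        result = subprocess.run(
            [tool, doc_path], capture_output=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    # The output encoding depends on the tool and locale, so decode leniently.
    return result.stdout.decode("utf-8", errors="replace")


def compare_tools(doc_path, tools=("antiword", "catdoc")):
    """Return a dict mapping each tool name to its extracted text (or None)."""
    return {tool: run_doc_tool(tool, doc_path) for tool in tools}


# Example usage (the file name is a placeholder):
# for tool, text in compare_tools("example.doc").items():
#     print(tool, "->", "unavailable" if text is None else len(text), "chars")
```

Searching each result for a string you know is in the header should show quickly which tool picks it up for a given file.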