textract
PPT Support?
Pre-2007 PowerPoint
I've looked at this in the past and haven't found a great way to get it done. One option would be to use another tool to convert it to, for instance, a .docx or something along those lines. If you know of any tools or libraries that can extract from PPT let me know.
Looked into this again. Didn't find any suitable options.
If anyone has ideas, speak up. =)
There seems to be a catppt mentioned on the catdoc page: http://www.wagner.pp.ru/%7Evitus/software/catdoc/
Ya, I've seen that, but I've been having a hell of a time trying to get catdoc installed locally since I installed Mavericks a while back. Needed to take a break from pounding my head against a wall.
@dbashford @subutux @shuttz this can be done in pure JS. http://msdn.microsoft.com/en-us/library/office/cc313106.aspx is the specification. Most of the code can be pulled from the XLS parser https://github.com/SheetJS/js-xls .
I am somewhat confused, however, by the ordering of the text in the PPTX extractor. I tested with some PPTX files generated in PowerPoint 2011 and 2013, and some slides seem to appear before others.
https://www.dropbox.com/s/tt7o0r841sgqlwt/layout_types_2011.pptx is an example:
$ textract layout_types_2011.pptx | grep ^Slide
Slide 11: Vertical Title and text
Slide 10: Vertical Text
Slide 1 Title
Slide 1 Subtitle
Slide 2: Title and Content
Slide 3: Section header
Slide 4: Two-Content
Slide 5: Comparison
Slide 8: Content w/Caption
Slide 9: picture with caption
I made the slides in order, so it's not obvious to me why they are not sorted properly.
It starts with the 1s, so a string sort, eh? That almost holds up, except within the 1s it's sorting in reverse. In that case you'd expect it to be:
Slide 1 Subtitle
Slide 1 Title
Slide 10: Vertical Text
Slide 11: Vertical Title and text
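To illustrate the string-sort suspicion, here is a minimal sketch. It assumes the slides are read from archive entries named like ppt/slides/slideN.xml (the usual PPTX layout) and ordered by name. The entry list and the slideNumber helper are hypothetical, not textract's actual code, and this does not explain the reversed order within the 1s; it only shows how a plain string sort puts slide10 ahead of slide2.

```javascript
// Hypothetical entry names, as they might be listed from a .pptx archive.
const slides = [
  'ppt/slides/slide10.xml',
  'ppt/slides/slide2.xml',
  'ppt/slides/slide1.xml',
  'ppt/slides/slide11.xml',
];

// Array.prototype.sort with no comparator compares elements as strings,
// so "slide10" and "slide11" land before "slide2".
const lexicographic = [...slides].sort();

// Hypothetical helper: pull the numeric suffix out of an entry name.
function slideNumber(name) {
  const match = /slide(\d+)\.xml$/.exec(name);
  return match ? parseInt(match[1], 10) : Number.MAX_SAFE_INTEGER;
}

// Sorting on the parsed number gives the order the slides were authored in.
const numeric = [...slides].sort((a, b) => slideNumber(a) - slideNumber(b));

console.log(lexicographic);
// slide1, slide10, slide11, slide2
console.log(numeric);
// slide1, slide2, slide10, slide11
```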
I'll take the one you stuck out and play with it.
Made a new issue #27, and addressed the problem. New release out.
I'll look into the links you had for the ppt stuff.
Hrm, think that looks like a ton of work.
@subutux catppt doesn't seem to work.
https://www.dropbox.com/s/gyo76tzjxqm0ft0/layout_types_2011.ppt?dl=0 is the converted PPT file, and catppt shows:
$ catppt layout_types_2011.ppt
Office Theme
@subutux @shuttz @dbashford I took a swing at it. The node module ppt (https://github.com/SheetJS/js-ppt) installs a binary, ppt, whose sole purpose is to dump the text.
It is reasonably consistent. I think there is an issue in the XML:
$ diff <(ppt layout_types_2011.ppt) <(textract layout_types_2011.pptx)
53c53,54
< … or maybe rotating the projector?  That seems so barbaric
---
> … or maybe rotating the projector? That seems so barbaric
>
The PPT form has two spaces between the ? and the T, but textract only preserves one.
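For anyone who wants to call the ppt binary from Node rather than from the shell, a rough sketch follows. It assumes the binary is on the PATH and prints the extracted text to stdout when given a file path, as in the diff command above. The dumpPptText wrapper and its run parameter are hypothetical names, not part of the ppt module or textract.

```javascript
const { execFile } = require('child_process');

// Hypothetical wrapper: run the `ppt` binary on a file and resolve with
// whatever it writes to stdout. `run` defaults to child_process.execFile
// and can be replaced with a stub when the binary is not installed.
function dumpPptText(filePath, run = execFile) {
  return new Promise((resolve, reject) => {
    // Raise maxBuffer so large decks do not overflow the default limit.
    run('ppt', [filePath], { maxBuffer: 10 * 1024 * 1024 }, (error, stdout) => {
      if (error) {
        reject(error);
      } else {
        resolve(stdout);
      }
    });
  });
}

// Usage (requires the ppt binary to be installed):
// dumpPptText('layout_types_2011.ppt').then(console.log, console.error);
```

Passing the path as an argument array to execFile avoids shell quoting problems with file names that contain spaces.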
Good stuff!
The nuking of two spaces to one was a conscious decision on my part. It doesn't affect the text.
When you say there is an issue in the XML, what do you mean?
"The nuking of two spaces to one was a conscious decision on my part. It doesn't affect the text."
You answered the question. I initially thought it might have been an issue with space preservation in the XML.
In the end, what I'm doing is an across-the-board reduction of any run of more than one space down to a single space. This is more so to avoid things like...
end of sentence. And start of a new one
...which tend to pop up with oddly formatted documents of all types.
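As a rough illustration of that reduction (a sketch only; collapseSpaces is a hypothetical helper, not necessarily how textract implements it), a single regular expression replace is enough:

```javascript
// Hypothetical helper: collapse every run of two or more space characters
// into a single space. Tabs and newlines are left untouched.
function collapseSpaces(text) {
  return text.replace(/ {2,}/g, ' ');
}

const sample = 'end of sentence.     And start of a new one';
console.log(collapseSpaces(sample));
// prints "end of sentence. And start of a new one"
```

Matching only the space character, rather than all whitespace, keeps line breaks between paragraphs and slides intact.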
I'll look into adding this soon.
@SheetJSDev the ppt module is a great start!
Windows port of ppt/doc/xls to text http://blog.brush.co.nz/2009/09/catdoc-windows/