textract

PPT Support?

Open shuttz opened this issue 9 years ago • 15 comments

Pre-2007 PowerPoint

shuttz avatar Aug 15 '14 03:08 shuttz

I've looked at this in the past and haven't found a great way to get it done. One option would be to use another tool to convert it to, for instance, a .docx or something along those lines. If you know of any tools or libraries that can extract from PPT let me know.

dbashford avatar Aug 15 '14 23:08 dbashford

Looked into this again. Didn't find any suitable options.

If anyone has ideas, speak up. =)

dbashford avatar Aug 21 '14 03:08 dbashford

There seems to be a catppt tool mentioned on the catdoc page: http://www.wagner.pp.ru/%7Evitus/software/catdoc/

subutux avatar Aug 23 '14 10:08 subutux

Ya, I've seen that, but I've been having a hell of a time trying to get catdoc installed locally since I installed Mavericks a while back. Needed to take a break from pounding my head against a wall.

dbashford avatar Aug 23 '14 12:08 dbashford

@dbashford @subutux @shuttz this can be done in pure JS. http://msdn.microsoft.com/en-us/library/office/cc313106.aspx is the specification. Most of the code can be pulled from the XLS parser https://github.com/SheetJS/js-xls .

I am somewhat confused, however, by the ordering of the text in the PPTX extractor. I tested with some PPTX files generated in PowerPoint 2011 and 2013, and some slides seem to appear before others.

https://www.dropbox.com/s/tt7o0r841sgqlwt/layout_types_2011.pptx is an example:

$ textract layout_types_2011.pptx | grep ^Slide
Slide 11: Vertical Title and text
Slide 10: Vertical Text
Slide 1 Title
Slide 1 Subtitle
Slide 2: Title and Content
Slide 3: Section header
Slide 4: Two-Content
Slide 5: Comparison
Slide 8: Content w/Caption
Slide 9: picture with caption

I made the slides in order, so it's not obvious to me why they are not sorted properly.

SheetJSDev avatar Aug 23 '14 16:08 SheetJSDev

Starts with the 1s, so a string sort, eh? That almost holds up, except within the 1s it's sorting in reverse. In that case you'd expect it to be:

Slide 1 Subtitle
Slide 1 Title
Slide 10: Vertical Text
Slide 11: Vertical Title and text

I'll take the one you stuck out and play with it.
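
For reference, here is a minimal sketch of the string-versus-numeric ordering difference discussed above. It assumes slide parts named like ppt/slides/slideN.xml inside the .pptx archive; the `names` list and the `slideNumber` helper are hypothetical and not taken from textract. It does not explain the reversed order within the 1s, so it is only an illustration of the general pitfall, not a diagnosis.

```javascript
// Hypothetical part names, as they might be listed from a .pptx archive.
const names = [
  "ppt/slides/slide1.xml",
  "ppt/slides/slide10.xml",
  "ppt/slides/slide11.xml",
  "ppt/slides/slide2.xml",
  "ppt/slides/slide9.xml",
];

// Extract the numeric slide index from a part name (NaN if none is found).
function slideNumber(name) {
  const match = /slide(\d+)\.xml$/.exec(name);
  return match ? parseInt(match[1], 10) : NaN;
}

// The default sort compares strings character by character,
// so "slide10" and "slide11" land before "slide2".
const byString = [...names].sort();

// Sorting on the extracted number restores presentation order.
const byNumber = [...names].sort((a, b) => slideNumber(a) - slideNumber(b));

console.log(byString.map(slideNumber));
console.log(byNumber.map(slideNumber));
```

Sorting on the parsed number rather than the raw name keeps the order stable regardless of how many slides the deck has.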

dbashford avatar Aug 23 '14 17:08 dbashford

Made a new issue #27, and addressed the problem. New release out.

I'll look into the links you had for the ppt stuff.

dbashford avatar Aug 23 '14 18:08 dbashford

Hrm, think that looks like a ton of work.

dbashford avatar Aug 23 '14 18:08 dbashford

@subutux catppt doesn't seem to work.

https://www.dropbox.com/s/gyo76tzjxqm0ft0/layout_types_2011.ppt?dl=0 is the converted PPT file, and catppt shows:

$ catppt layout_types_2011.ppt
Office Theme

SheetJSDev avatar Aug 25 '14 18:08 SheetJSDev

@subutux @shuttz @dbashford I took a swing at it. The node module ppt https://github.com/SheetJS/js-ppt installs a binary ppt whose sole purpose is to dump the text.

It is reasonably consistent. I think there is an issue in the XML:

$ diff <(ppt layout_types_2011.ppt) <(textract layout_types_2011.pptx)
53c53,54
< … or maybe rotating the projector?  That seems so barbaric
---
> … or maybe rotating the projector? That seems so barbaric
>

The PPT form has two spaces between the ? and the T, but textract only preserves one.
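
If it helps with integration, here is a rough sketch of shelling out to the ppt binary from Node. It assumes the binary is on the PATH and prints the extracted text to stdout, as in the diff above; the `dumpText` helper and its `command` parameter are hypothetical names for illustration, not part of js-ppt or textract.

```javascript
const { execFileSync } = require("child_process");

// Run a text-dumping command on a file and return its stdout as a string.
// `command` defaults to the `ppt` binary; it is assumed to be on the PATH.
// Throws if the command is missing or exits with a non-zero status.
function dumpText(file, command = "ppt") {
  return execFileSync(command, [file], { encoding: "utf8" });
}

// Usage (assumes js-ppt is installed so that `ppt` is available):
// const text = dumpText("layout_types_2011.ppt");
```

The synchronous call keeps the sketch short; an asynchronous variant with execFile would probably suit a library better.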

SheetJSDev avatar Aug 28 '14 00:08 SheetJSDev

Good stuff!

The nuking of two spaces to one was a conscious decision on my part. It doesn't affect the text.

When you say there is an issue in the XML, what do you mean?

dbashford avatar Aug 28 '14 00:08 dbashford

The nuking of two spaces to one was a conscious decision on my part. It doesn't affect the text.

You answered the question. I initially thought it might have been an issue with space preservation in the XML.

SheetJSDev avatar Aug 28 '14 00:08 SheetJSDev

In the end, what I'm doing is an across-the-board reduction of any run of more than one space down to a single space. This is mostly to avoid things like...

end of sentence.                  And start of a new one

...which tend to pop up with oddly formatted documents of all types.
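
As an illustration of that reduction, here is a minimal sketch using a regular expression. The `collapseSpaces` helper is a hypothetical name and is not taken from textract's source; it only handles the space character, leaving tabs and newlines alone.

```javascript
// Collapse every run of two or more spaces into a single space.
// Hypothetical helper for illustration; not textract's actual code.
function collapseSpaces(text) {
  return text.replace(/ {2,}/g, " ");
}

console.log(
  collapseSpaces("end of sentence.                  And start of a new one")
);
```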

I'll look into adding this soon.

dbashford avatar Aug 28 '14 00:08 dbashford

@SheetJSDev the ppt module is a great start!

shuttz avatar Aug 28 '14 21:08 shuttz

Windows port of ppt/doc/xls to text http://blog.brush.co.nz/2009/09/catdoc-windows/

panamantis avatar Jun 13 '15 17:06 panamantis