Cannot extract image from pptx, docx

Open amikai opened this issue 5 months ago • 4 comments

go-fitz version: v1.24.15 file:

Code

package main

import (
	"fmt"
	"image/jpeg"
	"os"

	"github.com/gen2brain/go-fitz"
)

func main() {
	doc, err := fitz.New("example.docx")
	if err != nil {
		panic(err)
	}
	defer doc.Close()

	// Render the first page (index 0) as an image
	img, err := doc.Image(0)
	if err != nil {
		panic(err)
	}

	// Write the rendered page to disk as a JPEG.
	f, err := os.Create("example.jpg")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	err = jpeg.Encode(f, img, &jpeg.Options{Quality: jpeg.DefaultQuality})
	if err != nil {
		panic(err)
	}
}

Stdout output:

warning: dropping unclosed output

The output image is a blank JPEG: example.jpg

Jul 01 '25 00:07 amikai

I misunderstood: I thought go-fitz could extract the images embedded in a document. Closing this.

Jul 01 '25 00:07 amikai

@amikai Yes, that is exactly what it can do; it's stated in the README. For broken files, you probably need a newer MuPDF version (I recall a changelog entry mentioning this).

Jul 01 '25 03:07 gen2brain

@gen2brain Do you mean I can extract images from .docx, .xlsx, and .pptx files using go-fitz? In my case, I created the file using Microsoft Word on a Mac. I just want to understand why these files are broken.

Jul 01 '25 07:07 amikai

Yes, but it does not extract embedded images, if that is what you mean. It renders the document, and you can export each page as an image. I have no idea what is broken or why; go-fitz is a wrapper for MuPDF, and I don't know its internals. The problem may already be fixed upstream, so you can try using an external MuPDF library (a newer version) instead of the bundled one. Either way, the fix will not happen in this repo; I update the bundled libraries from time to time.

Jul 01 '25 08:07 gen2brain