Memory Usage
I have a large PDF to generate, but it seems that the library buffers everything in memory. Can it write the buffer out to disk on demand, and if so, how? Thanks.
Interesting question, @klniu. I will look into this further, but I think the buffer's WriteTo() method may do what you want. There will be a need to keep track of the number of bytes written so that PDF offsets can be calculated. (Currently, the Len() method is used directly.) Also, choosing the appropriate times or places to call WriteTo() will be important to prevent the buffer from growing too large.
Thanks, good suggestion. I will fork the project, make the fpdf buffer public, and try what you said.
Thanks, @klniu, for taking this on.
Rather than making the buffer public, I would add a new API called RawFlush(w io.Writer) with a comment that indicates it is not required for normal PDF construction. (See the comment for RawWriteBuf().) That way, you can maintain a new internal integer field, something like BytesWritten, that records the number of bytes already written. This gets added to f.buffer.Len() to calculate offsets where needed.
Let me know if you hit any snags.
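To make that concrete, here is a minimal sketch of what such a method might look like inside the package; RawFlush and the bytesWritten field are hypothetical names following the suggestion above, not existing gofpdf API:

func (f *Fpdf) RawFlush(w io.Writer) error {
	// Sketch only: drain the internal buffer to w and remember how many
	// bytes have already been emitted, so that offsets can later be
	// computed as f.bytesWritten + int64(f.buffer.Len()).
	n, err := f.buffer.WriteTo(w)
	f.bytesWritten += n
	return err
}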
I recently started using this library, and am also interested in such a feature, for a different reason: my application generates very large (500 MB) PDFs on the fly as downloads. Instead of having the user open a link, wait five minutes, and run into HTTP timeouts, it would be much nicer to be able to write to the http.ResponseWriter after adding each page, the way I already do for on-the-fly generation of large ZIP files. I might try to take this on in a fork if I have time.
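For comparison, the streaming-ZIP pattern referred to above looks roughly like this (using archive/zip from the standard library; the handler and file names are illustrative):

func zipHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/zip")
	w.Header().Set("Content-Disposition", "attachment; filename=bundle.zip")
	// Each entry is copied straight to the ResponseWriter, so nothing large
	// accumulates in memory.
	zw := zip.NewWriter(w)
	defer zw.Close()
	for _, name := range []string{"a.jpg", "b.jpg"} {
		entry, err := zw.Create(name)
		if err != nil {
			log.Printf("zip entry %q: %v", name, err)
			return
		}
		src, err := os.Open(name)
		if err != nil {
			log.Printf("open %q: %v", name, err)
			return
		}
		_, err = io.Copy(entry, src)
		src.Close()
		if err != nil {
			log.Printf("copy %q: %v", name, err)
			return
		}
	}
}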
Interesting idea, @chocolatkey. Implementing the http.ResponseWriter interface (Header, Write, and WriteHeader) should not be a problem. My concern is whether all the really long parts of PDF generation take place before the actual emission of data. Methods like SetPage() allow the code to jump from page to page; this wouldn't be allowed if the page data had already been written. I suggest instrumenting your PDF generation timeline with microsecond logging to see where the time goes. If 4.5 minutes are spent generating content and organizing dictionaries and only the last half minute is actual data streaming, then implementing the http.ResponseWriter interface may not achieve your goals. Keep us posted with what you find!
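For the instrumentation, something as simple as a time.Since probe around each suspected hot spot would do (illustrative only):

start := time.Now()
pdf.AddPage() // or whichever step is under suspicion
log.Printf("AddPage: %d µs", time.Since(start).Microseconds())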
@jung-kurt Oh, it's not the PDF generation or your library that's at fault; it's just how my code happens to work. The PDFs contain high-resolution images. These images are downsized on the fly from very high-resolution sources, and this downsizing is what takes a while; it happens while each image is added to the PDF. Code extract (where w is an http.ResponseWriter):
pdf := gofpdf.NewCustom(&gofpdf.InitType{
	OrientationStr: "P",
	UnitStr:        "pt",
	Size: gofpdf.SizeType{
		Ht: float64(destHeight),
		Wd: float64(destWidth),
	},
})
comment := fmt.Sprintf("%s %s - vips %s", APP_NAME, APP_VERSION, vips.VipsVersion)
pdf.SetCreator(comment, false)
pdf.SetTitle(meta.Title, true)
if len(meta.Publisher) > 0 {
	pdf.SetAuthor(meta.Publisher[0].Name, true)
}

// Start writing the PDF - at this point no more error messages will get through
w.Header().Set("Content-Type", PDF_MIME)
w.Header().Set("Content-Disposition", fmt.Sprintf("attachment; filename=%s_%d.pdf", meta.Title, destHeight))
for j, page := range epubFiles {
	pdf.AddPage()
	if j == 0 {
		pdf.Bookmark("Cover", 0, -1)
	} else {
		for _, navPoint := range nav {
			if page.Name == navPoint.Link {
				pdf.Bookmark(navPoint.Title, 0, -1)
			}
		}
	}
	// Downsize the source image into an in-memory buffer (this is the slow part)
	buf := new(bytes.Buffer)
	err := resizeImage(buf, filepath.Join(basepath, page.Name), destHeight, label)
	if err != nil {
		log.Printf("failed converting file %q for PDF: %v", page.Name, err)
		return
	}
	opts := gofpdf.ImageOptions{
		ImageType: "jpg",
		ReadDpi:   true,
	}
	pdf.RegisterImageOptionsReader(page.Name, opts, buf)
	pdf.ImageOptions(page.Name, 0, 0, float64(destWidth), float64(destHeight), false, opts, 0, "")
}
if err := pdf.Output(w); err != nil {
	log.Printf("failed writing PDF: %v", err)
}
It would be interesting to know if @klniu made progress with the streaming buffer idea.
After reading Streaming a PDF From the Web to a Mobile or Desktop App, I think that PDF linearization is needed to support your goal. I'm not sure how much work would be needed to rework gofpdf to generate linearized PDFs. Also, I'm not sure if there would be undesirable consequences -- I think there are some post-production tools in use that don't understand the linearized PDF format. I think certain methods like SetPage() and AliasNbPages() would need to be disabled.
I am generating a PDF file with hundreds of pages. It eats up all my memory and freezes my OS. It would be nice if we could stream out the already-written part, or write it to a file and keep appending the later parts as they are produced.
@gypsyfeng
It would be nice if we could stream out the already-written part, or write it to a file and keep appending the later parts as they are produced.
I agree, this would be a very welcome enhancement. One approach would be to use gofpdf to generate a bunch of single page PDFs, and then use a PDF utility like pdftk or qpdf to merge them together. It may be, however, that doing this would simply defer the memory issues to the merging utility.
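A rough illustration of that single-page-then-merge approach, assuming qpdf is installed (page content, file names, and the page count are placeholders, not tested against a workload of this size):

// Generate one small PDF per page, then shell out to qpdf to concatenate them.
var names []string
for i := 1; i <= pageCount; i++ {
	pdf := gofpdf.New("P", "mm", "A4", "")
	pdf.AddPage()
	pdf.SetFont("Helvetica", "", 12)
	pdf.Cell(40, 10, fmt.Sprintf("page %d", i))
	name := fmt.Sprintf("page_%04d.pdf", i)
	if err := pdf.OutputFileAndClose(name); err != nil {
		log.Fatal(err)
	}
	names = append(names, name)
}
// Equivalent to: qpdf --empty --pages page_0001.pdf ... -- merged.pdf
args := append([]string{"--empty", "--pages"}, names...)
args = append(args, "--", "merged.pdf")
if err := exec.Command("qpdf", args...).Run(); err != nil {
	log.Fatal(err)
}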
@chocolatkey
These images are downsized on the fly from very high-resolution sources, and this downsizing is what takes a while
It may be that the downsizing algorithm used by the image package in the Go standard library, or the way gofpdf calls into it, is not as efficient as other utilities. Maybe an independent application could downsize a number of images concurrently, and gofpdf could read the output of that program.
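Staying in-process, a middle ground would be to run the downsizing concurrently and keep only the gofpdf calls sequential. A sketch reusing the names from the extract above (the goroutine fan-out itself is illustrative):

// Downsize all pages concurrently, then register them with gofpdf in order.
bufs := make([]*bytes.Buffer, len(epubFiles))
errs := make([]error, len(epubFiles))
var wg sync.WaitGroup
for i, page := range epubFiles {
	wg.Add(1)
	go func(i int, name string) {
		defer wg.Done()
		bufs[i] = new(bytes.Buffer)
		errs[i] = resizeImage(bufs[i], filepath.Join(basepath, name), destHeight, label)
	}(i, page.Name)
}
wg.Wait()
// The AddPage / RegisterImageOptionsReader loop then runs as before,
// reading from bufs[j] and checking errs[j] instead of resizing inline.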