[Bug]: `Extract Pages` produces an output PDF of the same size

Open KAGEYAM4 opened this issue 1 year ago • 1 comments

The Problem

I have a 34 MB PDF that contains 256 pages. I extracted only a single page from it, and the output PDF was almost the same size. I tested this on two PDFs: for one, the output was the same size as the input; for the other, it was about half the size (17 MB of 31 MB).

Version of Stirling-PDF

0.26.1

Last Working Version of Stirling-PDF

No response

Page Where the Problem Occurred

http://localhost:8080/extract-page

Docker Configuration

version: '3.3'
services:
  stirling-pdf:
    image: frooodle/s-pdf:latest-ultra-lite
    ports:
      - '8080:8080'
    volumes:
      - ./trainingData:/usr/share/tessdata #Required for extra OCR languages
#      - ./extraConfigs:/configs
#      - ./customFiles:/customFiles/
#      - ./logs:/logs/
    environment:
      - DOCKER_ENABLE_SECURITY=false
      - INSTALL_BOOK_AND_ADVANCED_HTML_OPS=false
      - LANGS=en_GB

Relevant Log Output

No response

Additional Information

To work around this, I used http://localhost:8080/split-pdfs

Browsers Affected

No response

No Duplicate of the Issue

  • [X] I have verified that there are no existing issues raised related to my problem.

KAGEYAM4 avatar Jun 17 '24 07:06 KAGEYAM4

After doing more research on this, it seems quite common, due to fonts, metadata, etc. I would be curious to see how other tools perform, to see whether we need to make some changes on our side.

Frooodle avatar Aug 03 '24 15:08 Frooodle

And yet the 'Split' function produces multiple files of the expected size (their sizes sum to roughly the size of the original file).

Could you maybe reuse its implementation for 'Extract Pages'? (And 'Remove Pages')

d-patyk avatar Sep 20 '25 12:09 d-patyk

Hi,

So I tested with this sample: image-doc.pdf

When I extract the first page, the online debugger gives the following information:

Image

When you split:

Image

TLDR:

Extract Pages:

  • Attempts to remove pages
  • Tries to clean up unused resources (fonts, images, metadata), but:
    • Some PDFs use a global /Resources dictionary, one shared by all pages. As a result, everything in that /Resources dictionary stays in the document, even resources that don't appear on the chosen page.

Split PDF:

  • Creates brand new, clean PDF documents from scratch (you can clearly see this from the CreationDate metadata, and creator metadata)
  • Only includes the resources actually needed for the pages being split
  • No attempt to modify the original document (which is very hard to get right)
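
The difference between the two approaches can be illustrated with a toy model. This is only a sketch under stated assumptions: it is not Stirling-PDF's code and does not use any real PDF library. `Page`, `Document`, `extract_in_place`, and `rebuild_from_scratch` are hypothetical names, and "size" is just the sum of the bytes held in the shared resources dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    # Names of the resources this page actually uses (e.g. "F1", "Im0").
    used: set


@dataclass
class Document:
    pages: list
    # Toy stand-in for a global /Resources dictionary: name -> size in bytes.
    shared_resources: dict = field(default_factory=dict)

    def size(self) -> int:
        # Assumption: document size is dominated by its resources.
        return sum(self.shared_resources.values())


def extract_in_place(doc: Document, keep: list) -> Document:
    """Toy model of 'Extract Pages': drop pages, keep the shared dictionary as-is."""
    return Document(
        pages=[doc.pages[i] for i in keep],
        shared_resources=dict(doc.shared_resources),
    )


def rebuild_from_scratch(doc: Document, keep: list) -> Document:
    """Toy model of 'Split': build a new document with only the needed resources."""
    pages = [doc.pages[i] for i in keep]
    needed = set().union(*(p.used for p in pages))
    return Document(
        pages=pages,
        shared_resources={k: v for k, v in doc.shared_resources.items() if k in needed},
    )


# Hypothetical 4-page document: one shared font plus one image per page.
resources = {"F1": 50}
resources.update({f"Im{i}": 1000 for i in range(4)})
doc = Document(pages=[Page({"F1", f"Im{i}"}) for i in range(4)], shared_resources=resources)

print(doc.size())                             # 4050
print(extract_in_place(doc, [0]).size())      # 4050, nothing was pruned
print(rebuild_from_scratch(doc, [0]).size())  # 1050, only F1 and Im0 remain
```

In this model, removing pages alone never shrinks the output, because the unused images stay reachable through the shared dictionary; rebuilding copies only what the kept pages reference.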

My hands are full right now, but this is a really good first issue; it should be rather trivial.

balazs-szucs avatar Sep 23 '25 17:09 balazs-szucs

Hi! I will start working on this. Can you assign it to me? @Frooodle @balazs-szucs

OUNZAR-Aymane avatar Sep 28 '25 15:09 OUNZAR-Aymane

@OUNZAR-Aymane

I think you can definitely take this. I am just a contributor, so I can't actually assign it to you, but judging by the fact that nobody is fighting you for it, I think you are fine :)

balazs-szucs avatar Oct 01 '25 15:10 balazs-szucs