[Bug]: `Extract Pages` produces an output PDF of nearly the same size
The Problem
I have a 34 MB PDF that contains 256 pages. I extracted a single page from it, and the output PDF was almost the same size. I tested this on two PDFs: for one, the output was the same size as the input; for the other, it was about half the size (17 MB of 31 MB).
Version of Stirling-PDF
0.26.1
Last Working Version of Stirling-PDF
No response
Page Where the Problem Occurred
http://localhost:8080/extract-page
Docker Configuration
version: '3.3'
services:
  stirling-pdf:
    image: frooodle/s-pdf:latest-ultra-lite
    ports:
      - '8080:8080'
    volumes:
      - ./trainingData:/usr/share/tessdata #Required for extra OCR languages
#      - ./extraConfigs:/configs
#      - ./customFiles:/customFiles/
#      - ./logs:/logs/
    environment:
      - DOCKER_ENABLE_SECURITY=false
      - INSTALL_BOOK_AND_ADVANCED_HTML_OPS=false
      - LANGS=en_GB
Relevant Log Output
No response
Additional Information
To work around this, I used http://localhost:8080/split-pdfs
Browsers Affected
No response
No Duplicate of the Issue
- [X] I have verified that there are no existing issues raised related to my problem.
After doing more research on this, it seems quite common, due to fonts, metadata, etc. I would be curious to see how other tools perform, to find out whether we need to make changes on our side.
And yet the 'Split' function produces multiple files of the expected size (the sizes sum to roughly the size of the original file).
Could you maybe reuse its implementation for 'Extract Pages'? (And 'Remove Pages')
Hi,
So I tested with this sample: image-doc.pdf
When I extract the first page, the online debugger gives the following information:
When you split:
TLDR:
Extract Pages:
- Attempts to remove pages
- Tries to clean up unused resources (fonts, images, metadata), but
- Some PDFs use a global /Resources dictionary shared by all pages, so everything in that /Resources dictionary stays in the document even if it does not appear on the chosen page.
Split PDF:
- Creates brand new, clean PDF documents from scratch (you can clearly see this from the CreationDate metadata, and creator metadata)
- Only includes the resources actually needed for the pages being split
- No attempt to modify the original document (which is very hard to get right)
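To make the difference between the two lists above concrete, here is a minimal toy model in Python. It is not Stirling-PDF's code and uses no PDF library: `Page`, `Document`, `extract_by_removal`, and `extract_by_rebuild` are hypothetical names made up for illustration, and resources are reduced to a name and a byte count. The only assumption taken from the discussion is that all pages point at one shared /Resources dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Page:
    uses: Set[str]              # resource names the page content actually references
    resources: Dict[str, int]   # resource name -> size in bytes (may be a shared dict)


@dataclass
class Document:
    pages: List[Page] = field(default_factory=list)

    def size(self) -> int:
        # A shared resource dictionary is stored once, so count each distinct dict once.
        distinct = {id(p.resources): p.resources for p in self.pages}
        return sum(sum(res.values()) for res in distinct.values())


def extract_by_removal(doc: Document, keep: List[int]) -> Document:
    # "Extract Pages" style (toy version): drop the other pages, but the kept
    # page still points at the original shared resource dictionary.
    return Document([doc.pages[i] for i in keep])


def extract_by_rebuild(doc: Document, keep: List[int]) -> Document:
    # "Split" style (toy version): build a new document and copy only the
    # resources that each kept page actually uses.
    new_pages = []
    for i in keep:
        page = doc.pages[i]
        needed = {name: size for name, size in page.resources.items()
                  if name in page.uses}
        new_pages.append(Page(set(page.uses), needed))
    return Document(new_pages)


# Three pages, one image each, all sharing a single global resource dictionary.
shared = {"Im1": 1000, "Im2": 1000, "Im3": 1000}
doc = Document([Page({"Im1"}, shared), Page({"Im2"}, shared), Page({"Im3"}, shared)])

print(doc.size())                            # 3000
print(extract_by_removal(doc, [0]).size())   # 3000: shared dictionary kept whole
print(extract_by_rebuild(doc, [0]).size())   # 1000: only Im1 is copied
```

In this model, the removal approach keeps the full 3000 bytes for a single page, while the rebuild approach keeps only the 1000 bytes that page needs. Real PDFs are more involved (nested XObjects, fonts, inherited resources), so this only shows why the two features can produce such different sizes.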
My hands are full right now, but this is a really good first issue; it should be rather trivial.
Hi! I will start working on this. Can you assign it to me? @Frooodle @balazs-szucs
@OUNZAR-Aymane
I think you can definitely take this. I am just a contributor, so I can't actually assign this to you, but judging by the fact that nobody else is fighting you for it, I think you are fine :)