pypdf
Support for Optional Content Groups
PyPDF2 does not currently have any support for Optional Content Groups (OCGs). When merging multiple documents into a single document, the layers are effectively flattened and the layer functionality is lost.
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf 4.10.2 Making Graphical Content Optional
Thanks - We definitely need support for layered PDFs to display correctly (and possibly support for adding/removing layers?).
It looks like the OCG settings are stored in the 'OCProperties' dictionary in 'Root' - see dump below.
The problem here is that it uses IndirectObjects, which don't necessarily have the same ID in the output PDF as in the input PDF once the page is appended. How do we get the ID of the corresponding new object in the output PDF?
{'/OCProperties': {'/D': {'/ListMode': '/VisiblePages',
                          '/Locked': [IndirectObject(8, 0),
                                      IndirectObject(9, 0)],
                          '/OFF': [IndirectObject(11, 0),
                                   IndirectObject(12, 0)],
                          '/Order': [IndirectObject(1, 0),
                                     [IndirectObject(2, 0),
                                      IndirectObject(3, 0),
                                      IndirectObject(4, 0)],
                                     [u'PDF Drawing Layer',
                                      IndirectObject(5, 0),
                                      IndirectObject(6, 0),
                                      IndirectObject(7, 0),
                                      IndirectObject(8, 0),
                                      IndirectObject(9, 0)],
                                     IndirectObject(10, 0),
                                     [IndirectObject(11, 0),
                                      IndirectObject(12, 0),
                                      IndirectObject(13, 0)]],
                          '/RBGroups': [[IndirectObject(11, 0),
                                         IndirectObject(12, 0),
                                         IndirectObject(13, 0)]]},
                   '/OCGs': [IndirectObject(7, 0),
                             IndirectObject(3, 0),
                             IndirectObject(9, 0),
                             IndirectObject(11, 0),
                             IndirectObject(1, 0),
                             IndirectObject(8, 0),
                             IndirectObject(6, 0),
                             IndirectObject(4, 0),
                             IndirectObject(12, 0),
                             IndirectObject(2, 0),
                             IndirectObject(10, 0),
                             IndirectObject(13, 0),
                             IndirectObject(5, 0)]},
 '/OpenAction': {'/D': [IndirectObject(25, 0), '/Fit'], '/S': '/GoTo'},
 '/PageLayout': '/SinglePage',
 '/PageMode': '/UseOC',
 '/Pages': IndirectObject(24, 0),
 '/Type': '/Catalog',
 '/ViewerPreferences': {'/NonFullScreenPageMode': '/UseNone'}}
Screenshot below of the PDF in Acrobat Reader (on Linux) that was used for the dump above.

How did you get the dumped structure?
@emmama1234 I've got the dumped structure like this:
from PyPDF2 import PdfFileReader
# open() replaces the Python 2-only file() builtin
reader = PdfFileReader(open('test.pdf', 'rb'))
reader.trailer['/Root']['/OCProperties']
@snorfalorpagus did you find any way to add/remove OCG layers in a multi-page PDF with PyPDF2?
I didn't get any further than viewing the data as posted above. :(
@snorfalorpagus Thanks! I'll take a look at it to see if I can find something.
As this feature request didn't receive an update for a long time, I'm closing it.
I'm linking it in https://github.com/py-pdf/PyPDF2/discussions/1181 so that we don't forget about it. Please feel free to add more information (PDFs that use it; other projects that implement it; explanations how it would improve PyPDF2)
@snorfalorpagus @emmama1234 Hi, I know it's been a while but did either of you ever get further with this? I'm working on the same thing and used reader.trailer['/Root']['/OCProperties'] and reader._get_object to get all the direct object references. Then re-mapped them to the writer using writer._add_object and writer._root_object['/OCProperties']. The pdf still has trouble opening but I feel like I'm close. Do any of you have any suggestions? I can share the python code too if that helps.
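The re-mapping step described here can be modelled roughly as follows. This is only a sketch in plain Python, not pypdf's API: `Ref`, `remap`, `build_id_map` and the starting ID of 40 are hypothetical stand-ins for indirect references and for whatever object numbers the writer would hand out. The point it tries to illustrate is that each source object should map to exactly one new reference, and that the same map should be applied everywhere the object is referenced (/OCGs, /Order, /OFF and so on).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference (object number, generation)."""
    idnum: int
    generation: int = 0


def build_id_map(source_refs, first_free_id):
    """Assign each distinct source reference one new, consecutive ID."""
    id_map = {}
    for ref in source_refs:
        if ref not in id_map:
            id_map[ref] = Ref(first_free_id + len(id_map))
    return id_map


def remap(value, id_map):
    """Return a copy of value with every Ref translated through id_map.

    Lists and dicts are walked recursively, so nested /Order arrays are
    covered; other values (such as a group label string) pass through.
    """
    if isinstance(value, Ref):
        return id_map[value]
    if isinstance(value, list):
        return [remap(item, id_map) for item in value]
    if isinstance(value, dict):
        return {key: remap(item, id_map) for key, item in value.items()}
    return value


# A cut-down model of the /OCProperties dump shown earlier in the thread.
oc_properties = {
    "/OCGs": [Ref(7), Ref(3), Ref(9)],
    "/D": {
        "/Order": [Ref(3), ["PDF Drawing Layer", Ref(7), Ref(9)]],
        "/OFF": [Ref(9)],
    },
}

# 40 is an assumed "next free object number" in the output file.
id_map = build_id_map(oc_properties["/OCGs"], first_free_id=40)
remapped = remap(oc_properties, id_map)
```

In a real file, the same map would presumably also have to be applied to the references that page resources hold to these groups, otherwise the page content would still point at the old objects.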
@MartinThoma would love for this to be implemented in pypdf.
You are of course always invited to provide a corresponding PR to add such support.
@stefan6419846 Hi, I saw it was linked in https://github.com/py-pdf/pypdf/discussions/1181. My code does not work yet and I am still stuck. The output PDF still has no layer information and is blank.
Here's what I am doing so far:
import pypdf
from pypdf.generic import ArrayObject, DictionaryObject, NameObject


def get_ocgs_direct(reader):
    ocgs_props = DictionaryObject({})
    if "/OCProperties" in reader.trailer["/Root"]:
        ocgs_props = reader.root_object["/OCProperties"]
    if (len(ocgs_props) > 0):
        # get direct objects for ocgs
        for i, indirect in enumerate(ocgs_props["/OCGs"]):
            pdfobject = reader.get_object(indirect)
            ocgs_props[NameObject("/OCGs")][i] = pdfobject
        # repeat for order
        for i, indirect in enumerate(ocgs_props["/D"]["/Order"]):
            if isinstance(indirect, pypdf.generic._data_structures.ArrayObject):
                # nested lists to resolve
                arr = ArrayObject([reader.get_object(indirect[0])])
                arr.append(ArrayObject([reader.get_object(indirect[1][0])]))
                ocgs_props[NameObject("/D")][NameObject("/Order")][i] = ArrayObject(arr)
            else:
                pdfobject = reader.get_object(indirect)
                ocgs_props[NameObject("/D")][NameObject("/Order")][i] = pdfobject
    return ocgs_props


def set_ocgs_direct(writer, ocgs_direct):
    # re-reference OCGs to writer pdf using add_object
    for i, direct in enumerate(ocgs_direct["/OCGs"]):
        indirectobject = writer._add_object(DictionaryObject(direct))  # find out name object type
        ocgs_direct[NameObject("/OCGs")][i] = indirectobject
    # should update [/d][/order] already.
    for i, direct in enumerate(ocgs_direct["/D"]["/Order"]):
        if isinstance(direct, pypdf.generic._data_structures.ArrayObject):
            # nested lists to resolve
            direct[0] = writer._add_object(DictionaryObject(direct[0]))
            direct[1][0] = writer._add_object(DictionaryObject(direct[1][0]))
        else:
            indirect = writer._add_object(DictionaryObject(direct))
            ocgs_direct[NameObject("/D")][NameObject("/Order")][i] = indirect
    if "/OCProperties" in writer.root_object.keys():
        writer.root_object[NameObject("/OCProperties")].update(ocgs_direct)
    else:
        writer._root_object[NameObject("/OCProperties")] = DictionaryObject(ocgs_direct)
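The nested-list handling above assumes one fixed shape for /Order (an entry at index 0 and a single nested entry at index 1), whereas the dump earlier in the thread shows groups of varying length and a label string inside one of them. A recursive walk is one way this could be generalised. The sketch below uses plain Python lists and integers as stand-ins for the array and its references, and `walk_order` is a hypothetical helper name, not part of pypdf. Since pypdf's ArrayObject subclasses list, the same `isinstance(entry, list)` check should apply there too, but that is untested here.

```python
def walk_order(order, depth=0):
    """Yield (depth, entry) for every non-list entry of a nested /Order array."""
    for entry in order:
        if isinstance(entry, list):
            # A nested array is a sub-group; descend one level.
            yield from walk_order(entry, depth + 1)
        else:
            # Either a reference to an OCG or a label string for the group.
            yield depth, entry


# Shape taken from the dump above, with integers standing in for IndirectObject(n, 0).
order = [1, [2, 3, 4], ["PDF Drawing Layer", 5, 6], 10]
entries = list(walk_order(order))
```

Each yielded entry could then be resolved or re-added once, instead of indexing into fixed positions.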
@mmalik1234 I am going to re-open this issue for now as there seems to be further interest, although this will probably only be supported if it is contributed as a PR and is easy enough to maintain. I cannot help much with this, as I have no use case which would use OCGs.
Having a quick look at your code, you probably should not have to use the full class path for ArrayObject in the isinstance (although only related to style). Additionally, the last condition looks wrong with the backslash as key.
@stefan6419846 Thanks. This comment from 1181 shows my use case. https://github.com/py-pdf/pypdf/discussions/1181#discussioncomment-3408544
My use case is that I set page numbers or other overlay information on each page. I want to put such overlays on a named layer for easy identification, and to be able to enable or disable viewing and printing of them.
Specifically, I want to be able to identify the layer so that I can remove the overlay again, e.g. inserting new page numbers after removing the old ones.
Thus, I need to be able to add the layer and ensure I can draw overlays on it.
Another way to identify and remove an overlay might also work, but layers/OCGs are the current idea.
I would like this to happen! I added an issue in pdfly before figuring out it needs support here first. https://github.com/py-pdf/pdfly/issues/190
My use case has to do with the PDF exports from AutoCAD which put each CAD layer (windows, doors, walls) on separate PDF layers / OCGs, and I would like the ability to work with them programmatically: toggle on/off, remove, add, copy to a new file, etc.
pymupdf has the ability to toggle visibility, but I'm not sure about the rest:
import pymupdf
doc = pymupdf.open("file.pdf")
ocgs = doc.get_ocgs()  # dict keyed by the xref of each OCG
lyr_xref = next(iter(ocgs))  # xref of the first layer
doc.set_layer(-1, on=[lyr_xref], basestate="OFF") # turn off all but the first layer
In pypdf I can retrieve these objects pretty easily, but how can I actually do things with them?
import pypdf
doc = pypdf.PdfReader("file.pdf")
ocgs = doc.root_object.get_object()["/OCProperties"]["/OCGs"]
xrefs = [ocg.idnum for ocg in ocgs]
layers = [doc.get_object(xref)['/Name'] for xref in xrefs]
objs = [doc.get_object(xref) for xref in xrefs]
In pypdf, you mostly need to know the official PDF specification and understand it to work with this, although it always is hard to help without an actual example file.
Section 8.11 of the PDF 2.0 specification defines the necessary aspects. In the simplest case, the visibility of an optional content membership dictionary (OCMD) is defined by its visibility policy key /P, which is set to one of /AllOn, /AnyOn, /AnyOff or /AllOff. This can be overridden by the /VE key, which defines a more complex visibility expression.
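To make the four /P values concrete, here is a small sketch of how they could be evaluated. It is plain Python and not part of pypdf: `ocmd_visible` is a hypothetical helper, the OCG states are modelled as booleans (True meaning ON), and /VE expressions are not handled. The /AnyOn default and the treatment of an empty group list follow my reading of the specification.

```python
def ocmd_visible(ocg_states, policy="/AnyOn"):
    """Evaluate the /P visibility policy of an OCMD.

    ocg_states: list of booleans, True meaning the referenced OCG is ON.
    policy: one of "/AllOn", "/AnyOn", "/AnyOff", "/AllOff".
    """
    if not ocg_states:
        # Assumption: with no groups listed, the OCMD hides nothing.
        return True
    if policy == "/AllOn":
        return all(ocg_states)
    if policy == "/AnyOn":
        return any(ocg_states)
    if policy == "/AnyOff":
        return not all(ocg_states)
    if policy == "/AllOff":
        return not any(ocg_states)
    raise ValueError(f"unknown visibility policy: {policy!r}")


# One group ON and one OFF: visible under /AnyOn and /AnyOff only.
mixed = [True, False]
results = {p: ocmd_visible(mixed, p) for p in ("/AllOn", "/AnyOn", "/AnyOff", "/AllOff")}
```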
At the moment, every action should be possible in theory if you use the low-level interface where necessary. The main issue might be that we do not have a high-level interface at the moment.
@stefan6419846 Thanks! I'll look into the API and see what I can find.
Here's a sample file: building001-0_floor1.pdf
You can find other CAD examples here and export to PDF with TrueView (gratis) to see the layers situation.
Just getting my bearings in the codebase and the reference spec.... Could you perhaps point me to a basic example of what you mean by the "low-level interface?"
In constants.py I see the implementations of various spec tables, am I correct that Section 8.11 would need to be added here? For example, I'm not seeing an implementation of Table 97 for the OCG visibility policy, just OC_PROPERTIES in CatalogDictionary.
For example, given that you have an OCMD (obj['/Type'] == '/OCMD'), you could access the policy with obj['/P'] and update it with obj[NameObject('/P')] = NameObject('/AllOn').
am I correct that Section 8.11 would need to be added here?
Yes, constants/names would be added there, but it is not strictly necessary for the basic usage. If we add proper support for OCGs to pypdf, we would have to define a suitable interface first anyway.
You can find other CAD examples here and export to PDF with TrueView (gratis) to see the layers situation.
I do not have a suitable Windows machine available, and for various reasons I am most likely not going to implement the new functionality myself anyway. Still, the examples might help interested community members to look into it and evaluate implementations.