SimpleIDML

print(xxx.export_xml()) only prints '<Root/>'

Open ZXTFINAL opened this issue 10 months ago • 6 comments

Image

Image

ZXTFINAL avatar Mar 13 '25 09:03 ZXTFINAL

Hello, please provide a bit more detail if you want someone to look into this. And being somewhat polite is a plus. Thanks

Starou avatar Mar 13 '25 11:03 Starou

I'm seeing the same problem, I think. pkg.export_xml() is just '<Root/>\n' and pkg.xml_structure_pretty() is b'<Root Self="di3"/>\n'.

My XML/BackingStory.xml looks like:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<idPkg:BackingStory xmlns:idPkg="http://ns.adobe.com/AdobeInDesign/idml/1.0/packaging" DOMVersion="20.4">
        <XmlStory Self="u91" AppliedTOCStyle="n" UserText="true" IsEndnoteStory="false" TrackChanges="false" StoryTitle="$ID/" AppliedNamedGrid="n">
                <ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/$ID/NormalParagraphStyle">
                        <CharacterStyleRange AppliedCharacterStyle="CharacterStyle/$ID/[No character style]">
                                <Content></Content>
                                <XMLElement Self="di3" MarkupTag="XMLTag/Root" />
                                <Content></Content>
                        </CharacterStyleRange>
                </ParagraphStyleRange>
        </XmlStory>
</idPkg:BackingStory>

That IDML file was freshly exported with latest InDesign 2025.

I don't understand what the code in pkg.xml_structure(self) is supposed to do exactly, but "Discover the XML structure from the story files." suggests that it should create an xml structure with all the stories, ideally in reading order? Well, I have a lot of story XMLs, but they don't turn up here.

I'm also not sure what "Starting at BackingStory.xml where the root-element is expected (because unused)." is about, particularly the "because unused" part - what is unused and why would that make it the root element (that other stories could be discovered from)?

Of course, as usual for Adobe Products, the documentation for IDML is extremely hard to find and bad/outdated, but the IDML Specification from CS6 (the last one they released) says:

XML Folder

The XML folder contains XML elements and settings used in the InDesign document.

The XML elements referred to here are the XML elements that actually appear in the InDesign document (i.e., what you see in the Structure view in the InDesign user interface); not the contents of the XML files in the IDML archive. Though an IDML file is made up of XML, the InDesign document it describes does not necessarily contain XML elements.

BackingStory.xml

The BackingStory.xml file contains the unplaced XML content of the InDesign document (i.e., XML content that has not yet been associated with an element in the layout).

(emphasis mine) If this documentation was correct and still is correct, BackingStory.xml probably shouldn't be used, especially when trying to get the actually used parts of the document?

Or is xml_structure() etc. only about "XML elements that actually appear in the InDesign document" and not about the XML files the IDML is made of? If so, is there a way to access simple_idml.components.Story objects (created from Stories/Story_u2121.xml) from IDMLPackage, ideally with their identifier used in TextFrames (e.g. "u2121" for <TextFrame Self="u211c" ParentStory="u2121" ...)?

Thanks in advance! :)

DanielGibson avatar Aug 13 '25 04:08 DanielGibson

Hi @DanielGibson : What is the Structure of your INDD file in InDesign please?

Image

Starou avatar Aug 15 '25 08:08 Starou

I currently don't have access to InDesign, but IIRC the structure was empty (or only contained Root). The INDD files I'm processing are "normal" InDesign documents, not using XMLElement (except for the minimal BackingStory.xml I posted) or XMLContent.

My use case is basically converting InDesign files to Markdown (for an internal search engine; it turned out that the PDF files generated by InDesign screw up the text order when trying to extract the text from them). So it seemed like the most feasible workflow is converting the existing InDesign INDD files to IDML and then using Python with SimpleIDML (and "manual" lxml) to generate Markdown from that.

Maybe pkg.export_xml() is not meant to be used for my use case but only for XMLElement-based stuff? I don't know, it's pretty confusing that InDesign has this XML-based IDML format and also "XML" features that are (almost) completely independent from that...
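For anyone hitting the same '<Root/>' output, here is a rough sketch of how one might check up front whether a document has any tagged Structure at all. It assumes (this is my reading of the BackingStory.xml posted above, not something the library documents) that a document with a real Structure contains more than the single Root XMLElement. The function name and the second sample are made up for illustration, and it uses the standard library's xml.etree rather than SimpleIDML.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the BackingStory.xml posted in this thread.
UNTAGGED = (
    '<idPkg:BackingStory xmlns:idPkg="http://ns.adobe.com/AdobeInDesign/idml/1.0/packaging">'
    '<XmlStory Self="u91">'
    '<XMLElement Self="di3" MarkupTag="XMLTag/Root" />'
    '</XmlStory></idPkg:BackingStory>'
)

# Hypothetical example of a Root element with one tagged child.
TAGGED = UNTAGGED.replace(
    '<XMLElement Self="di3" MarkupTag="XMLTag/Root" />',
    '<XMLElement Self="di3" MarkupTag="XMLTag/Root">'
    '<XMLElement Self="di3i4" MarkupTag="XMLTag/Story" />'
    '</XMLElement>',
)


def has_tagged_structure(backing_story_xml):
    """Return True if the XML holds more than the lone Root XMLElement."""
    root = ET.fromstring(backing_story_xml)
    # XMLElement nodes carry no namespace; only the idPkg wrapper does.
    return len(list(root.iter("XMLElement"))) > 1


print(has_tagged_structure(UNTAGGED))
print(has_tagged_structure(TAGGED))
```

If this returns False, export_xml() having nothing but the Root element to export would be consistent with what was reported above.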

DanielGibson avatar Sep 03 '25 00:09 DanielGibson

The IDML is a package of XML files representing your binary INDD file in a non-binary manner. The XML Structure of the InDesign file is a logical representation of your document that enables you to do automated actions such as importing content into a template whose structure is known (directly in InDesign or programmatically). But the Structure must be described/created in InDesign first; it cannot be automatically discovered. Without a Structure you will have a hard time getting the logical order of the blocks, since all you have is their x, y position.

Maybe converting PDF to txt or md is still your best bet?

Starou avatar Sep 03 '25 07:09 Starou

Thanks for the clarification!

Converting IDML to Markdown actually works relatively well (much better than trying to parse PDF at least - and I tried lots of parsers, including super-slow ones using "AI"), I just have to do a bit more "manually" than I originally assumed, particularly opening Story XMLs (based on <TextFrame ... ParentStory="asdf" ...> where "asdf" is the Story ID), like

from lxml import etree

def get_story_xml(idml_pkg, story_id):
	# Story files live at Stories/Story_<id>.xml inside the IDML archive
	story_name = f"Stories/Story_{story_id}.xml"
	with idml_pkg.open(story_name) as f:
		return etree.parse(f)

which is simple enough, but I assumed SimpleIDML would provide this functionality by itself, but it seems like all its story-handling assumes using XMLElement etc.
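In case it helps someone else, the ParentStory lookup itself can be sketched roughly like this. This is not SimpleIDML API: it treats the IDML as a plain zip archive via the standard library's zipfile and xml.etree, and it assumes spreads are stored as Spreads/*.xml inside the archive. The function names, the demo archive and the spread ID "ub6" are made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Minimal made-up spread containing one text frame.
SPREAD_XML = (
    '<idPkg:Spread xmlns:idPkg="http://ns.adobe.com/AdobeInDesign/idml/1.0/packaging">'
    '<Spread Self="ub6">'
    '<TextFrame Self="u211c" ParentStory="u2121"/>'
    '</Spread></idPkg:Spread>'
)


def frame_to_story_ids(idml_zip):
    """Map each TextFrame's Self id to its ParentStory id, across all spreads."""
    mapping = {}
    for name in idml_zip.namelist():
        if not (name.startswith("Spreads/") and name.endswith(".xml")):
            continue
        with idml_zip.open(name) as f:
            root = ET.parse(f).getroot()
        # iter() also finds frames nested inside groups.
        for frame in root.iter("TextFrame"):
            mapping[frame.get("Self")] = frame.get("ParentStory")
    return mapping


def build_demo_idml():
    """Build a tiny in-memory archive that mimics the IDML folder layout."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("Spreads/Spread_ub6.xml", SPREAD_XML)
        zf.writestr("Stories/Story_u2121.xml", "<Story/>")
    buf.seek(0)
    return zipfile.ZipFile(buf)


demo = build_demo_idml()
print(frame_to_story_ids(demo))
```

Threaded text frames share one ParentStory, so several frame IDs can map to the same story; deduplicate the values before opening the story files.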

TBH I even hoped that SimpleIDML would somehow create one big XML-tree (or e-tree) that somehow inlines the XMLs as appropriate, e.g. by putting the stuff from story XMLs under the <TextFrame> nodes that reference them, and that export_xml() would be the function to do this. But now that I know a bit more about IDML I'm not so sure if that would make sense anyway..

And of course most of the work is not opening XML files but parsing them, trying to detect headings based on the paragraph style name and things like that... it's all still messy, but much better than parsing PDF - at least I now get the main article text in one piece and in the right order, and tables are also looking pretty good. PDF parsers got confused by articles using multiple columns, by image captions and more, mixing all that text together in the wrong order..
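To give an idea of the heading detection, here is a heavily simplified sketch, not my actual converter. The style names in HEADING_STYLES are hypothetical and will differ per document, it uses the standard library's xml.etree instead of lxml, and it ignores tables, nested frames and character styles entirely.

```python
import xml.etree.ElementTree as ET

# Hypothetical paragraph style names mapped to Markdown prefixes.
HEADING_STYLES = {
    "ParagraphStyle/Heading1": "# ",
    "ParagraphStyle/Heading2": "## ",
}

SAMPLE_STORY = (
    '<idPkg:Story xmlns:idPkg="http://ns.adobe.com/AdobeInDesign/idml/1.0/packaging">'
    '<Story Self="u2121">'
    '<ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/Heading1">'
    '<CharacterStyleRange><Content>Title</Content><Br/></CharacterStyleRange>'
    '</ParagraphStyleRange>'
    '<ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/Body">'
    '<CharacterStyleRange><Content>First.</Content><Br/>'
    '<Content>Second.</Content></CharacterStyleRange>'
    '</ParagraphStyleRange>'
    '</Story></idPkg:Story>'
)


def story_to_markdown(story_root):
    """Turn ParagraphStyleRange elements into Markdown paragraphs and headings."""
    lines = []
    for psr in story_root.iter("ParagraphStyleRange"):
        prefix = HEADING_STYLES.get(psr.get("AppliedParagraphStyle"), "")
        text = ""
        for node in psr.iter():
            if node.tag == "Content":
                text += node.text or ""
            elif node.tag == "Br":
                # <Br/> marks a paragraph break inside a style range.
                text += "\n"
        for paragraph in text.split("\n"):
            if paragraph.strip():
                lines.append(prefix + paragraph.strip())
    return "\n\n".join(lines)


print(story_to_markdown(ET.fromstring(SAMPLE_STORY)))
```

One ParagraphStyleRange can hold several paragraphs separated by <Br/>, which is why the text is split again after collecting it.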

DanielGibson avatar Sep 09 '25 00:09 DanielGibson