pydocx
Incorrect parsing of list items - missing tabs/spaces.
Hi.
I have a file like this:
After converting it to HTML, we get:
As you can see, the subsections are not properly formatted. If you can point me to where to look for this issue, I can submit a pull request to fix it.
Thx a lot.
Before:
After:
It looks like there is one definite bug, along with some confusion about the styling.
"Gather Items for Re-pricing" should definitely be in the same list as "Prepare your markdown gun" and it's not obvious to me why it isn't.
The first step will be adding a fixtures testcase by adding both a .docx and .html file in the fixtures directory. That will let us define the input and then the expected output.
If anyone could help with that part, it would be appreciated. From there, someone will need to dive into the OOXML in the .docx to figure out why we're parsing it as separate lists instead of one list.
I dove in and took a look at the OOXML for this. I've added the fixtures as well.
It looks like what's happening is that it's being considered three different lists because the bulleted list is breaking up the numeric list.
Here is the simplified relevant document.xml OOXML:
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>one</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>two</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>three</w:t></w:r></w:p>
<w:p w:rsidP="007F6A48"><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="5"/></w:numPr><w:tabs><w:tab w:val="clear" w:pos="709"/></w:tabs></w:pPr><w:r><w:t>AAA</w:t></w:r></w:p>
<w:p w:rsidP="007F6A48"><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="5"/></w:numPr><w:tabs><w:tab w:val="clear" w:pos="709"/></w:tabs></w:pPr><w:r><w:t>BBB</w:t></w:r></w:p>
<w:p w:rsidP="007F6A48"><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="5"/></w:numPr><w:tabs><w:tab w:val="clear" w:pos="709"/></w:tabs></w:pPr><w:r><w:t>CCC</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="2"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>alpha</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>four</w:t></w:r></w:p>
<w:p/>
<w:p/>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="2"/></w:numPr></w:pPr><w:r><w:t>xxx</w:t></w:r></w:p>
<w:p w:rsidP="007F6A48"><w:pPr><w:numPr><w:ilvl w:val="1"/><w:numId w:val="6"/></w:numPr></w:pPr><w:r><w:t>yyy</w:t></w:r></w:p>
<w:p/>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="3"/></w:numPr></w:pPr><w:r><w:t>www</w:t></w:r></w:p>
<w:p w:rsidP="007F6A48"><w:pPr><w:numPr><w:ilvl w:val="1"/><w:numId w:val="7"/></w:numPr></w:pPr><w:r><w:t>zzz</w:t></w:r></w:p>
The full document.xml OOXML is beautified in this gist: https://gist.github.com/jhubert/29f7899073b765e74297
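To make the splitting easier to see, here is a minimal Python sketch that reads a shortened version of the paragraphs above and groups consecutive items by their `w:numId`. This is not pydocx's actual code; `list_items` and `naive_runs` are hypothetical helper names, and the grouping rule is only an assumption about how a parser could end up with separate lists.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Shortened copy of the paragraphs above, wrapped in a root element
# so that the w: prefix is declared.
SAMPLE = f"""<w:body xmlns:w="{W}">
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>three</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="5"/></w:numPr></w:pPr><w:r><w:t>AAA</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="2"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>alpha</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>four</w:t></w:r></w:p>
</w:body>"""


def list_items(xml_text):
    """Return (numId, ilvl, text) for every paragraph that has a w:numPr."""
    root = ET.fromstring(xml_text)
    items = []
    for p in root.iterfind("w:p", NS):
        num_pr = p.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            continue  # plain paragraph, not a list item
        num_id = num_pr.find("w:numId", NS).get(f"{{{W}}}val")
        ilvl = num_pr.find("w:ilvl", NS).get(f"{{{W}}}val")
        text = "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
        items.append((num_id, int(ilvl), text))
    return items


def naive_runs(items):
    """Group consecutive items sharing a numId (the assumed splitting rule)."""
    return [
        (num_id, [text for _, _, text in run])
        for num_id, run in groupby(items, key=lambda item: item[0])
    ]


for num_id, texts in naive_runs(list_items(SAMPLE)):
    print(num_id, texts)
```

Under that assumption, the `numId` 1 items come out as two runs with the `numId` 5 run between them, which matches the three separate lists seen in the HTML output.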
Can you include numbering.xml and styles.xml as well?
Of course. Gist update: https://gist.github.com/jhubert/29f7899073b765e74297
Also, here is the docx file: nested_multitype_lists.docx
@kylegibson I'm about to work on this issue. Have you already started?
Hi Jeremy. None of us have started work on this issue. I expect it will be a while before we have time to dedicate to fixing this. We'll be happy to review any PRs that you submit!
Awesome. Good to know. We'll get a PR together. :)
The fix that @botzill put together in #225 is now live in production. No issues so far. 💯