pydocx

Incorrect parsing of list items - missing tabs/spaces.

Open botzill opened this issue 9 years ago • 9 comments

Hi.

I have this file:

subsections_format.docx

After converting it to HTML, we get:

(screenshot of the converted HTML output)

As you can see, the subsections are not properly formatted. If you can point me to where to look for this issue, I can submit a pull request to fix it.

Thx a lot.

botzill avatar Feb 01 '16 11:02 botzill

Before:

image

After:

image

jhubert avatar Mar 13 '16 07:03 jhubert

It looks like there is one definite bug, along with some confusion about the styling.

"Gather Items for Re-pricing" should definitely be in the same list as "Prepare your markdown gun" and it's not obvious to me why it isn't.

The first step will be adding a fixture test case: a .docx file and a matching .html file in the fixtures directory. That will let us define the input and the expected output.

If anyone could help with that part, it would be appreciated. From there, someone will need to dive into the OOXML in the .docx to figure out why we're parsing it as separate lists instead of one list.
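For anyone picking this up, here is a minimal sketch of what a fixture comparison could look like. It is not pydocx's actual test harness: `normalize_html`, `check_fixture`, and the directory layout are hypothetical names used for illustration, and the converter is passed in as a callable so the helper runs without pydocx installed. With pydocx available, the callable could be `PyDocX.to_html`.

```python
import re
from pathlib import Path


def normalize_html(html):
    """Drop whitespace between tags and collapse other whitespace runs,
    so indentation differences in the fixture do not cause a mismatch."""
    html = re.sub(r">\s+<", "><", html.strip())
    return re.sub(r"\s+", " ", html)


def check_fixture(convert, fixtures_dir, name):
    """Convert <name>.docx with `convert` and compare it to <name>.html.

    `convert` is any callable taking a .docx path and returning an HTML
    string (an assumption of this sketch, not a pydocx requirement).
    """
    fixtures_dir = Path(fixtures_dir)
    actual = convert(str(fixtures_dir / f"{name}.docx"))
    expected = (fixtures_dir / f"{name}.html").read_text(encoding="utf-8")
    return normalize_html(actual) == normalize_html(expected)
```

Assuming the fixture pair lives in a directory such as `fixtures/`, the check could be called as `check_fixture(PyDocX.to_html, "fixtures", "nested_multitype_lists")`.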

winhamwr avatar Mar 14 '16 16:03 winhamwr

I dove in and took a look at the OOXML for this. I've added the fixtures as well.

It looks like the content is being treated as three different lists because the bulleted list breaks up the numbered list.

image

Here is the simplified relevant document.xml OOXML:

<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>one</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>two</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>three</w:t></w:r></w:p>
<w:p w:rsidP="007F6A48"><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="5"/></w:numPr><w:tabs><w:tab w:val="clear" w:pos="709"/></w:tabs></w:pPr><w:r><w:t>AAA</w:t></w:r></w:p>
<w:p w:rsidP="007F6A48"><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="5"/></w:numPr><w:tabs><w:tab w:val="clear" w:pos="709"/></w:tabs></w:pPr><w:r><w:t>BBB</w:t></w:r></w:p>
<w:p w:rsidP="007F6A48"><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="5"/></w:numPr><w:tabs><w:tab w:val="clear" w:pos="709"/></w:tabs></w:pPr><w:r><w:t>CCC</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="2"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>alpha</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>four</w:t></w:r></w:p>
<w:p/>
<w:p/>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="2"/></w:numPr></w:pPr><w:r><w:t>xxx</w:t></w:r></w:p>
<w:p w:rsidP="007F6A48"><w:pPr><w:numPr><w:ilvl w:val="1"/><w:numId w:val="6"/></w:numPr></w:pPr><w:r><w:t>yyy</w:t></w:r></w:p>
<w:p/>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="3"/></w:numPr></w:pPr><w:r><w:t>www</w:t></w:r></w:p>
<w:p w:rsidP="007F6A48"><w:pPr><w:numPr><w:ilvl w:val="1"/><w:numId w:val="7"/></w:numPr></w:pPr><w:r><w:t>zzz</w:t></w:r></w:p>

The full document.xml OOXML is beautified in this gist: https://gist.github.com/jhubert/29f7899073b765e74297
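The split described above can be reproduced with a small standalone sketch. This is not pydocx's parsing code; it only illustrates the assumption that a parser which starts a new list whenever `w:numId` changes between adjacent paragraphs will see three lists here. The helper names (`para`, `list_items`, `group_by_consecutive_num_id`) are made up for this example, and the `w:tabs` elements are left out for brevity.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"  # namespaced name of the w:val attribute


def para(text, ilvl, num_id):
    """Build one list paragraph shaped like the simplified OOXML above."""
    return (
        f'<w:p><w:pPr><w:numPr><w:ilvl w:val="{ilvl}"/>'
        f'<w:numId w:val="{num_id}"/></w:numPr></w:pPr>'
        f"<w:r><w:t>{text}</w:t></w:r></w:p>"
    )


# First eight paragraphs of the sample: (text, ilvl, numId).
SAMPLE = [
    ("one", 0, 1), ("two", 0, 1), ("three", 0, 1),
    ("AAA", 0, 5), ("BBB", 0, 5), ("CCC", 0, 5),
    ("alpha", 2, 1), ("four", 0, 1),
]
DOCUMENT_XML = (
    f'<w:body xmlns:w="{W}">'
    + "".join(para(*item) for item in SAMPLE)
    + "</w:body>"
)


def list_items(xml):
    """Return (text, ilvl, numId) for every paragraph that has a w:numPr."""
    items = []
    for p in ET.fromstring(xml).findall("w:p", NS):
        num_pr = p.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            continue  # plain paragraph, not a list item
        ilvl = int(num_pr.find("w:ilvl", NS).get(VAL))
        num_id = num_pr.find("w:numId", NS).get(VAL)
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        items.append((text, ilvl, num_id))
    return items


def group_by_consecutive_num_id(items):
    """Naive grouping: a new list starts whenever numId changes."""
    return [
        (num_id, [text for text, _, _ in run])
        for num_id, run in groupby(items, key=lambda item: item[2])
    ]


items = list_items(DOCUMENT_XML)
runs = group_by_consecutive_num_id(items)
distinct_num_ids = sorted({num_id for _, _, num_id in items})
print(len(runs), "consecutive runs,", len(distinct_num_ids), "distinct numIds")
```

The naive grouping yields three runs (numId 1, then 5, then 1 again) even though only two distinct `w:numId` values appear, which matches the symptom: "alpha" and "four" end up detached from "one", "two", "three".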

jhubert avatar Mar 15 '16 21:03 jhubert

Can you include numbering.xml and styles.xml as well?

kylegibson avatar Mar 15 '16 21:03 kylegibson

Of course. Gist updated: https://gist.github.com/jhubert/29f7899073b765e74297

Also, here is the docx file: nested_multitype_lists.docx

jhubert avatar Mar 15 '16 21:03 jhubert

@kylegibson I'm about to work on this issue. Have you already started?

jhubert avatar Apr 12 '16 15:04 jhubert

Hi Jeremy. None of us have started work on this issue. I expect it will be a while before we have time to dedicate to fixing this. We'll be happy to review any PRs that you submit!

kylegibson avatar Apr 12 '16 15:04 kylegibson

Awesome. Good to know. We'll get a PR together. :)

jhubert avatar Apr 12 '16 16:04 jhubert

The fix that @botzill put together in #225 is now live in production. No issues so far. 💯

jhubert avatar Feb 09 '17 06:02 jhubert