pydocx
Space between item listing/tables
Currently if we have a list like:

is exported as:

So, there is no space between items.
The same happens with tables. Input:

output:

Looking at the code, I see that this was done deliberately, via:

    def export_paragraph(self, paragraph):
        results = super(PyDocXHTMLExporter, self).export_paragraph(paragraph)
        results = is_not_empty_and_not_only_whitespace(results)
        if results is None:
            return
Is there a reason why we do that?
Basically, I think we need to detect empty paragraphs and convert them into <br/> to get proper output.
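As a rough illustration of the proposal above (not pydocx's actual code), here is a minimal sketch that walks WordprocessingML paragraphs with the standard library and emits <br/> for empty ones. The helper names paragraph_text and paragraphs_to_html are hypothetical, and "empty" is assumed to mean "no non-whitespace text in any w:t element", which ignores images, breaks and other non-text content.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML main namespace.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % W_NS


def paragraph_text(p):
    """Concatenate the text of all w:t descendants of a w:p element."""
    return "".join(t.text or "" for t in p.iter(W + "t"))


def paragraphs_to_html(body_xml):
    """Hypothetical helper: render each w:p as <p>...</p>, or <br/> if empty."""
    root = ET.fromstring(body_xml)
    parts = []
    for p in root.iter(W + "p"):
        text = paragraph_text(p)
        if text.strip():
            parts.append("<p>%s</p>" % escape(text))
        else:
            # Empty or whitespace-only paragraph: keep the vertical gap
            # instead of dropping the paragraph.
            parts.append("<br/>")
    return "".join(parts)


SAMPLE = (
    '<w:body xmlns:w="%s">'
    "<w:p><w:r><w:t>one</w:t></w:r></w:p>"
    "<w:p/>"
    "<w:p><w:r><w:t>two</w:t></w:r></w:p>"
    "</w:body>"
) % W_NS

print(paragraphs_to_html(SAMPLE))  # <p>one</p><br/><p>two</p>
```

Whether <br/> is the right replacement in every context (inside list items or table cells, for example) is exactly the open question in this issue.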
If I recall correctly, it's because Word documents can contain these blank p's that don't actually render to anything in the document. Empty p's in OOXML do not necessarily translate to a line break in HTML. If in doubt: 1) check the spec: how does it say empty p's should be handled? 2) construct a Word document with some empty p's and open it in Word. What happens?
Yes, I did some tests, and basically if we add an empty <w:p/> it is rendered as a new line. Of course, there can be different scenarios depending on where the <w:p/> is located. To be honest, I could not find proper information about empty p elements; I just ran tests with documents.
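For anyone who wants to repeat this kind of test, the following sketch builds a small .docx with two empty <w:p/> elements between two text paragraphs, using only the standard library. The three-part package layout is the usual minimal OOXML structure; it is assumed here, not verified, that Word opens such a stripped-down file without complaint. The function name build_docx and the output file name are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" '
    'ContentType="application/vnd.openxmlformats-officedocument.'
    'wordprocessingml.document.main+xml"/>'
    "</Types>"
)

RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/'
    'relationships/officeDocument" Target="word/document.xml"/>'
    "</Relationships>"
)

# Two empty paragraphs sit between the two text paragraphs.
DOCUMENT = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<w:document xmlns:w="%s"><w:body>'
    "<w:p><w:r><w:t>first</w:t></w:r></w:p>"
    "<w:p/>"
    "<w:p/>"
    "<w:p><w:r><w:t>second</w:t></w:r></w:p>"
    "</w:body></w:document>"
) % W_NS


def build_docx():
    """Return the bytes of a minimal .docx package (hypothetical helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", CONTENT_TYPES)
        zf.writestr("_rels/.rels", RELS)
        zf.writestr("word/document.xml", DOCUMENT)
    return buf.getvalue()


if __name__ == "__main__":
    # Write the file so it can be opened in Word and fed to the exporter.
    with open("empty_paragraphs.docx", "wb") as fh:
        fh.write(build_docx())
```

Opening the resulting file in Word and comparing it with the exported HTML shows directly whether the empty paragraphs take up vertical space.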
I did some work related to this here: https://github.com/botzill/pydocx/commit/34ee04591e324511880eed52f8fc0757e4360917.
To properly allow <w:p/> to be rendered, we need to reset the default margins of the HTML p tag and allow those empty p elements to be processed. An empty paragraph is replaced with <p> </p> so that it works in lists as well.
This way we don't actually need this method: https://github.com/CenterForOpenScience/pydocx/blob/9cd76eeb1f99cb3e580a8138a00295087f86eae0/pydocx/export/base.py#L255.
I'm not sure yet whether this covers all the cases, but from the tests I did it seems fine so far.
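To make the idea concrete, here is a small standalone sketch of that approach; it is not the code from the linked commit. It assumes the placeholder inside the empty paragraph is a non-breaking space (written as &nbsp;) and that the margin reset is a plain "p { margin: 0; }" rule. The names render_paragraph and render_document are hypothetical.

```python
from xml.sax.saxutils import escape

# Assumption: a non-breaking space keeps the empty <p> from collapsing.
EMPTY_PARAGRAPH_HTML = "<p>&nbsp;</p>"

# Assumption: with default margins removed, an empty <p> is the only
# source of vertical space between paragraphs, similar to Word.
RESET_CSS = "p { margin: 0; }"


def render_paragraph(text):
    """Render one paragraph; empty ones become a placeholder, not None."""
    if text is None or not text.strip():
        return EMPTY_PARAGRAPH_HTML
    return "<p>%s</p>" % escape(text)


def render_document(paragraph_texts):
    """Wrap the rendered paragraphs in a page that carries the CSS reset."""
    body = "".join(render_paragraph(t) for t in paragraph_texts)
    return (
        "<html><head><style>%s</style></head><body>%s</body></html>"
        % (RESET_CSS, body)
    )


print(render_document(["first", "", "second"]))
```

Because the placeholder is itself a paragraph, the same output can be nested inside a list item or table cell without special handling, which is presumably why it "works in lists as well".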
The info I found about p: https://msdn.microsoft.com/en-us/library/gg278323.aspx
The most basic unit of block-level content within a WordprocessingML document, paragraphs are stored using the <p> element. A paragraph defines a distinct division of content that begins on a new line. A paragraph can contain three pieces of information: optional paragraph properties, inline content (typically runs), and a set of optional revision IDs used to compare the content of two documents.
Also here: https://msdn.microsoft.com/en-us/library/documentformat.openxml.wordprocessing.paragraph.aspx. But no info related to empty paragraphs.