pydocx
Space between item listing/tables
Currently if we have a list like:
is exported as:
So, there is no space between items.
The same applies to tables. Input:
Output:
Checking the code, I see that this was done deliberately, via:
def export_paragraph(self, paragraph):
    results = super(PyDocXHTMLExporter, self).export_paragraph(paragraph)
    results = is_not_empty_and_not_only_whitespace(results)
    if results is None:
        return
Any reason why we do that?
I think we need to detect empty paragraphs and convert them into <br/> to get proper output.
If I recall correctly, it's because Word documents can contain these blank p's that don't actually render as anything in the document. Empty p's in OOXML do not necessarily translate to a line break in HTML. If in doubt: 1) check the spec: how does it say empty p's should be handled? 2) construct a Word document with some empty p's and open it in Word. What happens?
Yes, I did some tests: if we add an empty <w:p/>, it is rendered as a new line. Of course, the result may differ depending on where the <w:p/> is located. To be honest, I could not find proper documentation about empty p's; I only ran tests with documents.
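For anyone who wants to repeat that kind of test, here is a minimal sketch for locating empty paragraphs in a WordprocessingML body. It uses only the standard library; the sample XML and the helper names (`paragraph_text`, `empty_paragraph_indexes`) are made up for illustration and are not part of pydocx. It also assumes "empty" means "no visible text", so a paragraph holding only an image would be reported as empty too.

```python
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical sample body; a real document.xml carries far more markup.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:t>first</w:t></w:r></w:p>'
    '<w:p/>'
    '<w:p><w:pPr/></w:p>'
    '<w:p><w:r><w:t>second</w:t></w:r></w:p>'
    '</w:body></w:document>' % W_NS
)


def paragraph_text(p):
    """Concatenate the text of all w:t descendants of a w:p element."""
    return "".join(t.text or "" for t in p.iter("{%s}t" % W_NS))


def empty_paragraph_indexes(xml_string):
    """Return the positions of w:p elements that hold no visible text."""
    root = ET.fromstring(xml_string)
    paragraphs = list(root.iter("{%s}p" % W_NS))
    return [i for i, p in enumerate(paragraphs)
            if not paragraph_text(p).strip()]


if __name__ == "__main__":
    print(empty_paragraph_indexes(SAMPLE))
```

On the sample above, both the bare `<w:p/>` and the paragraph that only has `<w:pPr/>` are reported, which are the two shapes an empty paragraph usually takes.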
I did some work related to this here: https://github.com/botzill/pydocx/commit/34ee04591e324511880eed52f8fc0757e4360917.
To allow <w:p/> to be rendered properly, we need to reset the default margins of the HTML p tag and let those empty p's be processed. An empty paragraph is replaced with <p> </p> so that it works in lists as well.
This way we don't actually need this method: https://github.com/CenterForOpenScience/pydocx/blob/9cd76eeb1f99cb3e580a8138a00295087f86eae0/pydocx/export/base.py#L255.
I'm not sure yet whether this covers all cases, but from the tests I did it seems fine so far.
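The idea described above (keep empty paragraphs, emit a placeholder, reset the p margins) can be sketched roughly as follows. This is a standalone illustration, not the code from the linked commit: `render_paragraphs`, `STYLE` and `EMPTY_PARAGRAPH` are hypothetical names, and using a non-breaking space as the placeholder is an assumption on my side, chosen because a p containing only ordinary whitespace collapses to zero height in browsers.

```python
from html import escape

# Assumption: resetting the default <p> margins makes an empty paragraph
# occupy exactly one line instead of one line plus margins.
STYLE = "<style>p { margin: 0; }</style>"

# Assumption: a non-breaking space gives the placeholder paragraph height.
EMPTY_PARAGRAPH = "<p>&nbsp;</p>"


def render_paragraphs(texts):
    """Render paragraph texts as HTML, keeping empty ones as placeholders."""
    parts = [STYLE]
    for text in texts:
        if text.strip():
            parts.append("<p>%s</p>" % escape(text))
        else:
            # Instead of dropping the paragraph, keep a visible blank line.
            parts.append(EMPTY_PARAGRAPH)
    return "".join(parts)


if __name__ == "__main__":
    print(render_paragraphs(["first item", "", "second item"]))
```

Because the blank line is still a p element rather than a bare <br/>, it can sit inside a list item or table cell like any other paragraph.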
The info I found about p: https://msdn.microsoft.com/en-us/library/gg278323.aspx
The most basic unit of block-level content within a WordprocessingML document, paragraphs are stored using the <p> element. A paragraph defines a distinct division of content that begins on a new line. A paragraph can contain three pieces of information: optional paragraph properties, inline content (typically runs), and a set of optional revision IDs used to compare the content of two documents.
Also here: https://msdn.microsoft.com/en-us/library/documentformat.openxml.wordprocessing.paragraph.aspx. But there is no information there about empty paragraphs.