pydocx
Export content only
It would be extremely helpful to me if it were possible to export only the content of <body>, with no <html> or <head> tags. I intend to pass the document on to further processing and will provide those parts myself.
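Until an exporter option exists, one possible workaround is to post-process the exported HTML. The sketch below is not part of pydocx: `extract_body` is a hypothetical helper, and it assumes the exporter returns a complete HTML document as a string and that the wanted part is whatever sits inside the body element.

```python
import re

# Capture everything between <body ...> and </body>.
# DOTALL lets "." span newlines; IGNORECASE tolerates <BODY>.
_BODY_RE = re.compile(r"<body\b[^>]*>(.*)</body\s*>", re.IGNORECASE | re.DOTALL)


def extract_body(html):
    """Return the markup inside <body>, or the input unchanged if no body is found."""
    match = _BODY_RE.search(html)
    return match.group(1).strip() if match else html


page = "<html><head><title>t</title></head><body><p>Hello</p></body></html>"
print(extract_body(page))
```

The `html` argument could come from, for example, the string that `PyDocX.to_html("test.docx")` returns. A regular expression is enough here only because a single, unnested body element is assumed; arbitrary HTML would need a real parser.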
Hello @tritium21,
You could use the pandoc tool to achieve what you are looking for. Once installed, you can convert a document to plain text with the following command in the terminal or command prompt:
pandoc test.docx -f docx -t plain -s -o test.txt
Hope the above helps you.
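If the conversion has to run from a script rather than by hand, the same command can be wrapped with the standard library. This is a minimal sketch: `build_pandoc_command` and `docx_to_text` are hypothetical names, and it assumes the pandoc executable is installed and on PATH.

```python
import subprocess


def build_pandoc_command(source, target):
    """Build the argument list for the pandoc call shown above."""
    return ["pandoc", source, "-f", "docx", "-t", "plain", "-s", "-o", target]


def docx_to_text(source, target):
    """Run pandoc on source and write plain text to target.

    Raises subprocess.CalledProcessError if pandoc exits with an error,
    and FileNotFoundError if the pandoc executable cannot be found.
    """
    subprocess.run(build_pandoc_command(source, target), check=True)


# Example call (requires pandoc and an existing test.docx):
# docx_to_text("test.docx", "test.txt")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.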
It would not be difficult to create a custom parser that strips out all the tags. It's something we've wanted to include anyway, so if you end up using that approach, PRs are welcome.
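As a rough illustration of what stripping the tags could look like, here is a sketch that works on already exported HTML using Python's standard `html.parser` module. It is not a pydocx exporter and does not touch pydocx internals; `TextOnly` and `strip_tags` are hypothetical names, and the set of tags treated as line-ending blocks is an assumption.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes, skipping anything inside <head>, <style> or <script>."""

    SKIPPED = {"head", "style", "script"}
    # Closing one of these tags ends a line in the output (assumed set).
    BLOCKS = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.BLOCKS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self.skip_depth:
            self.parts.append(data)


def strip_tags(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


print(strip_tags("<body><p>Hello <b>world</b></p><p>Bye<br />now</p></body>"))
```

A parser-based approach is used instead of a regular expression so that character references such as `&amp;` are decoded and style or script content is not mistaken for text.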
I have done something similar:
from pydocx.export.base import PyDocXExporter


class RawExporter(PyDocXExporter):
    def apply_newlines(self, nodes):
        # Join whatever the base exporter produced with newlines;
        # an empty or missing result becomes an empty string.
        if nodes:
            return '\n'.join(nodes)
        return ''

    def export_paragraph(self, paragraph):
        nodes = super(RawExporter, self).export_paragraph(paragraph)
        return self.apply_newlines(nodes)

    def export_break(self, br):
        nodes = super(RawExporter, self).export_break(br)
        return self.apply_newlines(nodes)


# A .docx file is a zip archive, so open it in binary mode.
with open('test.docx', 'rb') as fp:
    output = ''.join(RawExporter(fp).export())
print(output)