pydocx

Export content only

Open tritium21 opened this issue 7 years ago • 3 comments

It would be extremely helpful to me if it were possible to export only the content of <body>, with no <html> or <head> tags. I intend to pass the document on to further processing and will provide those parts myself.

tritium21 avatar May 27 '17 05:05 tritium21

Hello @tritium21,

You could use the pandoc tool to achieve what you are looking for. Once it is installed, you can convert a document to plain text by running the following command in a terminal or command prompt:

pandoc test.docx -f docx -t plain -s -o test.txt

Hope the above helps you.

bitscompagnie avatar Nov 02 '17 08:11 bitscompagnie
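
Since the original request was for HTML without the wrapper tags, it may help to know that pandoc writes an HTML fragment (no html, head, or body wrapper) when the -s/--standalone flag is omitted. Below is a minimal sketch of driving that from Python. The helper names `pandoc_fragment_command` and `convert_to_fragment` and the file names are hypothetical, and the sketch assumes pandoc is installed and on the PATH.

```python
import shutil
import subprocess


def pandoc_fragment_command(src, dst):
    """Build a pandoc command line that writes an HTML fragment.

    Omitting -s/--standalone makes pandoc skip the <html>, <head>, and
    <body> wrapper. The function name is illustrative, not a pandoc API.
    """
    return ['pandoc', src, '-f', 'docx', '-t', 'html', '-o', dst]


def convert_to_fragment(src, dst):
    """Run pandoc if it is available; return True when the conversion ran."""
    if shutil.which('pandoc') is None:
        return False  # pandoc is not installed or not on PATH
    subprocess.run(pandoc_fragment_command(src, dst), check=True)
    return True
```

Calling `convert_to_fragment('test.docx', 'test.html')` would then leave only the body-level markup in test.html, ready to be wrapped by whatever processing comes next.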

It would not be difficult to create a custom parser that strips out all the tags. It's something we've wanted to include anyway, so if you end up using that approach, PRs are welcome.

jlward avatar Nov 02 '17 14:11 jlward
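
Until an exporter like that exists, one possible workaround is to post-process the full HTML and keep only what sits inside the body element. This is a sketch of that idea, not the custom parser described above. `extract_body` is a hypothetical helper, and it assumes the input is a string of HTML with at most one body element, such as the output of pydocx's `PyDocX.to_html`.

```python
import re

# Capture everything between the opening and closing body tags.
# DOTALL lets "." span newlines; IGNORECASE accepts <BODY> as well.
_BODY_RE = re.compile(r'<body\b[^>]*>(.*)</body\s*>', re.IGNORECASE | re.DOTALL)


def extract_body(html):
    """Return the markup inside <body>...</body>.

    If no body element is found, the input is returned unchanged, on the
    assumption that it is already a fragment.
    """
    match = _BODY_RE.search(html)
    return match.group(1) if match else html
```

A regular expression is enough here because only one well-known wrapper is being removed. For arbitrary or malformed HTML, a real parser would be the safer choice.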

I have done something similar:

from pydocx.export.base import PyDocXExporter


class RawExporter(PyDocXExporter):
    """Exporter that emits the document text without any HTML tags."""

    def apply_newlines(self, nodes):
        # Join the exported fragments with newlines; no fragments gives ''.
        if nodes:
            return '\n'.join(nodes)
        return ''

    def export_paragraph(self, paragraph):
        nodes = super(RawExporter, self).export_paragraph(paragraph)
        return self.apply_newlines(nodes)

    def export_break(self, br):
        nodes = super(RawExporter, self).export_break(br)
        return self.apply_newlines(nodes)


# A .docx file is a zip archive, so it must be opened in binary mode.
with open('test.docx', 'rb') as fp:
    output = ''.join(RawExporter(fp).export())
    print(output)

@tritium21

IuryAlves avatar May 24 '19 10:05 IuryAlves