python-docx2txt
It does not convert numbered items
Your code does not convert numbered items in lists.
An example:
- bla bla bla bla
- bla bla bla bla
are converted to:
bla bla bla bla bla bla bla bla
Hi, I ran into the same issue mentioned by robo3945. We should also add code to extract endnotes.xml and footnotes.xml, similar to what is done for the header*.xml and footer*.xml files.
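For anyone who needs the notes before this lands, here is a minimal sketch of reading them straight from the .docx archive with the standard library. It is not python-docx2txt code: the function name extract_notes is hypothetical, and it assumes the usual part names word/footnotes.xml and word/endnotes.xml and that the note text sits in w:t elements.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}tag form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_notes(docx_file):
    """Hypothetical helper: return {part name: [paragraph text, ...]}."""
    notes = {}
    with zipfile.ZipFile(docx_file) as archive:
        names = set(archive.namelist())
        for part in ("word/footnotes.xml", "word/endnotes.xml"):
            if part not in names:
                continue  # both parts are optional in a .docx
            root = ET.fromstring(archive.read(part))
            paragraphs = []
            for para in root.iter(W + "p"):
                # A paragraph's text may be split across several runs.
                text = "".join(t.text or "" for t in para.iter(W + "t"))
                if text:  # skip separator notes that carry no text
                    paragraphs.append(text)
            notes[part] = paragraphs
    return notes
```

Calling extract_notes('path/to/file.docx') returns a dict keyed by part name, so footnotes and endnotes stay separate.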
I will try to work on this feature but feel free to send a pull request too.
Yes this would be super, super useful if we could extract bullets/numbered lists specifically!
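In case it helps whoever picks this up, below is a rough sketch of how list paragraphs could be detected in word/document.xml. It is not python-docx2txt code. It assumes a list item is a paragraph with a w:numPr element under w:pPr, and that the nesting level is in w:ilvl. It does not read numbering.xml, so it cannot tell bullets from numbers and prefixes every item with the same marker. Paragraphs that get their list formatting only through a style are missed. The function name is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}tag form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraphs_with_list_markers(docx_file, marker="- "):
    """Hypothetical helper: plain text with a marker on list paragraphs."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for para in root.iter(W + "p"):
        text = "".join(t.text or "" for t in para.iter(W + "t"))
        num_pr = para.find(W + "pPr/" + W + "numPr")
        if num_pr is not None:  # direct list formatting on this paragraph
            ilvl = num_pr.find(W + "ilvl")
            level = int(ilvl.get(W + "val", "0")) if ilvl is not None else 0
            text = "  " * level + marker + text  # indent nested items
        lines.append(text)
    return "\n".join(lines)
```

With this, each list item stays on its own line instead of being run together with its neighbours.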
Hi, does anybody have an update on this issue? I am still not able to extract bullets/numbered lists.
Hi...? Is anybody here...?
I have a fork that converts bullets and numbers in docx to plain text.
https://github.com/ShayHill/docx2python
To get something close to python-docx2txt output:
pip install docx2python
from docx2python import docx2python
text = docx2python('path/to/file.docx').text
@ShayHill very good, thank you
@ShayHill Is there an option to not convert hyperlinks to <a> tags? I just need the text.
There is not, though this may be accomplished post-conversion with re.sub:
re.sub(r'<a.*?>(.*?)<a/>', r"\1", text)
very good, thanks
But it should be </a>, not <a/> 😅😅
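For reference, a small sketch of the re.sub approach with the closing tag written as </a>. It assumes the hyperlinks appear in the text as <a href="...">link text</a>. The function name strip_anchor_tags and the sample string are made up for illustration.

```python
import re


def strip_anchor_tags(text):
    """Replace each <a ...>link text</a> with just the link text."""
    # Non-greedy group so that several links on one line are handled
    # separately; DOTALL lets link text span a line break.
    return re.sub(r"<a\b[^>]*>(.*?)</a>", r"\1", text, flags=re.DOTALL)


sample = 'See <a href="https://example.com">the docs</a> for details.'
print(strip_anchor_tags(sample))  # See the docs for details.
```

Only the tags are removed, so the URL in the href attribute is dropped along with them.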