python-docx2txt

It does not convert numbered items

Open robo3945 opened this issue 7 years ago • 10 comments

Your code does not convert numbered items in lists.

An example:

  1. bla bla bla bla
  2. bla bla bla bla

Are converted to:

bla bla bla bla bla bla bla bla

robo3945 avatar May 07 '17 05:05 robo3945

Hi, I ran into the same issue robo3945 describes. We should also add code to extract endnotes.xml and footnotes.xml, similar to the existing handling of the header and footer*.xml files.

Deep12d avatar May 09 '17 10:05 Deep12d

I will try to work on this feature but feel free to send a pull request too.

ankushshah89 avatar May 26 '17 15:05 ankushshah89

Yes this would be super, super useful if we could extract bullets/numbered lists specifically!

HubCatz202 avatar Oct 29 '17 18:10 HubCatz202

Hi. Does anybody have an update on this issue? I am still not able to extract bullets/numbered lists.

biswajithgopinathan avatar Jun 28 '18 05:06 biswajithgopinathan

Hi...? Is anybody here...?

MostafaRabia avatar Dec 28 '20 08:12 MostafaRabia

I have a fork that converts bullets and numbers in docx to plain text.

https://github.com/ShayHill/docx2python

To get something close to python-docx2txt output:

pip install docx2python

from docx2python import docx2python
text = docx2python('path/to/file.docx').text

ShayHill avatar Dec 28 '20 17:12 ShayHill

@ShayHill very good, thank you

MostafaRabia avatar Dec 28 '20 17:12 MostafaRabia

@ShayHill Is there an option not to convert hyperlinks to <a> tags? I just need the text.

MostafaRabia avatar Dec 28 '20 17:12 MostafaRabia

There is not, though this may be accomplished post-conversion with re.sub:

re.sub(r'<a.*?>(.*?)<a/>', r"\1", text)

ShayHill avatar Dec 28 '20 19:12 ShayHill

very good, thanks but it's </a> not <a/> 😅😅

MostafaRabia avatar Dec 29 '20 07:12 MostafaRabia
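
For anyone landing here later, the suggestion above with the closing-tag correction applied can be sketched as follows. This is only a minimal illustration: the sample string and the helper name strip_anchor_tags are made up for the example, and the exact markup docx2python emits for hyperlinks is assumed rather than checked against the library.

```python
import re

# Hypothetical sample text; the exact <a> markup produced by
# docx2python is an assumption made for this illustration.
sample = 'See <a href="https://example.com">the docs</a> for details.'


def strip_anchor_tags(text: str) -> str:
    """Replace each <a ...>label</a> with just its label.

    Uses the pattern from the thread with the closing tag
    corrected to </a>.
    """
    return re.sub(r'<a.*?>(.*?)</a>', r'\1', text)


print(strip_anchor_tags(sample))
```

The pattern is non-greedy, so several links on one line are unwrapped one at a time. It does not span line breaks unless re.DOTALL is passed, and it is a regex shortcut, not an HTML parser.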