python-docx2txt
Save list numeration
How to save list numeration?
After a little research I learned how to collect the list numbers. I simply did this in docx2txt.py:
def xml2text(xml):
    ...
    for n, child in enumerate(root.iter()):
        if child.tag == qn('w:ilvl'):
            text += '\nGSHNCHIK' + child.attrib[qn('w:val')] + ' '  # GSHNCHIK is just a unique marker for the later split
        elif child.tag == qn('w:numId'):
            t_val = child.attrib[qn('w:val')] + '.'
            textA, textB = text.split('GSHNCHIK')
            text = textA + t_val + textB
        elif child.tag == qn('w:t'):
            t_text = child.text
            text += t_text if t_text is not None else ''
        elif child.tag == qn('w:tab'):
            text += '\t'
        elif child.tag in (qn('w:br'), qn('w:cr')):
            text += '\n'
        elif child.tag == qn('w:p'):
            text += '\n\n'
    return text
After this I can obtain the "numId" and "ilvl" attribute values (numId identifies the list an item belongs to, and ilvl is the item's nesting level). But what if the list in the docx document uses custom numbering? Like the following toy example:
1) First item
3) Second item
123) Third item
I couldn't find anything related to this in styles.xml or elsewhere. Is it actually possible?
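For reference, numbering definitions in a .docx normally live in word/numbering.xml rather than styles.xml. The sketch below is only an illustration, not part of docx2txt: the helper names `w`, `_child_val` and `parse_numbering` are made up here. It maps each numId to the start value, number format and label pattern of its levels, and applies `w:startOverride` where a list restarts at a custom value. It cannot help when the numbers were typed in as plain text.

```python
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'


def w(name):
    """Clark-notation name for a w: element or attribute (hypothetical helper)."""
    return '{%s}%s' % (W_NS, name)


def _child_val(node, name):
    """Return the w:val attribute of a direct child element, or None if absent."""
    child = node.find(w(name))
    return child.get(w('val')) if child is not None else None


def parse_numbering(numbering_xml):
    """Map numId -> {ilvl: {'start', 'numFmt', 'lvlText'}} (illustrative only)."""
    root = ET.fromstring(numbering_xml)

    # 1. Collect the level definitions of every abstract numbering.
    abstract = {}
    for an in root.iter(w('abstractNum')):
        levels = {}
        for lvl in an.iter(w('lvl')):
            start = _child_val(lvl, 'start')
            levels[int(lvl.get(w('ilvl')))] = {
                'start': int(start) if start is not None else None,
                'numFmt': _child_val(lvl, 'numFmt'),    # e.g. 'decimal', 'bullet'
                'lvlText': _child_val(lvl, 'lvlText'),  # e.g. '%1)'
            }
        abstract[an.get(w('abstractNumId'))] = levels

    # 2. Resolve each concrete list (w:num) against its abstract definition.
    result = {}
    for num in root.iter(w('num')):
        ref = _child_val(num, 'abstractNumId')
        levels = {k: dict(v) for k, v in abstract.get(ref, {}).items()}
        # A w:lvlOverride/w:startOverride restarts a level at a custom value.
        for override in num.iter(w('lvlOverride')):
            start = _child_val(override, 'startOverride')
            if start is not None:
                ilvl = int(override.get(w('ilvl')))
                levels.setdefault(ilvl, {})['start'] = int(start)
        result[num.get(w('numId'))] = levels
    return result
```

The XML can be read with the standard zipfile module, for example `zipfile.ZipFile(path).read('word/numbering.xml')`. Note that the part is missing from documents that contain no lists.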
I rewrote xml2text as follows. Lists are now rendered correctly:
def xml2text(xml):
    # ET (xml.etree.ElementTree) and qn() are the ones already used in docx2txt.py
    text = u''
    root = ET.fromstring(xml)
    count_numId = set()
    compare = False
    for n, child in enumerate(root.iter()):
        if child.tag == qn('w:ilvl'):
            # KIHCNHSGANILUHSOG is just an arbitrary unique marker for the later splits
            text += '\nKIHCNHSGANILUHSOG' + child.attrib[qn('w:val')] + ' '
        elif child.tag == qn('w:numId'):
            t_val = child.attrib[qn('w:val')] + '.'
            # Splitting on the first half of the marker leaves '<numId>.ANILUHSOG<ilvl> '
            textA, textB = text.split('KIHCNHSG')
            text = textA + t_val + textB
            try:
                compare = max(list(count_numId)) > int(child.attrib[qn('w:val')])
            except ValueError:  # count_numId is still empty
                pass
            if not compare:
                count_numId.update({int(child.attrib[qn('w:val')])})
        elif child.tag == qn('w:t'):
            t_text = child.text
            text += t_text if t_text is not None else ''
        elif child.tag == qn('w:tab'):
            text += '\t'
        elif child.tag in (qn('w:br'), qn('w:cr')):
            text += '\n'
        elif child.tag == qn('w:p'):
            text += '\n'
    # Replace each '<numId>.ANILUHSOG' marker with a running item number
    for ii in list(count_numId):
        splt = str(ii) + '.ANILUHSOG'
        count = False
        count_2 = 2
        text = text.split(splt)
        for i in range(len(text)):
            if text[i][0] == '0' and not count:
                # first top-level item of this list
                text[i] = str(count_2 - 1) + '.' + text[i]
                count = True
            elif text[i][0] == '0' and count:
                # following top-level items
                text[i] = str(count_2) + '.' + text[i]
                count_2 += 1
            elif count and text[i][0] in '123456789':
                # nested item: reuse the number of the current top-level item
                text[i] = str(count_2 - 1) + '.' + text[i]
        text = ''.join(text)
    return text
Image example: https://yadi.sk/d/l3mORW4Rk2VEDA
The last (5th) element in the image example above was made with custom list numbering, and its "5" is a symbol, not a number. This case is still unsolved.
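As a possible alternative to the marker-and-split approach above, the numbers could be tracked with one counter per (numId, ilvl) pair while walking the paragraphs. This is only a sketch under assumptions: every list is decimal and starts at 1, custom start values and symbol labels like the one in the image are ignored, and `numbered_paragraphs` plus the local `qn` stand-in are hypothetical names, not docx2txt functions.

```python
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
NS = {'w': W_NS}


def qn(tag):
    """Expand 'w:p' to Clark notation (a stand-in for docx2txt's own qn)."""
    prefix, name = tag.split(':')
    return '{%s}%s' % (NS[prefix], name)


def numbered_paragraphs(document_xml):
    """Return the paragraph texts, prefixing list items with running numbers."""
    root = ET.fromstring(document_xml)
    counters = {}  # (numId, ilvl) -> last number handed out
    lines = []
    for p in root.iter(qn('w:p')):
        body = ''.join(t.text or '' for t in p.iter(qn('w:t')))
        num_pr = p.find('w:pPr/w:numPr', NS)
        num_id_node = num_pr.find('w:numId', NS) if num_pr is not None else None
        if num_id_node is None:
            lines.append(body)  # ordinary paragraph, not a list item
            continue
        num_id = num_id_node.get(qn('w:val'))
        ilvl_node = num_pr.find('w:ilvl', NS)
        ilvl = int(ilvl_node.get(qn('w:val'))) if ilvl_node is not None else 0
        # A new item at this level restarts all deeper levels of the same list.
        for key in [k for k in counters if k[0] == num_id and k[1] > ilvl]:
            del counters[key]
        counters[(num_id, ilvl)] = counters.get((num_id, ilvl), 0) + 1
        # Build a label such as '2.1' from the counters of levels 0..ilvl.
        label = '.'.join(str(counters.get((num_id, level), 1))
                         for level in range(ilvl + 1))
        lines.append('%s%s. %s' % ('\t' * ilvl, label, body))
    return '\n'.join(lines)
```

Keeping the counters in a dictionary avoids inserting and later splitting on marker strings, so the text of the document itself can never collide with a marker.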
Can you upload an example file somewhere?
https://github.com/goshulina/docx2txt/blob/master/app.py
Thank you. I'm looking for an example of an enumerated list document that failed. I'd like to make sure my own module can handle such documents.