python-docx2txt

Save list numeration

Open goshulina opened this issue 6 years ago • 4 comments

How can I preserve list numbering when extracting text?

After a little research I worked out how to collect the list numbers. I simply changed xml2text in docx2txt.py like this:

def xml2text(xml):
    ...
    for n, child in enumerate(root.iter()):
        if child.tag == qn('w:ilvl'):
            # GSHNCHIK is an arbitrary unique marker; the w:numId branch below splits on it
            text += '\nGSHNCHIK' + child.attrib[qn('w:val')] + ' '
        elif child.tag == qn('w:numId'):
            t_val = child.attrib[qn('w:val')] + '.'
            # Assumes w:ilvl came just before w:numId, so exactly one marker is present
            textA, textB = text.split('GSHNCHIK')
            text = textA + t_val + textB
        elif child.tag == qn('w:t'):
            t_text = child.text
            text += t_text if t_text is not None else ''
        elif child.tag == qn('w:tab'):
            text += '\t'
        elif child.tag in (qn('w:br'), qn('w:cr')):
            text += '\n'
        elif child.tag == qn('w:p'):
            text += '\n\n'
    return text

With this I can read the "numId" and "ilvl" attribute values (numId identifies the list a paragraph belongs to, and ilvl is the nesting level within that list). But what if my list has custom numbering in the docx document? A toy example:

1) First item
3) Second item
123) Third item

I couldn't find anything that controls this in styles.xml or anywhere else. Is it really possible?
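The numId/ilvl lookup described above can also be done paragraph by paragraph, without a marker string. This is only a minimal sketch: `list_paragraphs` is a hypothetical helper, and the `qn` here is a small stand-in for the one in docx2txt.py, assumed to expand `w:` names to the WordprocessingML namespace.

```python
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

def qn(tag):
    """Stand-in for docx2txt's qn: expand 'w:name' to '{namespace}name'."""
    _prefix, name = tag.split(':')
    return '{%s}%s' % (W_NS, name)

def list_paragraphs(xml):
    """Return one (numId, ilvl, text) tuple per paragraph.

    numId and ilvl are None for paragraphs that carry no w:numPr.
    """
    root = ET.fromstring(xml)
    rows = []
    for p in root.iter(qn('w:p')):
        num_id = ilvl = None
        num_pr = p.find(qn('w:pPr') + '/' + qn('w:numPr'))
        if num_pr is not None:
            num_id_el = num_pr.find(qn('w:numId'))
            ilvl_el = num_pr.find(qn('w:ilvl'))
            if num_id_el is not None:
                num_id = int(num_id_el.get(qn('w:val')))
                # Assumption: a missing w:ilvl is treated as level 0
                ilvl = int(ilvl_el.get(qn('w:val'))) if ilvl_el is not None else 0
        text = ''.join(t.text or '' for t in p.iter(qn('w:t')))
        rows.append((num_id, ilvl, text))
    return rows

# Hand-written sample, not taken from a real document
SAMPLE = (
    '<w:document xmlns:w="' + W_NS + '"><w:body>'
    '<w:p><w:r><w:t>Intro</w:t></w:r></w:p>'
    '<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="2"/></w:numPr></w:pPr>'
    '<w:r><w:t>First item</w:t></w:r></w:p>'
    '</w:body></w:document>'
)
rows = list_paragraphs(SAMPLE)
```

Working on whole paragraphs avoids the split-and-rejoin step, and a paragraph whose w:numPr lacks w:ilvl no longer breaks the unpacking.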
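One place that may be worth checking is word/numbering.xml inside the .docx archive, where WordprocessingML keeps list definitions (w:abstractNum levels with w:start, w:numFmt and w:lvlText, plus per-list w:lvlOverride/w:startOverride). The sketch below reads those values; `parse_numbering` and `read_numbering` are hypothetical helpers, and whether a given document stores its custom numbers this way rather than as typed text is an assumption.

```python
import zipfile
import xml.etree.ElementTree as ET

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

def _val(parent, name):
    """Return the w:val attribute of child <w:name>, or None if the child is absent."""
    el = parent.find(W + name)
    return el.get(W + 'val') if el is not None else None

def parse_numbering(xml):
    """Map numId -> {ilvl: {'start', 'numFmt', 'lvlText'}} from numbering.xml content."""
    root = ET.fromstring(xml)
    abstract = {}
    for a in root.findall(W + 'abstractNum'):
        levels = {}
        for lvl in a.findall(W + 'lvl'):
            start = _val(lvl, 'start')
            levels[int(lvl.get(W + 'ilvl'))] = {
                'start': int(start) if start is not None else None,
                'numFmt': _val(lvl, 'numFmt'),
                'lvlText': _val(lvl, 'lvlText'),
            }
        abstract[a.get(W + 'abstractNumId')] = levels
    nums = {}
    for n in root.findall(W + 'num'):
        base = abstract.get(_val(n, 'abstractNumId'), {})
        # Copy so an override on one list does not leak into another
        levels = {k: dict(v) for k, v in base.items()}
        for ov in n.findall(W + 'lvlOverride'):
            start = _val(ov, 'startOverride')
            if start is not None:
                levels.setdefault(int(ov.get(W + 'ilvl')), {})['start'] = int(start)
        nums[int(n.get(W + 'numId'))] = levels
    return nums

def read_numbering(docx_path):
    """Parse word/numbering.xml from a .docx file; empty dict if the part is missing."""
    with zipfile.ZipFile(docx_path) as z:
        if 'word/numbering.xml' not in z.namelist():
            return {}
        return parse_numbering(z.read('word/numbering.xml'))

# Hand-written sample: list 2 reuses list 1's definition but restarts at 3
SAMPLE = (
    '<w:numbering xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:abstractNum w:abstractNumId="0"><w:lvl w:ilvl="0">'
    '<w:start w:val="1"/><w:numFmt w:val="decimal"/><w:lvlText w:val="%1)"/>'
    '</w:lvl></w:abstractNum>'
    '<w:num w:numId="1"><w:abstractNumId w:val="0"/></w:num>'
    '<w:num w:numId="2"><w:abstractNumId w:val="0"/>'
    '<w:lvlOverride w:ilvl="0"><w:startOverride w:val="3"/></w:lvlOverride></w:num>'
    '</w:numbering>'
)
numbering = parse_numbering(SAMPLE)
```

The numId keys match the w:numId values found in document.xml, so the two can be joined to recover the format and starting value of each list.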

goshulina • Oct 02 '18 07:10

I rewrote xml2text as follows. Lists now come out with their numbers:

def xml2text(xml):
    text = u''
    root = ET.fromstring(xml)
    count_numId = set()
    compare = False
    for n, child in enumerate(root.iter()):
        if child.tag == qn('w:ilvl'):
            # KIHCNHSGANILUHSOG is an arbitrary unique marker used for the splits below
            text += '\nKIHCNHSGANILUHSOG' + child.attrib[qn('w:val')] + ' '
        elif child.tag == qn('w:numId'):
            t_val = child.attrib[qn('w:val')] + '.'
            # Splitting on the first half leaves '<numId>.ANILUHSOG<ilvl> ' in the text
            textA, textB = text.split('KIHCNHSG')
            text = textA + t_val + textB
            num_id = int(child.attrib[qn('w:val')])
            # Record this numId unless a larger one has already been seen
            if not count_numId or max(count_numId) <= num_id:
                count_numId.add(num_id)
        elif child.tag == qn('w:t'):
            t_text = child.text
            text += t_text if t_text is not None else ''
        elif child.tag == qn('w:tab'):
            text += '\t'
        elif child.tag in (qn('w:br'), qn('w:cr')):
            text += '\n'
        elif child.tag == qn('w:p'):
            text += '\n'
    # Second pass: for each recorded numId, split on its marker and number the pieces
    for ii in list(count_numId):
        splt = str(ii) + '.ANILUHSOG'
        count = False  # True once the first top-level item has been numbered
        count_2 = 2    # one more than the number of the latest top-level item
        text = text.split(splt)
        for i in range(len(text)):
            first = text[i][:1]  # ilvl digit that follows the marker; '' for an empty piece
            if first == '0' and not count:
                text[i] = str(count_2 - 1) + '.' + text[i]
                count = True
            elif first == '0' and count:
                text[i] = str(count_2) + '.' + text[i]
                count_2 += 1
            elif count and first != '' and first in '123456789':
                text[i] = str(count_2 - 1) + '.' + text[i]
        text = ''.join(text)
    return text

Image example: https://yadi.sk/d/l3mORW4Rk2VEDA

The last (5th) element in the image was made with custom list numbering, and its 5 is a symbol rather than a number. That case is still unsolved.

goshulina • Oct 09 '18 17:10
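The renumbering pass above can also be sketched with one counter per list and level instead of string splits. This is an illustration only: `number_items` is a hypothetical helper that takes (numId, ilvl, text) tuples, always counts in decimal from 1, and ignores whatever formats or start values the document's numbering part defines, so it would not reproduce custom numbering either.

```python
def number_items(rows):
    """Prefix list paragraphs with a running number per (numId, ilvl).

    rows: iterable of (num_id, ilvl, text); num_id is None for plain paragraphs.
    """
    counters = {}  # num_id -> {ilvl: last number used at that level}
    out = []
    for num_id, ilvl, text in rows:
        if num_id is None:
            out.append(text)
            continue
        levels = counters.setdefault(num_id, {})
        levels[ilvl] = levels.get(ilvl, 0) + 1
        # A new item at this level restarts every deeper level
        for deeper in [lvl for lvl in levels if lvl > ilvl]:
            del levels[deeper]
        out.append('%s%d. %s' % ('\t' * ilvl, levels[ilvl], text))
    return '\n'.join(out)

# Hand-written sample rows, not taken from a real document
rows = [
    (None, None, 'Intro'),
    (1, 0, 'A'),
    (1, 1, 'a1'),
    (1, 1, 'a2'),
    (1, 0, 'B'),
    (1, 1, 'b1'),
]
numbered = number_items(rows)
```

Keeping the counters in a dictionary means lists with different numId values are numbered independently, even when their paragraphs are interleaved.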

Can you upload an example file somewhere?

ShayHill • Jul 18 '19 16:07

Can you upload an example file somewhere?

https://github.com/goshulina/docx2txt/blob/master/app.py

goshulina • Oct 25 '19 19:10

Thank you. I'm looking for an example of an enumerated list document that failed. I'd like to make sure my own module can handle such documents.

ShayHill • Oct 27 '19 14:10