pdf2docx Handle index error in paragraphs

Some documents can't be processed page by page because of an index error, and the affected pages come out blank. This small fix handles the exception so that pages are extracted as expected. I'm not sure, though, whether it's best to skip the section (continue) or to take the last paragraph instead of the one before it (which is what I did). If you prefer the first option, please let me know.
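
For illustration only, the two options could be sketched as below. This is not the actual diff: the function name `paragraph_before_section_break` and the `skip_on_error` flag are hypothetical, and a plain Python sequence stands in for `doc.paragraphs`.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def paragraph_before_section_break(
    paragraphs: Sequence[T], skip_on_error: bool = False
) -> Optional[T]:
    """Return the paragraph before the trailing one, tolerating short lists.

    Hypothetical helper: `paragraphs` stands in for `doc.paragraphs`.
    """
    try:
        # Normal case: the second-to-last paragraph.
        return paragraphs[-2]
    except IndexError:
        # Fewer than two paragraphs are available.
        if skip_on_error or not paragraphs:
            # Option 1: signal the caller to skip this section (continue).
            return None
        # Option 2 (what the PR describes): fall back to the last paragraph.
        return paragraphs[-1]
```

Returning `None` lets the caller decide to `continue`, while the default mirrors the fallback described above.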

Jul 18 '22 12:07 kcho-mirato

Hi @kcho-mirato , many thanks for your pull request.

I don't quite understand why doc.paragraphs[-2] would raise an index error: at this point we have at least 2 sections, where doc.paragraphs[-1] is a virtual paragraph (the section break before the current section) and doc.paragraphs[-2] is the last paragraph of the previous section.
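
A minimal sketch of that assumption, using plain lists of strings in place of real paragraph objects (the `SECTION_BREAK` marker and the function name are hypothetical stand-ins): the lookup works when a paragraph precedes the section break and fails only when the list holds fewer than two entries.

```python
# Hypothetical stand-in for the virtual paragraph that carries the section break.
SECTION_BREAK = "<section break>"

# Expected layout: real content, then the virtual section-break paragraph.
expected = ["last paragraph of section 1", SECTION_BREAK]

# Layout that would break the assumption: nothing precedes the section break.
unexpected = [SECTION_BREAK]


def last_paragraph_of_previous_section(paragraphs):
    """Index the way the discussion describes; raises IndexError on short lists."""
    return paragraphs[-2]
```

With `expected` the call returns the first entry; with `unexpected` it raises `IndexError`, which is the situation the pull request guards against.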

Could you please provide the sample file that causes the index error, so I can look into it and check what I missed?

Once the index error is confirmed, I'd prefer your fix as well.

Jul 19 '22 14:07 dothinking

Hi @dothinking , sorry for the late reply. I'm afraid the sample file is confidential, but I will try to figure something out; if we can strip the data while preserving the structure, I'll let you know.

Oct 06 '22 07:10 kcho-mirato

@kcho-mirato no problem. Thanks!

Oct 06 '22 13:10 dothinking

Feel free to reopen it for further discussion.

Jan 13 '24 12:01 dothinking
