pdf2docx

Handle index error in paragraphs

Open kcho-mirato opened this issue 3 years ago • 1 comments

Some documents can't be processed page by page because of an index error, so the resulting pages are blank. This small fix handles the exception, and pages are extracted as expected. I'm not sure, though, whether it's best to skip the section (continue) or to take the last paragraph instead of the one before it (what I did). If you prefer the first option, please let me know.
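For illustration, the two options can be sketched as below. This is a minimal sketch and not the actual patch: `paragraphs` stands in for `doc.paragraphs` (a plain list), and both function names are hypothetical.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def last_paragraph_of_previous_section(paragraphs: Sequence[T]) -> Optional[T]:
    """Option taken in this fix: prefer paragraphs[-2], fall back to paragraphs[-1].

    Hypothetical helper; returns None only when there are no paragraphs at all.
    """
    try:
        return paragraphs[-2]
    except IndexError:
        # Fewer than two paragraphs: use the last one if it exists.
        return paragraphs[-1] if paragraphs else None


def last_paragraph_or_skip(paragraphs: Sequence[T]) -> Optional[T]:
    """Alternative option: return None so the caller can skip the section."""
    try:
        return paragraphs[-2]
    except IndexError:
        return None
```

With the second variant, the calling loop would check for `None` and `continue`, which corresponds to skipping the section.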

kcho-mirato avatar Jul 18 '22 12:07 kcho-mirato

Hi @kcho-mirato , many thanks for your pull request.

I don't quite understand why doc.paragraphs[-2] might raise an index error: at this point we have at least 2 sections, where doc.paragraphs[-1] is a virtual paragraph (the section break before the current section) and doc.paragraphs[-2] is the last paragraph of the previous section.

Could you please provide the sample file that causes the index error, so I can look into it and check what I missed?

When the index error is confirmed, I'd prefer your fix as well.

dothinking avatar Jul 19 '22 14:07 dothinking

Hi @dothinking, sorry for the late reply. I'm afraid the sample file is confidential, but I will try to figure something out. If we can strip the data so that the structure is preserved, I'll let you know.

kcho-mirato avatar Oct 06 '22 07:10 kcho-mirato

@kcho-mirato no problem. Thanks!

dothinking avatar Oct 06 '22 13:10 dothinking

Feel free to reopen it for further discussion.

dothinking avatar Jan 13 '24 12:01 dothinking