pdf2docx
Handle index error in paragraphs
Some documents can't be processed page by page due to an index error. As a result, the pages come out blank.
This small fix handles the exception, and pages are now extracted as expected. I'm not sure, though, whether it's best to skip the section (continue) or to take the last paragraph instead of the one before it (what I did).
If you prefer the first option, please let me know.
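The two options mentioned above can be sketched roughly as follows. This is only an illustration, not the actual pdf2docx code: the function name `pick_previous_paragraph` and the `skip_if_short` flag are hypothetical, and a plain Python list stands in for `doc.paragraphs`.

```python
def pick_previous_paragraph(paragraphs, skip_if_short=False):
    """Return the second-to-last paragraph, with a fallback on IndexError.

    Hypothetical helper: `paragraphs` is any list-like sequence standing in
    for doc.paragraphs. It is not part of pdf2docx or python-docx.
    """
    try:
        # Normal case: [-1] is the section-break paragraph,
        # [-2] is the last paragraph of the previous section.
        return paragraphs[-2]
    except IndexError:
        # Fewer than two paragraphs are available.
        if skip_if_short or not paragraphs:
            # Option 1: skip this section (the caller would `continue`).
            return None
        # Option 2: fall back to the last paragraph instead.
        return paragraphs[-1]


# Usage with plain strings in place of paragraph objects.
print(pick_previous_paragraph(["first", "second", "break"]))
print(pick_previous_paragraph(["only"]))
print(pick_previous_paragraph(["only"], skip_if_short=True))
```

With `skip_if_short=True` the caller gets `None` and can skip the section; otherwise the last paragraph is used, which is the behaviour this pull request chose.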
Hi @kcho-mirato , many thanks for your pull request.
I don't quite understand why doc.paragraphs[-2] might raise an index error, because at this point we have at least 2 sections, where doc.paragraphs[-1] is a virtual paragraph (the section break before the current section), while doc.paragraphs[-2] is the last paragraph in the previous section.
Could you please provide the sample file causing the index error, so I can look into it and check what I missed?
When the index error is confirmed, I'd prefer your fix as well.
Hi @dothinking , sorry for the late reply. I'm afraid the sample file is confidential, but I will try to figure something out; if we can strip the data while preserving the structure, I'll let you know.
@kcho-mirato no problem. Thanks!
Feel free to reopen it for further discussion.