Jeremy Singer-Vine
Hi @o10baird, and thanks for providing the PDF, which is very helpful. What seems to be happening here is that the PDF contains a bunch of explicit _whitespace characters_ (rather...
Thank you for raising this issue. Please try updating to the latest version of `pdfplumber`. Do you still encounter the problem? If so, can you share a fully-reproducible script?
Hi @dalinautoagents, those code snippets reference variables that are defined elsewhere and also mix image-related processing with other logic, which makes the problem hard to reproduce. Could you create a simplified Python script that...
Thanks for opening this issue, @gnadlr; and thanks for your contributions to the related discussion and other recent ones, @cmdlineluser! Some observations: - 'aaaa bbbb' and '1111' do not seem...
> * Your issue here and the sample PDF also helped me to diagnose a bug in the way `pdfplumber` handles `use_text_flow=True`. Hoping to push a fix for this soon,...
> From what I can find, the clipping commands are currently no-ops in pdfminer: https://github.com/pdfminer/pdfminer.six/issues/414 - I'm not sure if this is something that needs to be supported in order...
Ah, I see; this is a good motivation for me to write more comprehensive documentation about how word segmentation works in pdfplumber. Until then: - With the default parameters, pdfplumber...
> This is correctly what I was trying to say (though it should be `next_char["x0"] > curr_char["x1"] + x_tolerance`?). Thanks! Updated the comment to fix that. > If detection of...
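To make the corrected comparison concrete, here is a minimal sketch of that gap test in plain Python. It is an illustration only, not pdfplumber's actual implementation: the function name `split_into_words` is hypothetical, the characters are hand-written dicts rather than real `page.chars` entries, and it assumes a single left-to-right line, ignoring vertical tolerance and explicit whitespace characters.

```python
def split_into_words(chars, x_tolerance=3):
    """Group characters (dicts with "text", "x0", "x1") into words.

    Hypothetical helper for illustration: a new word starts whenever
    next_char["x0"] > curr_char["x1"] + x_tolerance.
    """
    words = []
    current = []
    # Assumes one horizontal line of text, read left to right.
    for next_char in sorted(chars, key=lambda c: c["x0"]):
        if current:
            curr_char = current[-1]
            if next_char["x0"] > curr_char["x1"] + x_tolerance:
                # Gap is wider than the tolerance: close the current word.
                words.append("".join(c["text"] for c in current))
                current = []
        current.append(next_char)
    if current:
        words.append("".join(c["text"] for c in current))
    return words


# Made-up sample data, not taken from any PDF in this discussion.
sample_chars = [
    {"text": "a", "x0": 0, "x1": 5},
    {"text": "b", "x0": 6, "x1": 11},   # gap of 1: stays in the same word
    {"text": "c", "x0": 20, "x1": 25},  # gap of 9: starts a new word
]

print(split_into_words(sample_chars))
```

Raising `x_tolerance` makes the splitter more willing to merge characters: with `x_tolerance=10`, the gap of 9 before `c` no longer exceeds the tolerance, so all three characters end up in one word.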
Really interesting, thanks for sharing @cmdlineluser. I think you're right about those [layers being created by marked-content commands](https://mupdf.readthedocs.io/en/latest/search.html?q=layer&check_keywords=yes&area=default). As it happens @dhdaines is doing some experimentation with extracting those sections...
Thanks for the notes, @dhdaines. Thoughts/responses below: > This can be problematic because marked content section boundaries can show up just about anywhere - take [this PDF](https://ville.sainte-adele.qc.ca/upload/documents/RGL-1310-Identification-Maple-Leaf-adoption.pdf) for example, running:...