David Huggins-Daines

Results: 360 comments by David Huggins-Daines

The test failure above is some random GitHub failure, not an actual problem with this PR.

Okay, should be good with plain setuptools now! I can't test the GitHub workflows though...

Same problem as #1036 - again, try to copy and paste text out of the file and you will see that the mappings are just nonsense.

Oh, it could be that pdfminer has an old or broken version of `UniKS-UTF16-H` encoding - @mk-docenty can you try copying/pasting from Adobe Acrobat? (I just tried poppler and Chrome,...

Also @nnurmano this is the first I have heard of llamaparse. It appears to maybe be proprietary? Do you know what they are actually using to extract text from PDF?

Some more digging in that PDF - the `UniKS-UTF16-H` encoding is only used by the font `MalgunGothic`, which is only used in the Form XObject on the first page. FontForge...

> I am writing a tool that reconstructs CID ToUnicode mapping. Verify.
>
> [2023_._.9._.-7-12.reconcstructed.pdf](https://github.com/user-attachments/files/22302148/2023_._.9._.-7-12.reconcstructed.pdf)

not fixed here either

What is the expected behaviour here? If the shape was painted with the even-odd rule in the PDF, then `evenodd` will be set on all of its subpaths. This seems...

On further investigation it appears that this is related to https://github.com/jsvine/pdfplumber/issues/1057, which is related to #861 and #963. I'm still not quite sure what the expected behaviour should be, though....

> The porous shape in the example is actually a full path, but it's split into multiple LTCurve shapes for rectangular detection, which I guess is what caused the problem....