David Huggins-Daines comments

Results: 360 comments of David Huggins-Daines

Use PLAYA instead of pdfminer

> > Really neat to see you developing this so rapidly, and great to hear about that speedup.
>
> Thanks! I keep on finding interesting bugs in pdfminer.six, unfortunately......

Image Extraction incomplete

To second @jsvine's comment - an "image" in a PDF is very often not what you think it is, since PDF readers are also compositors. The only reliable way...

Update version of `pdfminer-six` to `20240706`

> There seems to be a bug in the latest release — [pdfminer/pdfminer.six#1004](https://github.com/pdfminer/pdfminer.six/issues/1004) — which also happens to be throwing errors in `pdfplumber`'s test suite. I'll keep an eye out...

Update version of `pdfminer-six` to `20240706`

You can use [PAVÉS](https://github.com/dhdaines/paves) now, it is mostly a drop-in replacement for pdfminer, except that it fixes a bunch of problems and is also somewhat faster. Can you try this...

Update version of `pdfminer-six` to `20240706`

> I previously installed **unstructured-ingest** project (which has gone for-profit now) with the hope to mine PDFs of court decisions which have caption boxes using ASCII: that makes for problematic...

original_path extraction error regarding LTCurve

See discussion at pdfminer above. The issue is that pdfminer doesn't apply any fill rules in layout analysis. Ideally, you should be looking at the `fill` attribute, not the `evenodd`...

[Bug]: Scan time regression in 16.4.3 with `--redo-ocr`

> I'm not inclined to lower BUFSIZ, because it is the maximum value to read, and the token splitting issue is much more painful (wrong behavior and in bizarre, complex...

[Bug]: Scan time regression in 16.4.3 with `--redo-ocr`

I really like the pure-Python-ness of pdfminer.six but it has a *lot* of quirks, some of which have to be worked around in less-than-robust ways. For its main use case...

[Bug]: Scan time regression in 16.4.3 with `--redo-ocr`

> The issue is that pdfminer discards its buffer and then refills it every time `seek` is called ([source](https://github.com/pdfminer/pdfminer.six/blob/1a8bd2f730295b31d6165e4d95fcb5a03793c978/pdfminer/psparser.py#L200)) even if the seek target is within the previous buffer....
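
The behaviour described in the quoted comment can be illustrated with a minimal sketch of the alternative: a reader that keeps its buffer when the seek target already lies inside it. This is not pdfminer.six code. The `BufferedSeeker` class, its `refills` counter, and the default `bufsize` are hypothetical names chosen only for illustration.

```python
import io


class BufferedSeeker:
    """Hypothetical buffered reader (not pdfminer.six's implementation)."""

    def __init__(self, fp, bufsize=4096):
        self.fp = fp
        self.bufsize = bufsize
        self.bufpos = 0    # file offset of the first byte in self.buf
        self.buf = b""     # bytes currently held in memory
        self.charpos = 0   # read position, as an index into self.buf
        self.refills = 0   # how many times the buffer was reloaded

    def _fill(self, pos):
        # Reload the buffer from the underlying file, starting at pos.
        self.fp.seek(pos)
        self.bufpos = pos
        self.buf = self.fp.read(self.bufsize)
        self.charpos = 0
        self.refills += 1

    def seek(self, pos):
        # If the target is inside the current buffer, only move the index.
        if self.bufpos <= pos < self.bufpos + len(self.buf):
            self.charpos = pos - self.bufpos
        else:
            self._fill(pos)

    def read(self, n):
        # Return up to n bytes, refilling only when the buffer is exhausted.
        out = bytearray()
        while n > 0:
            if self.charpos >= len(self.buf):
                self._fill(self.bufpos + len(self.buf))
                if not self.buf:
                    break  # end of file
            chunk = self.buf[self.charpos:self.charpos + n]
            out += chunk
            self.charpos += len(chunk)
            n -= len(chunk)
        return bytes(out)
```

With a reader like this, a `seek` to an offset already covered by the buffer leaves `refills` unchanged, while a `seek` outside it triggers exactly one reload. A reader that discards its buffer on every `seek` would pay for a reload each time.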

[Bug]: Scan time regression in 16.4.3 with `--redo-ocr`

> * I believe the changes in your PR are sufficient as it is. Increasing BUFSIZ is a temporary workaround that should be removed once your PR is merged. Ah,...