Ali MJ Al-Nasrawy

Results 27 comments of Ali MJ Al-Nasrawy

For testing, I uploaded these files to Drive:

Test1.bin
- Size: 4MiB
- MD5: dfd8e64a8ee7e7015b89f232bfd3e254
- Download URL: https://docs.google.com/uc?export=download&id=1cYtlZ92wp4sDrVgcEHLkJxCIme_yHzyO
- Browser URL: https://drive.google.com/open?id=1cYtlZ92wp4sDrVgcEHLkJxCIme_yHzyO

Test2.bin
- Size: 42MiB
- MD5: 78b9539b59f84a687b686ca6787aec57
- Download URL: https://docs.google.com/uc?export=download&id=1FrlXvDpR0sV8wvo_DlalbJlwnVrB0s8x
- Browser URL: https://drive.google.com/open?id=1FrlXvDpR0sV8wvo_DlalbJlwnVrB0s8x

To provide some context, this issue, along with #1378, resulted in a poor initial user experience for me. When working with a 1200-page file, the scan phase alone displayed an...

> Hmm. This is interesting - would my proposal to discard the lossy buffering entirely and simply read the entire stream into memory help or hurt in this case? Hmm,...

I don't think this got fixed! Even after [f77f701](https://github.com/ocrmypdf/OCRmyPDF/commit/f77f701a50d23d344ee7e0cb8ad04a53b5a1d903), the test file takes more than 3 hours to scan and, when reverting the [regressed commit](https://github.com/ocrmypdf/OCRmyPDF/commit/d35d00880687177c3d8de6371738e2d50bddda47), it takes...

Here is my patch; it passed all tests locally.

```diff
diff --git a/src/ocrmypdf/pdfinfo/layout.py b/src/ocrmypdf/pdfinfo/layout.py
index 83d5d32b..fbef6a0d 100644
--- a/src/ocrmypdf/pdfinfo/layout.py
+++ b/src/ocrmypdf/pdfinfo/layout.py
@@ -288,6 +288,23 @@ def patch_pdfminer(pscript5_mode: bool):
     else:
         yield...
```

For testing, here is a file with 10k blank pages: [blank-10k.pdf](https://github.com/user-attachments/files/16637412/blank-10k.pdf). Real-world files need far fewer pages for this bug to become noticeable. The test was done under `v16.4.2`, because...

> I see a few issues with the approach taken in the patch. In particular, the opened file handle is not explicitly closed appropriately - this could be problematic for...
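
As a generic illustration of the concern raised in the quote (not the actual patch, whose contents are truncated here), a context manager can guarantee that a file handle it opens is closed even when the wrapped code raises. The name `opened_for_patch` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def opened_for_patch(path, mode="rb"):
    """Hypothetical helper: open a file and guarantee it is closed on exit."""
    handle = open(path, mode)
    try:
        # Hand the open handle to the body of the `with` block.
        yield handle
    finally:
        # Runs on normal exit and when the body raises.
        handle.close()
```

Used as `with opened_for_patch(some_path) as f: ...`, the handle is closed as soon as the block exits, instead of being left to the garbage collector.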