Regression: file not found on MacOS when opening /tmp file

Open yonran opened this issue 1 year ago • 2 comments

Starting with commit 05398d6c593893c4ee9706002218354558513e9a (1.84.0), leptonica on macOS (darwin) gives an error when opening a file in /tmp. Also, the error message from fopenReadStream does not give the actual path that it tried to open, only the file name. For example, here is a program (based on tesseract.cpp):

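#include <stdio.h>        /* fprintf, stderr */
#include <allheaders.h>   /* leptonica: pixRead, pixDestroy */

int main(int argc, char* argv[]) {
    /* Path under /tmp that triggers the failure on macOS. */
    const char* image = "/tmp/ocrmypdf.io.uss3ldn7/000011_ocr.png";
    struct Pix *pixs = pixRead(image);
    if (!pixs) {
        fprintf(stderr, "Leptonica can't process input file: %s\n", image);
        return 2;
    }
    pixDestroy(&pixs);    /* release the image on success */
    return 0;
}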

It gives this output:

Leptonica Error in fopenReadStream: file not found: 000011_ocr.png
Leptonica Error in pixRead: image file not found: /tmp/ocrmypdf.io.uss3ldn7/000011_ocr.png
Leptonica can't process input file: /tmp/ocrmypdf.io.uss3ldn7/000011_ocr.png

This affects ocrmypdf when TMPDIR=/tmp, which uses tesseract, which calls leptonica:

nix-shell -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/4b8e9717fac859f830fa318a0cc1e2d4a40df152.tar.gz -p ocrmypdf --run 'ocrmypdf --redo-ocr --verbose=1 --keep-temporary-files ~/Downloads/20231017_TransferTaxExemptionMeasure.pdf ~/Downloads/20231017_TransferTaxExemptionMeasure-ocr.pdf'
…
    1 Running: ['/nix/store/pgz54swxlbxc2lxx23ramcfz099v7n6z-tesseract-5.3.3/bin/tesseract', '-l', 'eng', '-c', 'textonly_pdf=1',   __init__.py:134
'/tmp/ocrmypdf.io.xu77l3_5/000001_ocr.png', '/tmp/ocrmypdf.io.xu77l3_5/000001_ocr_tess', 'pdf', 'txt']                                             
    1  Leptonica Error in fopenReadStream: file not found: 000001_ocr.png                                                          tesseract.py:252
    1  Leptonica Error in findFileFormat: image file not found: /tmp/ocrmypdf.io.xu77l3_5/000001_ocr.png                           tesseract.py:252
    1  Leptonica Error in fopenReadStream: file not found: PNG                                                                     tesseract.py:252
    1  Leptonica Error in pixRead: image file not found: PNG                                                                       tesseract.py:252

(Note: https://github.com/NixOS/nixpkgs/commit/4b8e9717fac859f830fa318a0cc1e2d4a40df152 is the first commit that contains both https://github.com/NixOS/nixpkgs/commit/628b90b5ad0a526dba2daeb17d07ce248f0c5275 and a fix for an unrelated error, “Abort trap: 6 mutool -v”: https://github.com/NixOS/nixpkgs/commit/11498aed21cfdc45e93d8243e6458d8883d45214 )

Workaround: Set TMPDIR=/private/tmp instead of /tmp before invoking ocrmypdf. (On macOS, /tmp is a symlink to /private/tmp.)
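
For programs that call leptonica directly, the same idea can be sketched in code: resolve the path to its canonical form before handing it to pixRead, so that a /tmp path becomes a /private/tmp path on macOS. This is only a sketch. The helper name canonical_path is hypothetical and not part of leptonica, and it is an untested assumption that the resolved path avoids the failure in the same way the TMPDIR workaround does.

```c
#define _XOPEN_SOURCE 700   /* expose realpath() under strict -std= modes */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of leptonica): resolve symlinks such as
 * macOS's /tmp -> /private/tmp so the caller can pass a canonical path.
 * Returns a malloc'd string that the caller must free(), or NULL if the
 * path is NULL or cannot be resolved (e.g. the file does not exist). */
static char *canonical_path(const char *path) {
    if (path == NULL)
        return NULL;
    /* POSIX.1-2008: a NULL second argument makes realpath() allocate. */
    return realpath(path, NULL);
}

/* Usage sketch with leptonica (assumes <allheaders.h> is included):
 *
 *     char *resolved = canonical_path(image);
 *     struct Pix *pixs = pixRead(resolved ? resolved : image);
 *     free(resolved);
 */
```

Falling back to the original string when resolution fails keeps the existing behavior, including leptonica's own error reporting, for files that really are missing.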

yonran avatar Feb 17 '24 08:02 yonran

@stweil

I remember a recent proposal to allow TMPDIR path rewrites for MacOS, but I believe it was shelved. This has been an issue for quite a while. We solved it for Windows by allowing path rewrites and universally using genPathname() and fopenReadStream(). These packaging issues are of course well above my pay grade.

Yonathan also points out that fopenReadStream() is not giving the path when it can't open the file locally. We can give more information at that failure point; e.g. replace line 1896 by

        lept_stderr("Failed in %s to open locally with tail %s " 
                    "for filename %s\n", __func__, tail, filename);

DanBloomberg avatar Feb 18 '24 08:02 DanBloomberg

Oops, one should always use the error macros for error messages, not lept_stderr:

        L_ERROR("failed to open locally with tail %s for filename %s\n",
                __func__, tail, filename);

DanBloomberg avatar Feb 18 '24 18:02 DanBloomberg
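
To illustrate the failure point discussed above, here is a simplified, standalone sketch of an open-with-fallback routine that reports both the tail and the full filename when the local open fails. It is not leptonica's actual fopenReadStream(): the function name open_read_with_fallback is hypothetical, the path handling is reduced to splitting at the last '/', and it prints with plain fprintf rather than the L_ERROR macro.

```c
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for the fallback discussed in this thread; NOT
 * leptonica's fopenReadStream(). Tries the full path first, then the bare
 * file name ("tail") in the current directory. If both fail, the message
 * names the tail and the original filename so the caller can see which
 * path was requested. */
static FILE *open_read_with_fallback(const char *filename) {
    if (filename == NULL)
        return NULL;

    FILE *fp = fopen(filename, "rb");
    if (fp)
        return fp;

    /* Tail = text after the last '/', or the whole name if there is none. */
    const char *slash = strrchr(filename, '/');
    const char *tail = slash ? slash + 1 : filename;

    fp = fopen(tail, "rb");
    if (!fp)
        fprintf(stderr, "Error in %s: failed to open locally with tail %s "
                        "for filename %s\n", __func__, tail, filename);
    return fp;
}
```

With a message of this shape, the first line of the output in the original report would show the full /tmp/... path next to the file name instead of the file name alone.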