leptonica
Regression: file not found on MacOS when opening /tmp file
Starting with commit 05398d6c593893c4ee9706002218354558513e9a (1.84.0), on macOS (darwin), leptonica gives an error when opening a file in /tmp. Also, the first error message does not give the actual path that it tried to open, only the file name. For example, here is a program (based on tesseract.cpp):
#include <allheaders.h>

int main(int argc, char* argv[]) {
    const char* image = "/tmp/ocrmypdf.io.uss3ldn7/000011_ocr.png";
    struct Pix *pixs = pixRead(image);
    if (!pixs) {
        fprintf(stderr, "Leptonica can't process input file: %s\n", image);
        return 2;
    }
    return 0;
}
It gives this output:
Leptonica Error in fopenReadStream: file not found: 000011_ocr.png
Leptonica Error in pixRead: image file not found: /tmp/ocrmypdf.io.uss3ldn7/000011_ocr.png
Leptonica can't process input file: /tmp/ocrmypdf.io.uss3ldn7/000011_ocr.png
This affects ocrmypdf when TMPDIR=/tmp; ocrmypdf uses tesseract, which calls leptonica:
nix-shell -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/4b8e9717fac859f830fa318a0cc1e2d4a40df152.tar.gz -p ocrmypdf --run 'ocrmypdf --redo-ocr --verbose=1 --keep-temporary-files ~/Downloads/20231017_TransferTaxExemptionMeasure.pdf ~/Downloads/20231017_TransferTaxExemptionMeasure-ocr.pdf'
…
1 Running: ['/nix/store/pgz54swxlbxc2lxx23ramcfz099v7n6z-tesseract-5.3.3/bin/tesseract', '-l', 'eng', '-c', 'textonly_pdf=1', __init__.py:134
'/tmp/ocrmypdf.io.xu77l3_5/000001_ocr.png', '/tmp/ocrmypdf.io.xu77l3_5/000001_ocr_tess', 'pdf', 'txt']
1 Leptonica Error in fopenReadStream: file not found: 000001_ocr.png tesseract.py:252
1 Leptonica Error in findFileFormat: image file not found: /tmp/ocrmypdf.io.xu77l3_5/000001_ocr.png tesseract.py:252
1 Leptonica Error in fopenReadStream: file not found: PNG tesseract.py:252
1 Leptonica Error in pixRead: image file not found: PNG tesseract.py:252
(Note: https://github.com/NixOS/nixpkgs/commit/4b8e9717fac859f830fa318a0cc1e2d4a40df152 is the first commit that contains both https://github.com/NixOS/nixpkgs/commit/628b90b5ad0a526dba2daeb17d07ce248f0c5275 and a fix for an unrelated error, “Abort trap: 6 mutool -v”: https://github.com/NixOS/nixpkgs/commit/11498aed21cfdc45e93d8243e6458d8883d45214 )
Workaround: Set TMPDIR=/private/tmp instead of /tmp before invoking ocrmypdf
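For callers that cannot change TMPDIR, a rough application-side analogue might be to resolve symlinks before handing the path to leptonica, since on macOS /tmp is a symlink to /private/tmp. The sketch below uses only POSIX realpath(); the helper name canonical_path is hypothetical and not part of leptonica, and I have not verified that this avoids the error described above.

```c
#define _XOPEN_SOURCE 700   /* expose realpath() under strict -std=c11 */
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of leptonica): resolve symlinks in `path`
 * and copy the canonical form into `out`.
 * Returns 0 on success, -1 if the path cannot be resolved or does not fit. */
static int canonical_path(const char *path, char *out, size_t outsize)
{
    char resolved[PATH_MAX];

    if (!path || !out)
        return -1;
    if (!realpath(path, resolved))   /* fails if the file does not exist */
        return -1;
    if (strlen(resolved) >= outsize)
        return -1;
    strcpy(out, resolved);
    return 0;
}
```

A caller could pass the resolved string to pixRead() in place of the original; on macOS this should turn a /tmp/... path into /private/tmp/..., which mirrors the TMPDIR workaround at the level of a single path.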
@stweil
I remember a recent proposal to allow TMPDIR path rewrites for MacOS, but I believe it was shelved. This has been an issue for quite a while. We solved it for Windows by allowing path rewrites and universally using genPathname() and fopenReadStream(). These packaging issues are of course well above my pay grade.
Yonathan also points out that fopenReadStream() does not report the full path when it fails to open the file locally. We can give more information at that failure point; e.g., replace line 1896 with
lept_stderr("Failed in %s to open locally with tail %s "
"for filename %s\n", __func__, tail, filename);
Oops, one should always use the error macros for error messages, not lept_stderr:
L_ERROR("failed to open locally with tail %s for filename %s\n",
__func__, tail, filename);
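To make the failure point concrete, here is a stand-alone sketch of the lookup order that the error output suggests: try the path as given, then strip the directory and try the bare file name in the current directory, reporting both the tail and the full name if that also fails. The function open_with_fallback is hypothetical and written in plain C; it is not leptonica's fopenReadStream() and skips any path rewriting that leptonica may do.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the lookup order discussed above; this is not
 * leptonica's fopenReadStream(). Returns an open stream or NULL. */
static FILE *open_with_fallback(const char *filename)
{
    const char *slash, *tail;
    FILE *fp;

    if (!filename)
        return NULL;

    /* First attempt: the path exactly as given. */
    fp = fopen(filename, "rb");
    if (fp)
        return fp;

    /* Second attempt: strip the directory and look in the current directory. */
    slash = strrchr(filename, '/');
    tail = slash ? slash + 1 : filename;
    fp = fopen(tail, "rb");
    if (!fp) {
        /* Report both the tail and the full name, as in the proposed message. */
        fprintf(stderr,
                "Error in %s: failed to open locally with tail %s "
                "for filename %s\n", __func__, tail, filename);
    }
    return fp;
}
```

With a message of this shape, the report for the example at the top of this issue would have shown the full /tmp/... path next to 000011_ocr.png, rather than the file name alone.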