Test suite fails without external image libraries
I am building Leptonica for use in a software system where image I/O and compression are handled by other components. Therefore, I configure with
--without-zlib
--without-libpng
--without-jpeg
--without-giflib
--without-libtiff
--without-libwebp
--without-libopenjpeg
When I build in this manner, however, make check gives rather discouraging results:
FAIL: adaptmap_reg
FAIL: adaptnorm_reg
FAIL: affine_reg
FAIL: alphaops_reg
FAIL: alphaxform_reg
SKIP: baseline_reg
FAIL: bilateral2_reg
FAIL: bilinear_reg
FAIL: binarize_reg
[...]
============================================================================
Testsuite summary for leptonica 1.78.0
============================================================================
# TOTAL: 125
# PASS: 3
# SKIP: 18
# XFAIL: 0
# FAIL: 104
# XPASS: 0
# ERROR: 0
The adaptmap_reg.log file begins with
////////////////////////////////////////////////
//////////////// adaptmap_reg ///////////////
////////////////////////////////////////////////
leptonica-1.78.0 : (null)
Error in pixReadStreamJpeg: function not present
Error in pixReadStream: jpeg: no pix returned
Error in pixRead: pix not read
Error in pixConvertRGBToGray: pixs not defined
Error in pixaAddPix: pix not defined
[...]
I think it would make more sense for the image files that are used in the library's test suite to be distributed in a format that the library can read without external dependencies, such as bmp or pnm. (Of course, this would not apply in instances where I/O involving formats that require external dependencies is the functionality being tested.)
If I were to have all the image files that are used for testing in uncompressed format, they would take up about 15 times the current disk space, which I've been careful to keep down to about 10 MB. The approx. 200 jpg, png and tiff compressed image files would balloon to an estimated 150 MB.
As I understand, bmp format still does RLE, and the image files would be compressed within the distribution tarball (or .zip file) in any event. Would that improve the situation much?
Another possibility: add a set of Make rules that use some external tool (e.g. ImageMagick's convert utility) to convert the image files into bmp/pnm, combined with a fallback in the code to load in such a format if the normal jpg/png/tiff support is unavailable.
It doesn't make much sense to me.
It's hard to imagine a user who wants to run the test suite, but doesn't have the standard image i/o libraries. I have deliberately limited as far as possible the dependencies of leptonica on external libraries, but virtually all users will have png, gz, jpeg and tiff. I consider webp, jp2k and gif optional -- webp and jp2k because they are not yet widely used, and gif because in every way it is inferior to png.
You can make leptonica without libraries, and the I/O gets appropriately stubbed out.
I implemented the original (most basic) bmp utility, which has no compression. It's there because people sometimes use it, like pnm, but there is nothing else to justify putting effort into it.
One could rely on gzip or bzip2 to compress the tarball, but when it is uncompressed, you have a big expansion of files in the directory. Not a nice thing to do to the users.
And then the entire test suite would need to be converted to use these uncompressed files.
-- Dan
The problem is not that those libraries are unavailable. (Although they could be, if you are building in a more unusual environment, like an embedded system.) The problem is that if you are preparing a build that deliberately does not use those libraries, then you don't have a way of validating that build. The image I/O is incidental to most of the tests, so in theory, having the images in a different format does not defeat their purpose.
(Of course, I could perform a separate build of the same source, with the external libraries enabled, and run the test suite using that. However, this doesn't actually validate the build that would be used, and could potentially let slip some unexpected error condition that results in part due to missing external libraries.)
How about the alternative option, then? I can provide a set of makefile rules that a user can invoke to automatically convert all the jpeg/png/tiff files into bmp/pnm. All that would be needed in the code, then, is that whenever an attempt to load e.g. foo.png fails, it then falls back to attempting foo.bmp (or pnm). This would leave the majority of users unaffected, while still allowing special-purpose builds to be run through the test suite.
It's too complicated. We support building using my hand-made makefiles, autoconf and cmake. How many libraries do you know that offer multiple ways to build?
You can build and run the tests using either the static makefiles or autoconf. cmake does not yet have the ability to run the tests after building (it's an old issue waiting implementation).
I can pretty well assure you that we're not going to complicate the github distro.
I'm not sure I understand where you see the complexity coming from. The use of multiple build systems would only suggest a different approach; it's not a showstopper. I will sketch out an idea of this below to illustrate.
(Note that I can currently build and run the tests, as shown above, but nearly all the tests fail as they cannot load the test images. As I mentioned previously, I could produce a separate build of the tests, with the external libraries, and run those---but then that build is not usable to me, due to various legal and release-engineering issues. At present, the situation is that a Leptonica build that does not have external libraries enabled is un-testable; there is no way to validate it.)
Here is one way I see this working:

- A convert.sh shell script is added to the prog subdirectory. This script, when run inside its containing directory, calls some external tool (like ImageMagick's convert or some of the Netpbm utilities) to convert the various jpg, png and tiff files present into e.g. bmp format, under a different filename. (For example, foo.jpg results in a foo.jpg.bmp file.)
- A new function is defined, something like pixReadWithFallback(), with the same signature as pixRead(). If it cannot load foo.jpg for whatever reason, it then attempts foo.jpg.bmp, and then as a final measure foo.jpg.pnm. (A sketch of this is given below.)
  - This functionality could also be rolled into the existing pixRead(), with some mechanism to enable it at run-time, but I think you would likely prefer a separate/new function.
  - An additional convention may be used to make multi-page TIFF files available in fallback form; e.g. .0.bmp, .1.bmp, and so on.
- Tests are edited to use this new function instead of pixRead().
- A user in my situation would need only know to run the convert.sh script before invoking make check, which should then complete without issue.
The majority of users who enable the external libraries as a matter of course are unlikely to even notice the added support for this scenario.
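For concreteness, here is a rough sketch of what the fallback reader in the second step could look like (pixReadWithFallback() is not an existing leptonica function; the foo.ext.bmp / foo.ext.pnm naming just follows the convention described above):

#include <stdio.h>
#include "allheaders.h"

/* Sketch only: try the file as named, then the bmp and pnm fallbacks. */
PIX *
pixReadWithFallback(const char *filename)
{
char  buf[512];
PIX  *pix;

    if ((pix = pixRead(filename)) != NULL)
        return pix;
    snprintf(buf, sizeof(buf), "%s.bmp", filename);
    if ((pix = pixRead(buf)) != NULL)
        return pix;
    snprintf(buf, sizeof(buf), "%s.pnm", filename);
    return pixRead(buf);
}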
Pretty clever!
I don't want to put a fallback function into all the tests. However, as a compromise, suppose we make a script that converts those images files to bmp, but with the same name. Leptonica functions that read image data use the information in the header, not the filename extension, so a bmp file with a .png extension will be read properly as a bmp file.
Then we don't need to change the tests. You just run the script before the test.
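To illustrate the point (the filename here is hypothetical): if a bmp payload is stored under a .png name, pixRead() still detects and decodes it as bmp, and pixGetInputFormat() reports the format actually found in the header.

PIX *pix = pixRead("weasel.png");        /* file actually holds bmp data */
l_int32 fmt = pixGetInputFormat(pix);    /* reports IFF_BMP, not IFF_PNG */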
That is a useful property of pixRead(). I don't like the idea of a file with extension X actually containing content Y, but if the tests are run in an out-of-source-tree build, then the files with the "wrong" extension can be created as temporary artifacts of the build. All that would be needed is the conversion script (which I can provide), and ensuring that the tests load files from the build tree in preference to the source tree.
I'll do some experimentation on this, and get back to you in a few days.
Okay, I've put the converted files into place, but have encountered an issue:
$ ./adaptmap_reg
////////////////////////////////////////////////
//////////////// adaptmap_reg ///////////////
////////////////////////////////////////////////
leptonica-1.79.0 : (null)
Time for gray adaptmap gen: 0.008
Error in pixWriteStreamPng: function not present
Error in pixWrite: pix not written to stream
Error in findFileFormatStream: truncated file
Error in fopenReadStream: file not found
[...]
Do the tests need to write out intermediate files using an external format?
Ah yes, the write side of the issue.
Do the tests need to write out intermediate files using an external format?
Yes, they are designed that way. The full set of tests generates about 340 MB of image files in generate mode, half in /tmp/lept/regout/ and half in /tmp/lept/golden/. Then in compare mode, 170 MB of image files are written to /tmp/lept/regout/.
For example, the function regTestWritePixAndCheck() is useful because of its simplicity (you don't need to specify the temp file names). It compares files byte-for-byte. The decision about format for the files is made for maximum compression. If uncompressed files were used, we'd generate 5 or 6 GB of data! There are 1000 invocations of this function alone in the regression tests.
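For reference, a typical invocation in a regression test looks roughly like this (an illustrative fragment, not a quote from any particular test; rp is the L_REGPARAMS set up by regTestSetup()):

regTestSetup(argc, argv, &rp);
...
regTestWritePixAndCheck(rp, pix1, IFF_PNG);  /* auto-names the file, writes pix1,
                                                and compares byte-for-byte with the
                                                golden version */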
If instead of writing to file, we wrote to memory, we would have no files available to look at when the test fails.
It's reasonable that more temporary space would be needed when uncompressed formats are the only ones available. Could we conditionalize the tests to not request png format when it is not compiled in?
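To illustrate what I mean, each such test could do something like this (a sketch of the idea only; whether this lives in the tests or inside the writers is up to you):

#if defined(HAVE_LIBPNG)
    regTestWritePixAndCheck(rp, pix1, IFF_PNG);
#else
    regTestWritePixAndCheck(rp, pix1, IFF_BMP);   /* uncompressed fallback */
#endif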
You are persistent :-)
OK, I've done that. Need to handle png, jpeg and tiff_g4. Will push it out shortly.
Just trying to find a way to make this work :-)
Okay, I've pulled in your changes, and am seeing an increase in the number of tests passing. There are still a number of tests failing with "function not present":
$ grep -l 'function not present' *.log | fgrep -v test-suite.log | while read x; do echo $x:; grep 'function not present' $x | sort -u; echo; done
adaptmap_reg.log:
Error in pixWriteStreamJpeg: function not present
alphaops_reg.log:
Error in pixWriteStreamJpeg: function not present
Error in pixWriteStreamPng: function not present
baseline_reg.log:
Error in pixReadStreamPng: function not present
Error in pixWriteStreamPng: function not present
blend3_reg.log:
Error in pixWriteStreamJpeg: function not present
[...]
I'm getting better success with conversion to pnm, though in general there seem to be some issues with the way the image is encoded. One example is feyn.tif not working correctly when converted to what appears to be a semantically equivalent bmp file. I have to investigate this further to better describe the issue, but it seems like there may be some interesting corner cases here.
These aren't corner cases -- there are 211 invocations of pixWrite() in the *_reg.c regression tests.
I can solve most of them by putting those same pre-processor statements in pixWrite().
done. Also included all variants of tiff compression supported by leptonica.
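Presumably the added statements have roughly this shape (a sketch, not the actual diff; as noted further below, the change ultimately belongs in pixWriteStream()):

#if !defined(HAVE_LIBPNG)
    if (format == IFF_PNG)
        format = IFF_BMP;   /* fall back to a format needing no external library */
#endif  /* !HAVE_LIBPNG */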
Thanks, I'm seeing fewer failures (94 pass, 36 fail). I looked through a few of the latter:
- baseline_reg fails due to passing GPLOT_PNG to gplotSimple1()
- blend2_reg fails because it cannot find /tmp/lept/regout/blend2.14.png
- boxa2_reg fails due to boxaPlotSizes() passing GPLOT_PNG to gplotCreate()
- bytea_reg fails due to convertPdf() -> pixGenerateCIData() -> pixGenerateJpegData(). Is it feasible to have an uncompressed PDF encoding, for completeness? Else these PDF-related tests could be skipped when jpeg/zlib/etc. is unavailable.
- ccthin2_reg fails due to l_bootnum_gen3() calling zlibUncompress(). Probably another one to skip when zlib is unavailable.
- coloring_reg fails due to "pixa" fonts not working. Another probable skip (unless prog/fonts/chars-*.tif could be substituted).
- enhance_reg fails due to src/bmfdata.h containing TIFF data. Another skip.
When you print e.g. "Warning in pixWrite: png library missing; output bmp format", you may want to add a newline, because in the test output logs the following line of output starts immediately after "format".
Lastly, after seeing the new TIFF logic, may I suggest something like the following?

#define L_FORMAT_IS_TIFF(f) ((f) >= IFF_TIFF && (f) <= IFF_TIFF_ZIP)

Seems like that same OR-statement crops up a few times...
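For illustration, assuming the IFF_TIFF* values stay contiguous in the enum, a check such as

    if (format == IFF_TIFF || format == IFF_TIFF_PACKBITS || format == IFF_TIFF_RLE ||
        format == IFF_TIFF_G3 || format == IFF_TIFF_G4 || format == IFF_TIFF_LZW ||
        format == IFF_TIFF_ZIP)

would collapse to

    if (L_FORMAT_IS_TIFF(format))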
Yes, I forgot the newline, which should always end the message string in the L_* macros.
And thanks for the L_FORMAT_IS_TIFF macro.
Can one make a macro of macros; e.g., #define NO_EXTERNAL_IO_LIBS ???
A macro can certainly call other macros; what did you have in mind?
Looking over src/imageio.h, specifically the IFF_* enums, I was thinking... you already have a couple of symbolic entries in there (IFF_UNKNOWN, IFF_DEFAULT). What if there were one like IFF_BEST, that would normally be equivalent to IFF_PNG, but can become IFF_BMP when necessary? Something like this could help abstract away a lot of the png references in the code.

This could potentially be taken further by combining it with a "symbolic" file extension like .best, which when seen on a filename, is automatically changed to .png or .bmp as appropriate.
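A rough sketch of the run-time resolution (IFF_BEST and resolveBestFormat() are purely hypothetical; neither exists in leptonica):

/* Hypothetical: pick the best writable format given the compiled-in libs. */
static l_int32
resolveBestFormat(void)
{
#if defined(HAVE_LIBPNG)
    return IFF_PNG;   /* lossless and compact when libpng is present */
#else
    return IFF_BMP;   /* always available, no external dependency */
#endif
}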
I don't think it's necessary. At this point, the two high level image writing functions, pixWrite() and pixWriteMem(), are instrumented to fall back to bmp.
Just noticed that I should put the format changer in pixWriteStream(), not in pixWrite(), because pixWrite() calls pixWriteStream(). Will do this.
The balance is now 101 pass, 29 fail.
Here are some suggested changes to the gplot code so that it can write pnm, and fall back to same when png is unavailable. (Additional tweaks are needed to get the tests passing; this is just a start.)
Some of the tests are going to need a fragment like
#if !defined(HAVE_LIBPNG)
    L_ERROR("This test requires libpng to run.\n", "xxxx_reg");  /* 2nd arg: the test's name ("xxxx_reg" is a placeholder) */
    exit(77);
#endif
That will cause the test to return a result of SKIP in the "make check" run.
101 out of 130 seems GEFGW.
Thank you for the gplot patch. It doesn't save the tests, because we're writing to files named *.png, and gplot adds an appropriate extension, .pnm when outputting in pnm. But your fragment at least allows the failures to happen silently. Send me a list of tests that fail, and I'll add the fragment.
I was amused to see that the gplot pnm output has a lousy font for text, and is 200x bigger than the png output.
Skipped the regression tests that use gplot (they all emit png). You should now be at 101 pass, 17 fail.
Just realized that I can modify the gplot routines to return an image (png or pnm), simplifying the procedure of generating it, and doing so without changing the gplotSimple1 (etc) interfaces.
Perhaps a weekend project.
It's certainly getting better! Here is an initial set of tests that can be skipped:
Test name           Skip due to call to...
================    ===============================
boxa3_reg           pixaGenerateFontFromString
boxa4_reg           pixaGenerateFontFromString
ccthin2_reg         l_bootnum_gen3
colorcontent_reg    pixaGenerateFontFromString
coloring_reg        pixaRead
colorize_reg        pixaGenerateFontFromString
enhance_reg         pixaGenerateFontFromString
genfonts_reg        pixaGenerateFontFromString
insert_reg          pixaWrite
mtiff_reg           (multi-page TIFF functionality)
numa2_reg           pixaGenerateFontFromString
ptra2_reg           pixaWrite
quadtree_reg        pixaGenerateFontFromString
rankhisto_reg       pixaGenerateFontFromString
shear2_reg          pixaRead
subpixel_reg        pixaRead
writetext_reg       pixaRead

Function                      Dependency
==========================    ==========
l_bootnum_gen3                zlib
pixaGenerateFontFromString    libtiff
pixaRead                      libpng
pixaWrite                     libpng
Separately: pngio_reg just needs the exit(77) bit.
This patch has a few further suggestions for the code:
- gplot.c: Slight logic tweak so that output is still written to a .png file when a GPLOT_PNG -> GPLOT_PNM downgrade takes place. (This aside, having the gplot routines return an image object directly seems like it will be a better way to go. But for now, this should at least allow the gplot-using tests to be enabled.)
- pdfio2.c: Implemented downgrade to L_FLATE_ENCODE when zlib is available (sketched below). This isn't a complete solution (there is no further fallback without zlib, not unless you implement an uncompressed PDF encoding), but zlib is usually available even when the other libs are not.
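The pdfio2.c hunk is conceptually along these lines (a sketch under the assumption that the type argument of pixGenerateCIData() is remapped before dispatch; the actual patch may differ):

#if !defined(HAVE_LIBJPEG)
    if (type == L_JPEG_ENCODE)
        type = L_FLATE_ENCODE;   /* zlib-based flate as the fallback */
#endif  /* !HAVE_LIBJPEG */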
By the way, I ran into some trouble with the HAVE_LIB* conditionals in pdfio2.c when I noticed that the file was not #including config_auto.h. I would suggest adding that #include to every .c file that doesn't already have it. Even if a source file doesn't use HAVE_* et al., the config header could well define something like _XOPEN_SOURCE or function replacements that affect normal (non-preprocessor-oriented) code.
With the above change to pixGenerateCIData(), the pdfio1_reg test can be skipped when zlib is unavailable.
There are a few remaining tests needing attention, but let's get this low-hanging fruit out of the way first.
Re gplot font: I'm not conversant in Gnuplot, but you might find that tweaking the terminal type gives you better results. I assume getting the same pixels as png output should be possible.
Coincidentally, another collaborator is independently fixing some build warnings, and he has been replacing #if defined(xxx) ==> #if XXX. I believe you said that we should use #if defined(xxx) in general. Is that correct?

If we put

#ifdef HAVE_CONFIG_H
#include "config_auto.h"
#endif  /* HAVE_CONFIG_H */

in every source file, will that solve the problem and allow us to use #if defined(XXX) throughout?
I'd rather not do your gplot change with output_actual.
Instead my plan is to do this: in gplot, we can generate the pix directly with a single call, without having to specify a rootname and then generate the file name to read it back. And this can be done by changing only gplotSimpleXY1(), ...XY2() and ...XYN(), and adding new functions that call them: gplotSimplePix1(), gplotSimplePix2(), gplotSimplePixN().
Once we have these functions, the use (e.g., in the regression tests) will simply be through regTestWritePixAndCheck(). We will not need to know the names of the temporary files that gplot creates.
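A minimal sketch of the shape such a wrapper might take (gplotSimplePix1() did not exist at this point; for simplicity the sketch goes through gplotSimple1() rather than gplotSimpleXY1(), and the temp rootname shown is just a placeholder):

#include <stdio.h>
#include "allheaders.h"

/* Sketch only; the real function may handle naming and cleanup differently,
 * and should also fall back to pnm when png support is unavailable. */
PIX *
gplotSimplePix1(NUMA *na, const char *title)
{
char  rootname[] = "/tmp/lept/gplot/pix1";   /* placeholder temp rootname */
char  filename[64];

    if (gplotSimple1(na, GPLOT_PNG, rootname, title) != 0)
        return NULL;
    snprintf(filename, sizeof(filename), "%s.png", rootname);
    return pixRead(filename);   /* caller never sees the temp file */
}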
Working late today...
For HAVE_* symbols defined by Autoconf, yes: #if defined(HAVE_FOO) (or #ifdef HAVE_FOO) should be used, because HAVE_FOO will take one of the following forms in the config header:

#define HAVE_FOO 1

...or...

/* #undef HAVE_FOO */

Using #if HAVE_FOO in the latter case only works due to compiler leniency (it's a common flub), and with -Wundef will print a warning.

(I suspect your collaborator is favoring #if due to what they saw in environ.h, interpreting that as the convention used by the project. The problem, of course, is that environ.h itself needs to change to follow the Autoconf convention.)
Adding config_auto.h to every source file isn't about resolving the #if/#ifdef issue, but about making all the HAVE_* symbols available in the first place. The problem I was having with pdfio2.c was that all my added code, which is surrounded by #if defined(HAVE_LIBZ), was not getting compiled in even though HAVE_LIBZ was correctly #defined to 1 in config_auto.h. (If I had incorrectly done #if HAVE_LIBZ, the result would not have been any different.)
Re gplot changes: Hiding the temporary file(s) sounds great. It's a bit more work, but ultimately a better approach.
Doesn't it make more sense to add those 3 lines to environ.h, which is included with allheaders.h in every source file?
I am the collaborator. @iskunk is right that #if defined(HAVE_FOO) or #ifdef HAVE_FOO should be used with the Autoconf macros, but that requires modifications in src/environ.h.

It is important that the definitions from config_auto.h are included in each source file which uses HAVE_FOO. Adding the three lines to src/environ.h and removing them everywhere else looks like a reasonable solution to me.
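For clarity, the three lines in question, as they would sit near the top of src/environ.h (a sketch of the proposal, not a committed change):

#ifdef HAVE_CONFIG_H
#include "config_auto.h"
#endif  /* HAVE_CONFIG_H */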