
Test suite fails without external image libraries

Open iskunk opened this issue 5 years ago • 48 comments

I am building Leptonica for use in a software system where image I/O and compression are handled by other components. Therefore, I configure with

--without-zlib
--without-libpng
--without-jpeg
--without-giflib
--without-libtiff
--without-libwebp
--without-libopenjpeg

When I build in this manner, however, make check gives rather discouraging results:

FAIL: adaptmap_reg
FAIL: adaptnorm_reg
FAIL: affine_reg
FAIL: alphaops_reg
FAIL: alphaxform_reg
SKIP: baseline_reg
FAIL: bilateral2_reg
FAIL: bilinear_reg
FAIL: binarize_reg
[...]
============================================================================
Testsuite summary for leptonica 1.78.0
============================================================================
# TOTAL: 125
# PASS:  3
# SKIP:  18
# XFAIL: 0
# FAIL:  104
# XPASS: 0
# ERROR: 0

The adaptmap_reg.log file begins with


////////////////////////////////////////////////
////////////////   adaptmap_reg   ///////////////
////////////////////////////////////////////////
leptonica-1.78.0 : (null)
Error in pixReadStreamJpeg: function not present
Error in pixReadStream: jpeg: no pix returned
Error in pixRead: pix not read
Error in pixConvertRGBToGray: pixs not defined
Error in pixaAddPix: pix not defined
[...]

I think it would make more sense for the image files that are used in the library's test suite to be distributed in a format that the library can read without external dependencies, such as bmp or pnm. (Of course, this would not apply in instances where I/O involving formats that require external dependencies is the functionality being tested.)

iskunk avatar Oct 28 '19 21:10 iskunk

If I were to have all the image files that are used for testing in uncompressed format, they would take up about 15 times the current disk space, which I've been careful to keep down to about 10 MB. The approx. 200 jpg, png and tiff compressed image files would balloon to an estimated 150 MB.

DanBloomberg avatar Oct 29 '19 22:10 DanBloomberg

As I understand it, the bmp format does support RLE, and the image files would be compressed within the distribution tarball (or .zip file) in any event. Would that improve the situation much?

Another possibility: add a set of Make rules that use some external tool (e.g. ImageMagick's convert utility) to convert the image files into bmp/pnm, combined with a fallback in the code to load in such a format if the normal jpg/png/tiff support is unavailable.

iskunk avatar Oct 29 '19 22:10 iskunk

It doesn't make much sense to me.

It's hard to imagine a user who wants to run the test suite, but doesn't have the standard image i/o libraries. I have deliberately limited as far as possible the dependencies of leptonica on external libraries, but virtually all users will have png, gz, jpeg and tiff. I consider webp, jp2k and gif optional -- webp and jp2k because they are not yet widely used, and gif because in every way it is inferior to png.

You can make leptonica without libraries, and the I/O gets appropriately stubbed out.

I implemented the original (most basic) bmp utility, which has no compression. It's there because people sometimes use it, like pnm, but there is nothing else to justify putting effort into it.

One could rely on gzip or bzip2 to compress the tarball, but when it is uncompressed, you have a big expansion of files in the directory. Not a nice thing to do to the users.

And then the entire test suite would need to be converted to use these uncompressed files.

-- Dan

DanBloomberg avatar Oct 29 '19 23:10 DanBloomberg

The problem is not that those libraries are unavailable. (Although they could be, if you are building in a more unusual environment, like an embedded system.) The problem is that if you are preparing a build that deliberately does not use those libraries, then you don't have a way of validating that build. The image I/O is incidental to most of the tests, so in theory, having the images in a different format does not defeat their purpose.

(Of course, I could perform a separate build of the same source, with the external libraries enabled, and run the test suite using that. However, this doesn't actually validate the build that would be used, and could potentially let slip some unexpected error condition that results in part due to missing external libraries.)

How about the alternative option, then? I can provide a set of makefile rules that a user can invoke to automatically convert all the jpeg/png/tiff files into bmp/pnm. All that would be needed in the code, then, is that whenever an attempt to load e.g. foo.png fails, it then falls back to attempting foo.bmp (or pnm). This would leave the majority of users unaffected, while still allowing special-purpose builds to be run through the test suite.

iskunk avatar Oct 29 '19 23:10 iskunk

It's too complicated. We support building using my hand-made makefiles, autoconf and cmake. How many libraries do you know that offer multiple ways to build?

You can build and run the tests using either the static makefiles or autoconf. cmake does not yet have the ability to run the tests after building (it's an old issue waiting implementation).

I can pretty well assure you that we're not going to complicate the github distro.

DanBloomberg avatar Oct 30 '19 00:10 DanBloomberg

I'm not sure I understand where you see the complexity coming from. The use of multiple build systems would only suggest a different approach; it's not a showstopper. I will sketch out an idea of this below to illustrate.

(Note that I can currently build and run the tests, as shown above, but nearly all the tests fail as they cannot load the test images. As I mentioned previously, I could produce a separate build of the tests, with the external libraries, and run those---but then that build is not usable to me, due to various legal and release-engineering issues. At present, the situation is that a Leptonica build that does not have external libraries enabled is un-testable; there is no way to validate it.)

Here is one way I can see this working:

  • A convert.sh shell script is added to the prog subdirectory. This script, when run inside its containing directory, calls some external tool (like ImageMagick's convert or some of the Netpbm utilities) to convert the various jpg, png and tiff files present into e.g. bmp format, under a different filename. (For example, foo.jpg results in a foo.jpg.bmp file.)

  • A new function is defined, something like pixReadWithFallback(), with the same signature as pixRead(). If it cannot load foo.jpg for whatever reason, it then attempts foo.jpg.bmp, and then as a final measure foo.jpg.pnm.

    • This functionality could also be rolled into the existing pixRead(), with some mechanism to enable it at run-time, but I think you would likely prefer a separate/new function.

    • An additional convention may be used to make multi-page TIFF files available in fallback form; e.g. .0.bmp, .1.bmp, and so on.

  • Tests are edited to use this new function instead of pixRead().

  • A user in my situation would need only know to run the convert.sh script before invoking make check, which should then complete without issue.

The majority of users who enable the external libraries as a matter of course are unlikely to even notice the added support for this scenario.
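To make the fallback idea concrete, here is a minimal sketch of the lookup order, using plain stdio as a stand-in for the leptonica API. openWithFallback() and the suffix list are hypothetical; a real pixReadWithFallback() would call pixRead() on each candidate path instead of fopen().

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: try the requested file first, then "<name>.bmp",
 * then "<name>.pnm".  Returns the first stream that opens, recording
 * which path succeeded in 'used' (if non-NULL). */
FILE *openWithFallback(const char *path, char *used, size_t usedsize)
{
    static const char *suffixes[] = { "", ".bmp", ".pnm" };
    char buf[512];
    size_t i;

    for (i = 0; i < sizeof(suffixes) / sizeof(suffixes[0]); i++) {
        FILE *fp;
        snprintf(buf, sizeof(buf), "%s%s", path, suffixes[i]);
        fp = fopen(buf, "rb");
        if (fp) {
            if (used)
                snprintf(used, usedsize, "%s", buf);
            return fp;  /* first candidate that opens wins */
        }
    }
    return NULL;  /* neither the original nor any fallback exists */
}
```

The same three-step probe would apply per page for the multi-page TIFF convention (.0.bmp, .1.bmp, ...).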

iskunk avatar Oct 30 '19 17:10 iskunk

Pretty clever!

I don't want to put a fallback function into all the tests. However, as a compromise, suppose we make a script that converts those image files to bmp, but with the same name. Leptonica functions that read image data use the information in the header, not the filename extension, so a bmp file with a .png extension will be read properly as a bmp file.

Then we don't need to change the tests. You just run the script before the test.
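A minimal sketch of such a script, wrapped in a function for clarity. It assumes ImageMagick's convert is on the PATH; the glob list and the in-place overwrite via a temp file are assumptions, not the actual script.

```shell
# Sketch: rewrite each compressed test image as bmp, keeping the
# original filename, so the readers (which sniff the header, not the
# extension) load the bmp data transparently.
convert_to_bmp_in_place() {
    for f in *.jpg *.png *.tif *.tiff; do
        [ -e "$f" ] || continue      # skip unmatched glob patterns
        # "bmp:" forces the output format regardless of extension
        convert "$f" "bmp:$f.tmp" && mv "$f.tmp" "$f"
    done
}
```

Run from inside the prog directory before invoking make check.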

DanBloomberg avatar Oct 30 '19 22:10 DanBloomberg

I had noticed that property of pixRead(). I don't like the idea of a file with extension X actually containing content Y, but if the tests are run in an out-of-source-tree build, then the files with the "wrong" extension can be created as temporary artifacts of the build. All that would be needed is the conversion script (which I can provide), and ensuring that the tests load files from the build tree in preference to the source tree.

I'll do some experimentation on this, and get back to you in a few days.

iskunk avatar Oct 30 '19 22:10 iskunk

Okay, I've put the converted files into place, but have encountered an issue:

$ ./adaptmap_reg 

////////////////////////////////////////////////
////////////////   adaptmap_reg   ///////////////
////////////////////////////////////////////////
leptonica-1.79.0 : (null)
Time for gray adaptmap gen:   0.008
Error in pixWriteStreamPng: function not present
Error in pixWrite: pix not written to stream
Error in findFileFormatStream: truncated file
Error in fopenReadStream: file not found
[...]

Do the tests need to write out intermediate files using an external format?

iskunk avatar Nov 05 '19 15:11 iskunk

Ah yes, the write side of the issue.

Do the tests need to write out intermediate files using an external format?

Yes, they are designed that way. The full set of tests generates about 340 MB of image files in generate mode, half in /tmp/lept/regout/ and half in /tmp/lept/golden/. Then in compare mode, 170 MB of image files are written to /tmp/lept/regout/.

For example, the function regTestWritePixAndCheck() is useful because of its simplicity (you don't need to specify the temp file names). It compares files byte-for-byte. The decision about format for the files is made for maximum compression. If uncompressed files were used, we'd generate 5 or 6 GB of data! There are 1000 invocations of this function alone in the regression tests.

If instead of writing to file, we wrote to memory, we would have no files available to look at when the test fails.

DanBloomberg avatar Nov 05 '19 16:11 DanBloomberg

It's reasonable that more temporary space would be needed when uncompressed formats are the only ones available. Could we conditionalize the tests to not request png format when it is not compiled in?

iskunk avatar Nov 05 '19 17:11 iskunk

You are persistent :-)

OK, I've done that. Need to handle png, jpeg and tiff_g4. Will push it out shortly.

DanBloomberg avatar Nov 05 '19 18:11 DanBloomberg

Just trying to find a way to make this work :-)

Okay, I've pulled in your changes, and am seeing an increase in the number of tests passing. There are still a number of tests failing with "function not present":

$ grep -l 'function not present' *.log | fgrep -v test-suite.log | while read x; do echo $x:; grep 'function not present' $x | sort -u; echo; done
adaptmap_reg.log:
Error in pixWriteStreamJpeg: function not present

alphaops_reg.log:
Error in pixWriteStreamJpeg: function not present
Error in pixWriteStreamPng: function not present

baseline_reg.log:
Error in pixReadStreamPng: function not present
Error in pixWriteStreamPng: function not present

blend3_reg.log:
Error in pixWriteStreamJpeg: function not present
[...]

I'm getting better success with conversion to pnm, though in general there seem to be some issues with the way the image is encoded. One example is feyn.tif not working correctly when converted to what appears to be a semantically equivalent bmp file. I have to investigate this further to better describe the issue, but it seems like there may be some interesting corner cases here.

iskunk avatar Nov 05 '19 20:11 iskunk

These aren't corner cases -- there are 211 invocations of pixWrite() in the *_reg.c regression tests.

I can solve most of them by putting those same pre-processor statements in pixWrite().

DanBloomberg avatar Nov 06 '19 00:11 DanBloomberg

done. Also included all variants of tiff compression supported by leptonica.

DanBloomberg avatar Nov 06 '19 00:11 DanBloomberg

Thanks, I'm seeing fewer failures (94 pass, 36 fail). I looked through a few of the latter:

  • baseline_reg fails due to passing GPLOT_PNG to gplotSimple1()

  • blend2_reg fails because it cannot find /tmp/lept/regout/blend2.14.png

  • boxa2_reg fails due to boxaPlotSizes() passing GPLOT_PNG to gplotCreate()

  • bytea_reg fails due to convertPdf() -> pixGenerateCIData() -> pixGenerateJpegData(). Is it feasible to have an uncompressed PDF encoding, for completeness? Else these PDF-related tests could be skipped when jpeg/zlib/etc. is unavailable.

  • ccthin2_reg fails due to l_bootnum_gen3() calling zlibUncompress(). Probably another one to skip when zlib is unavailable.

  • coloring_reg fails due to "pixa" fonts not working. Another probable skip (unless prog/fonts/chars-*.tif could be substituted).

  • enhance_reg fails due to src/bmfdata.h containing TIFF data. Another skip.

When you print e.g. "Warning in pixWrite: png library missing; output bmp format", you may want to add a newline, because in the test output logs the following line of output starts immediately after "format".

Lastly, after seeing the new TIFF logic, may I suggest something like

#define L_FORMAT_IS_TIFF(f) ((f) >= IFF_TIFF && (f) <= IFF_TIFF_ZIP)

? Seems like that same OR-statement crops up a few times...
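For illustration, the range check is valid because the tiff variants are contiguous in the IFF_* enum. The enum below is a stand-in reproducing the ordering in src/imageio.h as I recall it; treat the exact values as an assumption.

```c
/* Stand-in for the IFF_* enum in imageio.h; the tiff variants are
 * contiguous, which is what makes the range check work. */
enum {
    IFF_UNKNOWN = 0,
    IFF_BMP,
    IFF_JFIF_JPEG,
    IFF_PNG,
    IFF_TIFF,            /* first tiff variant */
    IFF_TIFF_PACKBITS,
    IFF_TIFF_RLE,
    IFF_TIFF_G3,
    IFF_TIFF_G4,
    IFF_TIFF_LZW,
    IFF_TIFF_ZIP,        /* last tiff variant */
    IFF_PNM
};

/* Replaces the repeated chain:
 *   format == IFF_TIFF || format == IFF_TIFF_PACKBITS || ...  */
#define L_FORMAT_IS_TIFF(f) ((f) >= IFF_TIFF && (f) <= IFF_TIFF_ZIP)
```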

iskunk avatar Nov 06 '19 16:11 iskunk

Yes, I forgot the newline, which should always end the message string in the L_* macros.

And thanks for the L_FORMAT_IS_TIFF macro.

Can one make a macro of macros; e.g., #define NO_EXTERNAL_IO_LIBS ???

DanBloomberg avatar Nov 06 '19 16:11 DanBloomberg

A macro can certainly call other macros; what did you have in mind?

Looking over src/imageio.h, specifically the IFF_* enums, I was thinking... you already have a couple of symbolic entries in there (IFF_UNKNOWN, IFF_DEFAULT). What if there were one like IFF_BEST, that would normally be equivalent to IFF_PNG, but can become IFF_BMP when necessary? Something like this could help abstract away a lot of the png references in the code.

This could potentially be taken further by combining it with a "symbolic" file extension like .best, which when seen on a filename, is automatically changed to .png or .bmp as appropriate.
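A hypothetical sketch of the IFF_BEST idea, resolving the format at compile time. The IFF_PNG/IFF_BMP values are stand-ins, and HAVE_LIBPNG follows the Autoconf convention; none of this is existing leptonica code.

```c
/* Stand-in values for two of the IFF_* enum entries. */
#define IFF_BMP  1
#define IFF_PNG  3

/* Hypothetical: the "best" lossless format available in this build.
 * With libpng compiled in it is png; otherwise it degrades to bmp. */
#if defined(HAVE_LIBPNG)
#define IFF_BEST IFF_PNG
#else
#define IFF_BEST IFF_BMP
#endif
```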

iskunk avatar Nov 06 '19 18:11 iskunk

I don't think it's necessary. At this point, the two high level image writing functions, pixWrite() and pixWriteMem(), are instrumented to fall back to bmp.

DanBloomberg avatar Nov 07 '19 19:11 DanBloomberg

Just noticed that I should put the format changer in pixWriteStream(), not in pixWrite(), because pixWrite() calls pixWriteStream(). Will do this.

DanBloomberg avatar Nov 07 '19 19:11 DanBloomberg

The balance is now 101 pass, 29 fail.

Here are some suggested changes to the gplot code so that it can write pnm, and fall back to same when png is unavailable. (Additional tweaks are needed to get the tests passing; this is just a start.)

Some of the tests are going to need a fragment like

#if !defined(HAVE_LIBPNG)
  L_ERROR("This test requires libpng to run.\n");
  exit(77);
#endif

That will cause the test to return a result of SKIP in the "make check" run.

iskunk avatar Nov 07 '19 21:11 iskunk

101 out of 130 seems GEFGW.

Thank you for the gplot patch. It doesn't save the tests, because we're writing to files named *.png, and gplot adds the appropriate extension, .pnm, when outputting in pnm. But your fragment at least allows the failures to happen silently. Send me a list of tests that fail, and I'll add the fragment.

I was amused to see that the gplot pnm output has a lousy font for text, and is 200x bigger than the png output.

DanBloomberg avatar Nov 07 '19 22:11 DanBloomberg

Skipped the regression tests that use gplot (they all emit png). You should now be at 101 pass, 17 fail.

DanBloomberg avatar Nov 07 '19 23:11 DanBloomberg

Just realized that I can modify the gplot routines to return an image (png or pnm), simplifying the procedure of generating it, and doing so without changing the gplotSimple1 (etc) interfaces.

Perhaps a weekend project.

DanBloomberg avatar Nov 07 '19 23:11 DanBloomberg

It's certainly getting better! Here is an initial set of tests that can be skipped:

Test name		Skip due to call to...
================	==========================
boxa3_reg		pixaGenerateFontFromString
boxa4_reg		pixaGenerateFontFromString
ccthin2_reg		l_bootnum_gen3
colorcontent_reg	pixaGenerateFontFromString
coloring_reg		pixaRead
colorize_reg		pixaGenerateFontFromString
enhance_reg		pixaGenerateFontFromString
genfonts_reg		pixaGenerateFontFromString
insert_reg		pixaWrite
mtiff_reg		(multi-page TIFF functionality)
numa2_reg		pixaGenerateFontFromString
ptra2_reg		pixaWrite
quadtree_reg		pixaGenerateFontFromString
rankhisto_reg		pixaGenerateFontFromString
shear2_reg		pixaRead
subpixel_reg		pixaRead
writetext_reg		pixaRead

Function			Dependency
==========================	==========
l_bootnum_gen3			zlib
pixaGenerateFontFromString	libtiff
pixaRead			libpng
pixaWrite			libpng

Separately: pngio_reg just needs the exit(77) bit.

This patch has a few further suggestions for the code:

  • gplot.c: Slight logic tweak so that output is still written to a .png file when a GPLOT_PNG -> GPLOT_PNM downgrade takes place. (This aside, having the gplot routines return an image object directly seems like it will be a better way to go. But for now, this should at least allow the gplot-using tests to be enabled.)
  • pdfio2.c: Implemented downgrade to L_FLATE_DECODE when zlib is available. This isn't a complete solution (there is no further fallback without zlib, not unless you implement an uncompressed PDF encoding), but zlib is usually available even when the other libs are not.

By the way, I ran into some trouble with the HAVE_LIB* conditionals in pdfio2.c when I noticed that the file was not #including config_auto.h. I would suggest adding that #include to every .c file that doesn't already have it. Even if a source file doesn't use HAVE_* et al., the config header could well define something like _XOPEN_SOURCE or function replacements that affect normal (non-preprocessor-oriented) code.

With the above change to pixGenerateCIData(), the pdfio1_reg test can be skipped when zlib is unavailable.

There are a few remaining tests needing attention, but let's get this low-hanging fruit out of the way first.

Re gplot font: I'm not conversant in Gnuplot, but you might find that tweaking the terminal type gives you better results. I assume getting the same pixels as png output should be possible.

iskunk avatar Nov 08 '19 17:11 iskunk

Coincidentally, another collaborator is independently fixing some build warnings, and he has been replacing #if defined(xxx) with #if XXX. I believe you said that we should use #if defined(xxx) in general. Is that correct?

If we put

#ifdef HAVE_CONFIG_H
#include "config_auto.h"
#endif  /* HAVE_CONFIG_H */

in every source file, will that solve the problem and allow us to use #if defined(XXX) throughout?

DanBloomberg avatar Nov 08 '19 23:11 DanBloomberg

I'd rather not do your gplot change with output_actual.

Instead my plan is to do this: in gplot, we can generate the pix directly with a single call, without having to specify a rootname and then generate the file name to read it back. And this can be done by changing only gplotSimpleXY1(), ...XY2() and ...XYN(), and adding new functions that call them: gplotSimplePix1(), gplotSimplePix2(), gplotSimplePixN().

Once we have these functions, the use (e.g., in the regression tests) will simply be through regTestWritePixAndCheck(). We will not need to know the names of the temporary files that gplot creates.

DanBloomberg avatar Nov 09 '19 00:11 DanBloomberg

Working late today...

For HAVE_* symbols defined by Autoconf, yes: #if defined(HAVE_FOO) (or #ifdef HAVE_FOO) should be used. Because HAVE_FOO will take one of the following forms in the config header:

#define HAVE_FOO 1

...or...

/* #undef HAVE_FOO */

Using #if HAVE_FOO in the latter case only works due to compiler leniency (it's a common flub), and with -Wundef will print a warning.
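A minimal illustration, with HAVE_FOO standing in for any Autoconf-generated symbol:

```c
/* When the feature is present, the config header contains
 * "#define HAVE_FOO 1"; when absent, only a commented-out
 * "#undef HAVE_FOO".  Here we simulate the "present" case. */
#define HAVE_FOO 1

int have_foo(void)
{
#if defined(HAVE_FOO)   /* correct whether or not the symbol is defined */
    return 1;
#else
    return 0;
#endif
}
/* "#if HAVE_FOO" also happens to work in the "present" case, but when
 * HAVE_FOO is left undefined it silently evaluates to 0, and -Wundef
 * flags it. */
```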

(I suspect your collaborator is favoring #if due to what they saw in environ.h, interpreting that as the convention used by the project. The problem, of course, is that environ.h itself needs to change to follow the Autoconf convention.)

Adding config_auto.h to every source file isn't about resolving the #if/#ifdef issue, but about making all the HAVE_* symbols available in the first place. The problem I was having with pdfio2.c was that all my added code, which is surrounded by #if defined(HAVE_LIBZ), was not getting compiled in even though HAVE_LIBZ was correctly #defined to 1 in config_auto.h. (If I had incorrectly done #if HAVE_LIBZ, the result would not have been any different.)

Re gplot changes: Hiding the temporary file(s) sounds great. It's a bit more work, but ultimately a better approach.

iskunk avatar Nov 09 '19 00:11 iskunk

Doesn't it make more sense to add those 3 lines to environ.h, which is included with allheaders.h in every source file?

DanBloomberg avatar Nov 09 '19 04:11 DanBloomberg

I am the collaborator. @iskunk is right that #if defined(HAVE_FOO) or #ifdef HAVE_FOO should be used with the Autoconf macros, but that requires modifications in src/environ.h.

It is important that the definitions from config_auto.h are included in each source file which uses HAVE_FOO. Adding the three lines to src/environ.h and removing them everywhere else looks like a reasonable solution to me.

stweil avatar Nov 09 '19 07:11 stweil