iipsrv

Write error does not raise an HTTP error

Open scossu opened this issue 3 years ago • 10 comments

While performing a load test on an IIPImage server, I occasionally get error messages such as Error writing strip: <n> and Error writing output. I don't see a corresponding HTTP code that indicates a failure.

I am not able to inspect the output of these requests since they are produced occasionally under heavy traffic situations (using a traffic simulator that discards the responses), but judging from the error messages I suppose that the content of the images may be corrupted. The root cause of this behavior is still being investigated but it's not the main point of this ticket.

Tracking down the messages, I came across the logic that produces them: https://github.com/ruven/iipsrv/blob/master/src/CVT.cc#L609-L614 and https://github.com/ruven/iipsrv/blob/master/src/CVT.cc#L580-L584. It seems that on a write error, the CVT code only logs the error and carries on.

Should the application rather terminate the request by sending a 5xx error, so that these issues can be more easily identified?

scossu avatar May 14 '21 20:05 scossu

Maybe related, but on another error, I sometimes get TPTImage :: TIFFSetDirectory() failed messages as the body of the image, which is sent as image/jpeg content with a 200 status:

content-type: image/jpeg
content-length: 203
date: Wed, 08 Sep 2021 17:18:11 GMT
x-powered-by: IIPImage
cache-control: max-age:3600
last-modified: Wed, 08 Sep 2021 16:42:23 GMT
content-disposition: inline;filename="3a84b1a0-f395-4642-bdaf-870a5652fc03.jpg"
access-control-allow-origin: *
x-varnish: 39714904 39223465
via: 1.1 varnish (Varnish/6.5), 1.1 8c80b6c82514458b3d30fbde4b4a2dd5.cloudfront.net (CloudFront)
accept-ranges: bytes
strict-transport-security: max-age=15724800; includeSubDomains
x-cache: Miss from cloudfront
x-amz-cf-pop: LAX3-C2
x-amz-cf-id: EXoW-aZlNYxIWdIb5qJx49rmwPJHRAE9XHMct3_xi8YiX8KQ00nVjQ==
age: 2146

Status: 404 Not Found
Server: iipsrv/1.2
Content-Type: text/plain; charset=utf-8
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: X-Requested-With

TPTImage :: TIFFSetDirectory() failed

(This is going through CloudFront and Varnish; the lines after age: 2146 are the actual response body.)

This is also happening randomly, so it's really hard to catch as it happens.
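Since the failure shows up as a 200 response whose body is not an image, a load-test client could flag it by inspecting the first bytes of each body instead of discarding the response. The following is a minimal sketch, not part of iipsrv: the function names are hypothetical, and it only checks for the JPEG start-of-image marker rather than validating the whole file.

```python
# Hypothetical helper for a load-test client; none of these names come from iipsrv.
JPEG_SOI = b"\xff\xd8\xff"  # a JPEG stream starts with the SOI marker followed by another marker


def looks_like_jpeg(body: bytes) -> bool:
    """Return True if the body starts with the JPEG start-of-image marker."""
    return body.startswith(JPEG_SOI)


def classify_response(status: int, content_type: str, body: bytes) -> str:
    """Label a response as 'ok', 'http-error' or 'masked-error'."""
    if status >= 400:
        return "http-error"
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "image/jpeg" and not looks_like_jpeg(body):
        # Declared as JPEG with a success status, but the payload is something else,
        # e.g. an error message that was written into the body.
        return "masked-error"
    return "ok"
```

Logging the request URL and timestamp for every "masked-error" result would give a time window to search for in the iipsrv log.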

scossu avatar Sep 08 '21 17:09 scossu

For errors such as the strip writing errors in CVT.cc, I've leaned throughout the code towards allowing the server to continue working and providing some sort of output even if an error occurs. I guess there's a balance between allowing imperfect output to be sent and strict error checking.

For your TIFFSetDirectory() errors, I'm not sure what could be causing this. Are your images hosted on a remote drive via NFS or similar? What is in the iipsrv log file when this error occurs? In any case, I've just uploaded a commit which adds extra error checking for TIFFSetDirectory failures, so it should be easier to track these.

ruven avatar Sep 13 '21 20:09 ruven

Ruven, thanks for adding the extra checks. I will pull in your code and test. Probably this won't change what I was seeing, since that error message is exactly what I got as an image (so it happened probably somewhere you have that check already).

I'm having a hard time finding the logs for the exact time the TIFFSetDirectory error occurs, since this is happening in a production environment that is continually under load. My source TIFFs are on an NFS (Amazon EFS) volume, so it may well be that a network glitch is the cause. That would normally be tolerable, except that the file_error resolves into a broken image (HTTP 200) with the error message as its content. Thus, the broken image gets cached downstream and persists for a longer time than it should. Could that be an HTTP 50x instead, so as to minimize the effect of a possible transient glitch?

scossu avatar Sep 13 '21 21:09 scossu

I realize that the strip writing error may be different and I apologize if I bundled the two issues into one. What is the consequence of that error? A corrupt image or tile? I am unable to see the results directly since that occurs among thousands of image requests.

scossu avatar Sep 13 '21 21:09 scossu

> Thus, the broken image gets cached downstream and persists for a longer time than it should.

The new file_error checks I added should result in an HTTP 404 error, which should be easier to track down. I'll try to simulate an unstable NFS mount to see how iipsrv reacts. iipsrv should really handle this kind of thing better.

> What is the consequence of that error?

The strip writing error should just result in a corrupt image or tile.

ruven avatar Sep 13 '21 22:09 ruven

> The new file_error checks I added should result in an HTTP 404 error

I think a 500 would be more fitting, since it's a transient server error (normally 404s should not be retried).
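To make the distinction concrete, here is a minimal sketch of the kind of mapping being proposed. It is an illustration only and not iipsrv code: the `Failure` categories and the `error_response` function are hypothetical. It relies on the fact that HTTP caches are allowed to store a 404 by default, whereas a 500 carrying `Cache-Control: no-store` should not be kept by downstream caches such as Varnish or CloudFront.

```python
from __future__ import annotations

from enum import Enum


class Failure(Enum):
    """Hypothetical failure categories, for illustration only."""
    NOT_FOUND = "not_found"        # the requested image does not exist
    TRANSIENT_IO = "transient_io"  # e.g. a read from network storage failed


def error_response(failure: Failure) -> tuple[int, dict[str, str]]:
    """Pick a status code and headers for a failed image request."""
    headers = {"Content-Type": "text/plain; charset=utf-8"}
    if failure is Failure.NOT_FOUND:
        # A missing resource: clients should not retry, caches may keep it.
        return 404, headers
    # A server-side glitch: tell caches not to store it so a retry can succeed.
    headers["Cache-Control"] = "no-store"
    return 500, headers
```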

> I'll try to simulate an unstable NFS mount to see how iipsrv reacts. iipsrv should really handle this kind of thing better.

Thanks. I'll try to isolate the issue as well.

> > What is the consequence of that error?

> The strip writing error should just result in a corrupt image or tile.

Would it be possible to offer a config option to either report and continue, or throw an HTTP error on this class of exception? Some implementers (I for one) would prefer the latter, since passing off a corrupt image as a good one might cause several issues.
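The two behaviours can be sketched as follows. This is a Python illustration of the policy, not the actual C++ code in CVT.cc; `write_strips`, `StripWriteError` and the `strict` flag are hypothetical names, and no such option is known to exist in iipsrv.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("cvt_sketch")


class StripWriteError(RuntimeError):
    """Raised in strict mode when a strip cannot be written in full."""


def write_strips(strips: Iterable[bytes],
                 write: Callable[[bytes], int],
                 strict: bool) -> int:
    """Write each strip and return how many strips failed.

    `write` is assumed to return the number of bytes it accepted.
    In lenient mode a short write is logged and the loop continues;
    in strict mode the first short write aborts the request.
    """
    failures = 0
    for n, strip in enumerate(strips):
        if write(strip) != len(strip):
            if strict:
                raise StripWriteError(f"Error writing strip: {n}")
            logger.error("Error writing strip: %d", n)
            failures += 1
    return failures
```

One caveat with the strict variant: once the response headers have been sent, the status code can no longer be changed, so an error raised partway through a streamed image can only abort the response rather than turn it into a 5xx.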

scossu avatar Sep 13 '21 22:09 scossu

I have encountered the same strip and output errors. I have done some testing using curl and Apache 2.4 with the current IIPImage Windows 10 build (Memcached not used).

My observations so far:

These errors happen only if caching is enabled (i.e. the request contains the If-Modified-Since header).

They happen while using both IIIF and IIP protocols.

They happen when a bigger image size is requested (HEI >= 110 according to my testing), e.g.:

  • source JPEG 2000 image dimensions: 3792x3121
  • FIF=/a.jp2&HEI=110&CVT=jpeg -> scaled region size: 134x110
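The scaled size follows from preserving the aspect ratio of the source image. A quick check of the arithmetic, assuming the width is rounded to the nearest pixel (how iipsrv rounds internally is not stated here; `scaled_width` is a hypothetical helper):

```python
def scaled_width(src_width: int, src_height: int, requested_height: int) -> int:
    """Width that keeps the source aspect ratio at the requested height."""
    return round(src_width * requested_height / src_height)


# 3792 * 110 / 3121 is roughly 133.65, which rounds to 134
print(scaled_width(3792, 3121, 110))  # 134
```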

These errors seem to have no impact on the output image (tested in a browser).

The test batch contained the following image heights: 10,20,30,40,50,60,70,80,90,99,100,105,110,120,150,200

curl --silent "http://iip.test/fcgi-bin/iipsrv.fcgi?FIF=/a.jp2&HEI=10&CVT=jpeg"
curl --silent "http://iip.test/fcgi-bin/iipsrv.fcgi?FIF=/a.jp2&HEI=10&CVT=jpeg" --header "If-Modified-Since: Fri, 21 Feb 2022 14:00:00 GMT"
curl --silent "http://iip.test/fcgi-bin/iipsrv.fcgi?FIF=/a.jp2&HEI=10&CVT=jpeg" --header "If-Modified-Since: Fri, 21 Feb 2022 14:00:00 GMT"
curl --silent "http://iip.test/fcgi-bin/iipsrv.fcgi?FIF=/a.jp2&HEI=20&CVT=jpeg"
curl --silent "http://iip.test/fcgi-bin/iipsrv.fcgi?FIF=/a.jp2&HEI=20&CVT=jpeg" --header "If-Modified-Since: Fri, 21 Feb 2022 14:00:00 GMT"
curl --silent "http://iip.test/fcgi-bin/iipsrv.fcgi?FIF=/a.jp2&HEI=20&CVT=jpeg" --header "If-Modified-Since: Fri, 21 Feb 2022 14:00:00 GMT"
...
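The full batch follows a simple pattern: for each height, one plain request and then two requests carrying the If-Modified-Since header. A sketch that rebuilds the list of requests from the heights above is shown below; `build_batch` is a hypothetical helper, and the URL and header value are copied from the curl commands.

```python
BASE_URL = "http://iip.test/fcgi-bin/iipsrv.fcgi"
HEIGHTS = [10, 20, 30, 40, 50, 60, 70, 80, 90, 99, 100, 105, 110, 120, 150, 200]
IF_MODIFIED_SINCE = "Fri, 21 Feb 2022 14:00:00 GMT"  # value used in the curl commands


def build_batch(heights=HEIGHTS):
    """Return (url, headers) pairs: one plain and two conditional requests per height."""
    batch = []
    for hei in heights:
        url = f"{BASE_URL}?FIF=/a.jp2&HEI={hei}&CVT=jpeg"
        batch.append((url, {}))
        for _ in range(2):
            batch.append((url, {"If-Modified-Since": IF_MODIFIED_SINCE}))
    return batch
```

Each pair can then be passed to whatever HTTP client the test uses, sending the requests back to back to reproduce the no-wait load described in this comment.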

IIP config used:

FcgidInitialEnv MAX_IMAGE_CACHE_SIZE "50"
FcgidInitialEnv JPEG_QUALITY "90"
FcgidInitialEnv MAX_CVT "3500"
FcgidInitialEnv MAX_LAYERS "-1"
FcgidInitialEnv ALLOW_UPSCALING "1"
FcgidInitialEnv EMBED_ICC "0"

FcgidIdleTimeout 0
FcgidMaxProcessesPerClass 10

Might be related:

Errors flushing data or output are more likely to occur when the server is under stress, i.e. with no waiting between requests.

filak avatar Feb 25 '22 16:02 filak

@ruven Could these just be false errors related to the size check on cached images?

filak avatar Mar 01 '22 13:03 filak

I have been trying different (higher) FcgidOutputBufferSize values, as suggested in https://github.com/ruven/iipsrv/issues/65, but this does not seem to have much effect.

filak avatar Mar 03 '22 13:03 filak

I see no FIF errors using the recent IIP version (https://github.com/ruven/iipsrv/commit/93241f8db841e4bd42ae53f94f45af7552086c72), with or without Memcached support.

filak avatar Apr 07 '22 13:04 filak