
sRGB.icc: Leaked file descriptor?

Open papandreou opened this issue 7 years ago • 24 comments

I'm debugging a minor leak problem with a sharp-based service I'm running. From /proc/$PID/fd I can tell that over time, it seems to acquire more and more open file descriptors pointing at node_modules/sharp/icc/sRGB.icc. The highest number I've seen so far is 109.

Is there a chance that the fd will be leaked in some error case?

I'm running sharp 0.16.0 for now, but an upgrade to 0.17.0 is on its way.

papandreou avatar Jan 09 '17 09:01 papandreou

Hello, my best guess would be that you're seeing the effect of libvips caching icc_transform operations. The profile file handle is only closed, via lcms2's cmsCloseProfile, when the relevant libvips operation is "disposed" of.

lovell avatar Jan 09 '17 10:01 lovell

That sounds likely. Can you think of a way I can test your theory?

I'm running with sharp's cache disabled via require('sharp').cache(false). I assume this particular part of libvips' operation cache isn't governed by that?
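
For reference, this is roughly how the cache is disabled in the service. A minimal sketch; the exact shape of the object sharp.cache() reports when called without arguments may vary between versions:

const sharp = require('sharp');

// Disable libvips' operation cache entirely so that finished operations
// (and any ICC profile handles they hold) are disposed of immediately.
sharp.cache(false);

// Calling cache() with no arguments reports the current limits/usage,
// a quick way to confirm the setting took effect.
console.log(sharp.cache());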

papandreou avatar Jan 09 '17 10:01 papandreou

If libvips' cache is disabled then these operations should already be disposed of and the file handles closed. I wonder if there's an error path where they don't get cleared up?

lovell avatar Jan 09 '17 18:01 lovell

I was worried this might be a libvips problem. I made a tiny test script:

#!/bin/sh

# ICC profile to attach during thumbnailing
profile=/usr/share/color/argyll/ref/sRGB.icm

# Run vipsthumbnail with the operation cache disabled and leak checking on
vipsthumbnail *.jpg -e "$profile" --vips-cache-max=0 --vips-leak &
pid=$!

# Poll the open file descriptors of the running process until it exits
while ls -l "/proc/$pid/fd"; do sleep 0.1; echo; done

With git master (after I fixed an oops in the handling of the --vips-cache-max flag), the output is pleasantly boring. It never goes over 5 fds, even with many broken test files.

total 0
lr-x------ 1 john john 64 Jan 10 14:16 0 -> /dev/null
l-wx------ 1 john john 64 Jan 10 14:16 1 -> /home/john/pics/x
lrwx------ 1 john john 64 Jan 10 14:16 2 -> /dev/pts/4
l-wx------ 1 john john 64 Jan 10 14:16 3 -> /home/john/pics/tn_wac_nearside.jpg
lr-x------ 1 john john 64 Jan 10 14:16 4 -> /home/john/pics/wac_nearside.jpg

I don't know if that helps at all :(

jcupitt avatar Jan 10 '17 14:01 jcupitt

I added some debugging to libvips to track the opening/closing of profiles then ran the sharp test suite (includes CMYK to sRGB conversion, missing/broken embedded profiles etc.) and found that every cmsOpenProfileFrom* had a matching cmsCloseProfile.

@papandreou Is this using sharp with the pre-compiled binaries or a globally-installed libvips? If the latter, which version of libvips and lcms2 are in use?

lovell avatar Jan 12 '17 23:01 lovell

Thanks for looking into this, both! I'm happy that you took the time to rule out the most obvious suspects. I'll try to narrow it down further; there's a good chance that I'm mistreating the software or neglecting to clean up correctly, e.g. when aborting an operation (I use the "readable stream" API).

@lovell I'm using the pre-compiled binaries. I've now confirmed that it happens with both sharp 0.16.0 and sharp 0.17.0.

The problem is reproducible just by having the service running in production for a while, so as a first step I'll try hooking up require('sharp').counters() to my metrics system to see whether sharp believes it's in the middle of processing a lot of images that the application doesn't know about.
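
Something along these lines, perhaps. A minimal sketch; reportGauge stands in for whatever metrics client the service actually uses:

const sharp = require('sharp');

// Hypothetical metrics hook -- replace with the real metrics client.
function reportGauge(name, value) {
  console.log(name, value);
}

// Sample sharp's internal counters every 10 seconds: `queue` is the number
// of tasks waiting for a libvips thread, `process` the number in progress.
setInterval(() => {
  const { queue, process: inProgress } = sharp.counters();
  reportGauge('sharp.queue', queue);
  reportGauge('sharp.process', inProgress);
}, 10000).unref();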

papandreou avatar Jan 13 '17 08:01 papandreou

The process and queue counters come down to zero every time there's a bit of inactivity -- so it seems like the app doesn't leave any operations hanging.

I'll try to see if I can find some specific images that trigger the leak when processed. I just noticed that /proc/$PID/fd includes a creation time, so I might be able to cross-reference with the logs and get lucky :)
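
For the cross-referencing I'm thinking of something along these lines. A rough sketch, Linux-only since it reads /proc; the symlink timestamp is only an approximation of when the descriptor was created:

const fs = require('fs');

// List this process's open fds that point at sRGB.icc, together with the
// timestamp on the /proc symlink, to cross-reference against request logs.
const fdDir = `/proc/${process.pid}/fd`;
for (const fd of fs.readdirSync(fdDir)) {
  try {
    const target = fs.readlinkSync(`${fdDir}/${fd}`);
    if (target.endsWith('sRGB.icc')) {
      const { mtime } = fs.lstatSync(`${fdDir}/${fd}`);
      console.log(fd, target, mtime.toISOString());
    }
  } catch (err) {
    // The fd was closed while we were iterating; ignore it.
  }
}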

papandreou avatar Jan 14 '17 23:01 papandreou

I found all the images that were processed by a specific worker during the minute in which an sRGB.icc fd was leaked, and processed them over and over again at various concurrencies on my dev setup. Unfortunately, I could not reproduce the leak. Back to the drawing board.

papandreou avatar Jan 16 '17 09:01 papandreou

I've added fd monitoring to a (Linux kernel v4.4.x) production system but have yet to find any evidence.

lovell avatar Jan 22 '17 09:01 lovell

Thanks for trying. If you share the fd monitoring code, I can add it as well?

papandreou avatar Jan 22 '17 21:01 papandreou

sudo lsof -p `pgrep node` 2>/dev/null | grep icc | wc -l

lovell avatar Jan 23 '17 08:01 lovell

I've been monitoring millions of images processed each day over the last two months but have seen no evidence of this problem.

@papandreou Any updates from your checks?

lovell avatar Mar 23 '17 00:03 lovell

Unfortunately, no actual progress. I'm still seeing about 50 of those leaked file handles per day across the entire installation of the image processing service, which serves about 3 million processed images per day. About 95% of those are piped straight through or served from cache without involving sharp, leaving roughly 150,000 sharp-processed images per day. In other words, the leak happens approximately once in every 3,000 times sharp is involved.

The images being processed are uploaded by users, and I haven't been able to identify any pattern with regard to the types or characteristics of the images that cause the problem.

It could be that the code that uses sharp is not getting things cleaned up correctly when the incoming request is aborted, or the image turns out to be corrupt, or exceeds the enforced pixel limits.

The sharp cache is disabled and I've already confirmed that the values in require('sharp').counters() come down to zero consistently. Does that theoretically capture all the cases where sharp would hold on to resources?

papandreou avatar Mar 23 '17 09:03 papandreou

There have been a number of relatively minor leak-related patches to lcms2 since its v2.8 release - see https://github.com/mm2/Little-CMS/commits/master

For the sharp v0.18 release, which will bundle libvips v8.5.x, I'm going to include all the latest lcms2 patches if v2.9 has not yet been released by that time (the bundled libtiff does this too).

lovell avatar Mar 24 '17 13:03 lovell

Thanks for the follow up. Looking forward to that release :)

I've extracted a sample of all images that were processed by the service in a span of 24 hours, and I'll run ab -c 20 -n 200 for each of those over the weekend, noting the number of open sRGB.icc file descriptors after each run. Hopefully that'll narrow down the list of suspects.

papandreou avatar Mar 24 '17 16:03 papandreou

My machine has now been replaying the production workload for 4 days: the first pass with ab -c 20 -n 200 for each image, and the second with many curl commands, about half of which were killed before completing (to see if the leak is due to improper cleanup on aborted requests). "Unfortunately" the leak didn't occur even once.
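
The aborted-request half was done with something like the sketch below; the URL is made up, and the real replay used the recorded production requests:

const http = require('http');

// Request an image from the service and abort roughly half of the requests
// partway through, to exercise the cleanup paths for cancelled responses.
function hitOnce(url) {
  const req = http.get(url, (res) => res.resume());
  req.on('error', () => {}); // aborting surfaces as a connection error
  if (Math.random() < 0.5) {
    setTimeout(() => req.abort(), 50); // kill it before it completes
  }
}

for (let i = 0; i < 200; i++) {
  hitOnce('http://localhost:3000/some-image.jpg'); // hypothetical endpoint
}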

Is there even a remote chance that this could somehow be down to the fact that I run Ubuntu 16.04 on my machine, while the production system runs Debian Wheezy?

papandreou avatar Mar 28 '17 06:03 papandreou

I notice lcms2 does not use the "e" (close-on-exec) flag when it fopens ICC profiles - see https://github.com/mm2/Little-CMS/blob/master/src/cmsio0.c#L388

This means a new process spawned by an existing process could inherit open file descriptors from lcms2. If those spawned processes terminate abnormally, it's usually up to the OS to close dangling file descriptors, which might explain a difference in behaviour between Debian 7 and Ubuntu 16.04.
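
A quick way to see the inheritance in action from Node is to spawn a child and look at its descriptor table. A sketch, assuming Linux and a parent that has a file open without close-on-exec:

const fs = require('fs');
const { execFileSync } = require('child_process');

// Open a file in the parent. Descriptors opened without close-on-exec
// remain visible to children across fork/exec.
const fd = fs.openSync('/etc/hostname', 'r');

// The child lists its own fd table; any inherited descriptors show up here.
console.log(execFileSync('ls', ['-l', '/proc/self/fd']).toString());

fs.closeSync(fd);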

lovell avatar Mar 28 '17 16:03 lovell

@lovell That looks like a very good catch. I'll rerun my experiments on Wheezy and see if I can reproduce it.

Would it require the sharp workers to crash, or does lcms2 itself spawn processes? If so, is there any way I can detect that happening in my production system?

papandreou avatar Mar 30 '17 08:03 papandreou

If your node process uses child_process, either directly or via a dependency such as gm, those child processes can inherit an lcms2-opened file descriptor from an in-flight sharp/libvips request in the parent process.
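
One way to detect candidates in production is to look for open descriptors that lack the close-on-exec flag. A sketch that reads /proc, so Linux-only; the O_CLOEXEC constant is assumed to be Linux's value:

const fs = require('fs');

const O_CLOEXEC = 0o2000000; // octal value of O_CLOEXEC on Linux

// Report open fds in this process that are NOT marked close-on-exec and
// would therefore be inherited by any child process we spawn.
for (const fd of fs.readdirSync('/proc/self/fd')) {
  try {
    const target = fs.readlinkSync(`/proc/self/fd/${fd}`);
    const fdinfo = fs.readFileSync(`/proc/self/fdinfo/${fd}`, 'utf8');
    const flags = parseInt(fdinfo.match(/^flags:\s+(\d+)/m)[1], 8);
    if ((flags & O_CLOEXEC) === 0) {
      console.log(`fd ${fd} -> ${target} is inheritable`);
    }
  } catch (err) {
    // The fd disappeared while we were looking at it; ignore it.
  }
}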

I suspect there'll be a PR heading in lcms2's direction soon :)

lovell avatar Mar 30 '17 10:03 lovell

Okay, thanks for following up from that angle. Crossing my fingers that it'll help.

I've tried for a week to reproduce it on a Wheezy box without any luck, based on replaying that same 24h production workload with and without aborted requests. It seems like I don't really have a way to reproduce it except in the production env :(

papandreou avatar Apr 06 '17 08:04 papandreou

I just went back and checked how the service is running, and it seems that far fewer sRGB.icc fds are open now, and I basically have no idea why. It's down to 2-4 per worker process, even after a week of uptime, which isn't enough to raise any red flags.

I'll close this issue for now, but will report back if it arises again.

Again, thanks for the help, everyone!

papandreou avatar May 03 '17 13:05 papandreou

Thanks for the update, Andreas. It's probably still worth adding the "close-on-exec" flag to lcms2, as that's still the best guess as to what was happening here.

lovell avatar May 03 '17 19:05 lovell

PR for lcms - https://github.com/mm2/Little-CMS/pull/121

lovell avatar May 03 '17 21:05 lovell

See https://github.com/libvips/libvips/pull/2816

lovell avatar May 21 '22 14:05 lovell

v0.31.0 is now available, with prebuilt binaries containing updated versions of lcms2 and libvips that include the fix.

lovell avatar Sep 05 '22 09:09 lovell