
`DirectFileStore` creates too many files for long running applications

Open maximkosov opened this issue 5 years ago • 7 comments

We use Passenger to run our Rails application. Worker processes are recreated after every 100 requests. Recently we tried client_ruby with `DirectFileStore`. Each worker process has its own PID, so depending on server load there will be tens of thousands of files in the Prometheus work dir after a couple of hours or days if the app isn't restarted.

With 50k files in the Prometheus work dir, the /metrics route becomes very slow, with processing times around 10 seconds, which can lead to Prometheus scrape timeouts.

Is there any workaround for long-running processes with `DirectFileStore`? One option I can think of is to restart the app every few hours. Instead of restarting the whole application, we could just wipe the Prometheus work directory every few hours, but that looks a little hacky to me.

maximkosov avatar Jul 10 '19 08:07 maximkosov

This is related to issue #109, namely: how do you know whether a file is being written to by a process that's still alive? It's not exactly the same problem, since in this case we can't get rid of files from old processes (or else counters would go down), but I have a feeling both problems will be solved in a related way.

dmagliola avatar Jul 10 '19 10:07 dmagliola

I would think an appropriate fix for this would be to have each process hold a shared lock on the file it manages, for as long as it needs the file.

Then, every time a file store tries to create a new file, it examines the existing files in the directory, takes an exclusive lock on any it can (which only succeeds for files whose owning process has exited), compacts them into a new file, and deletes the originals.

Provided you think carefully about the concurrency control, this would be a decent mechanism for solving it. But at this point I reckon you should probably be using an mmap system.
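
For illustration, here's a minimal sketch of that locking scheme in Ruby, using flock(2) via `File#flock`. Everything here is hypothetical, not client_ruby's actual API: the `hold_shared_lock`/`compact_dead_files` helpers, the `merge_into` merge step, and the `*.bin` file layout are all assumptions.

```ruby
# In the owning process: open the metrics file and hold a shared lock
# on it for the lifetime of the process.
def hold_shared_lock(path)
  file = File.open(path, "a")
  file.flock(File::LOCK_SH)
  file # keep this handle open; closing it releases the lock
end

# In a process creating a new file: sweep the directory and compact any
# file whose owner has exited. merged_path is assumed to live outside
# dir so the glob doesn't pick it up.
def compact_dead_files(dir, merged_path)
  Dir.glob(File.join(dir, "*.bin")).each do |path|
    File.open(path, "r") do |file|
      # With LOCK_NB, flock returns false instead of blocking while the
      # owning process still holds its shared lock.
      next unless file.flock(File::LOCK_EX | File::LOCK_NB)

      merge_into(merged_path, file) # hypothetical: sum counter values in
      File.unlink(path)
    end
  end
end
```

Note that these flock semantics (and unlinking an open file) are POSIX behaviour, which feeds straight into the cross-OS question below.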

lawrencejones avatar Jul 10 '19 13:07 lawrencejones

Is there a locking system that will work across all OSes?

brian-brazil avatar Jul 10 '19 14:07 brian-brazil

> Is there a locking system that will work across all OSes?

That would certainly be a consideration, and you'd probably end up with the compaction as an optional configuration on the file store. But honestly, I'd rather see a more performant mmap approach before doing this. Just wanted to sketch out a possible implementation.

lawrencejones avatar Jul 10 '19 14:07 lawrencejones

We haven't solved this over in the Python client either, and we're using mmap there. The reads were actually changed recently to not use mmap, for performance.

brian-brazil avatar Jul 10 '19 14:07 brian-brazil

Another potential option would be to have one file per process, rather than one file per process/metric.

This doesn't fundamentally solve the issue, but it makes it orders of magnitude less severe. The downside is that each metric increment now effectively has a mutex around it, so performance suffers in multi-threaded scenarios.

But if, like most Ruby apps, each process runs single-threaded, or if it's not incrementing counters that often, the performance penalty will probably be negligible. Each store save should be on the order of single-digit microseconds, after all. A rough sketch of what such a store could look like is below.

This would need a separate store (I'm not proposing modifying the existing file store to do this), but that's the point of having swappable store backends, and it would be significantly easier to write than the compaction we're talking about here.
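
As a minimal sketch (hypothetical, not the actual `DirectFileStore` API): all metrics for the process share one file, one `Mutex` serializes every update, and each value lives at a fixed 8-byte offset.

```ruby
class SingleFileStore
  def initialize(dir:)
    # The file name is PID-unique, so "w+b" (create/truncate, read-write)
    # is safe: each process only ever touches its own file.
    @file = File.open(File.join(dir, "metrics_#{Process.pid}.bin"), "w+b")
    @lock = Mutex.new         # one file for all metrics => one mutex
    @offsets = {}             # metric key => byte offset of its float64
  end

  def increment(key, by: 1.0)
    @lock.synchronize do
      offset = @offsets[key] ||= allocate_slot
      @file.seek(offset)
      current = @file.read(8).unpack1("d")
      @file.seek(offset)
      @file.write([current + by].pack("d"))
    end
  end

  private

  # Append a zeroed float64 slot and return its offset. A real store
  # would also persist the key itself so the exporter can read it back.
  def allocate_slot
    @file.seek(0, IO::SEEK_END)
    offset = @file.pos
    @file.write([0.0].pack("d"))
    offset
  end
end
```

The point of the sketch is the single `@lock.synchronize` around every write: that's the mutex cost mentioned above, and it's what a single-threaded process never contends on.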

dmagliola avatar Jul 11 '19 06:07 dmagliola

I'm taking this out of the v0.10.0 milestone. I think we should come up with an answer to this problem, but I also think there's a lot of value in getting 0.10.0 out of alpha and into more people's hands in its current state.

I'm going to try and get through the documentation issues so we can get to a release.

After that, we can go one of a couple of ways:

  • If we find we need another run of breaking changes before 1.0, we can have a 0.11.0 release with alphas as needed.
  • If not, we'll basically promote what we've got here to 1.0.

Sinjo avatar Aug 19 '19 19:08 Sinjo