Avoid excessive heap utilisation due to in-memory creation of md5s

Open DanielThomas opened this issue 9 years ago • 5 comments

We noticed in an application with more than 100K files that we ran into problems while generating the checksums. This change writes the checksums to a file and then streams from that file to the output stream, to avoid heap utilisation during that phase.
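As a rough illustration of the approach described above (not the code in this PR), the checksums could be spooled to a temporary file and copied to the output stream afterwards. The class name `Md5SumsSpool`, the method names and the `<md5>  <path>` line layout are assumptions made for this sketch:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Hypothetical helper, not part of jdeb: spools checksum lines to disk
// instead of accumulating them in a StringBuilder.
final class Md5SumsSpool {

    /** Writes one "<md5>  <path>" line per file to a temp file, then copies that file to out. */
    static void writeMd5Sums(List<Path> files, OutputStream out)
            throws IOException, NoSuchAlgorithmException {
        Path spool = Files.createTempFile("md5sums", ".tmp");
        try {
            try (BufferedWriter writer = Files.newBufferedWriter(spool, StandardCharsets.UTF_8)) {
                for (Path file : files) {
                    // Only one line is held in memory at a time.
                    writer.write(md5Hex(file));
                    writer.write("  ");
                    writer.write(file.toString());
                    writer.write('\n');
                }
            }
            // Files.copy streams the spool file to the target in chunks.
            Files.copy(spool, out);
        } finally {
            Files.deleteIfExists(spool);
        }
    }

    /** Computes the MD5 of a file as lowercase hex, reading it through a fixed-size buffer. */
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192];
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md5)) {
            while (in.read(buffer) != -1) {
                // Reading drives the digest; the data itself is discarded.
            }
        }
        StringBuilder hex = new StringBuilder(32);
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

With this shape, heap use during checksum generation stays roughly constant regardless of the number of files, at the cost of some disk I/O and a temp file that has to be cleaned up.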

DanielThomas avatar Dec 04 '15 00:12 DanielThomas

Thanks for the contribution! I am a little puzzled though as to why this was a problem, even with 100k files. Assume 100k files times (random guess) 100 chars per line: that's 10,000,000 chars, which is probably around 20MB of RAM. Is that already what you meant by excessive? How much memory usage did you see? I am just wondering if this really was the problem.

tcurdt avatar Dec 04 '15 01:12 tcurdt

I'm going to set a breakpoint to catch the length, then take a heap dump and tell you exactly what the utilisation is. It's certainly in the hundreds of megabytes, due to the length of the paths.

DanielThomas avatar Dec 04 '15 01:12 DanielThomas

Awesome - thanks!

Hundreds of megabytes? That sounds quite fishy. Actually, maybe you could print out the file size of the temp file? Or even better, provide the file, even if obfuscated (e.g. with a simple tr)?

tcurdt avatar Dec 04 '15 01:12 tcurdt

The final md5sums file is 33M. The StringBuilder will retain double that of course, thanks to Java's 2-byte representation of strings:

java.lang.StringBuilder [JNI Global, Stack Local ← checksums, md5s] 75497512

And two more copies again of the same bytes:

  • checksums.toString() @ ControlBuilder:147
  • pContent.getBytes("UTF-8") @ ControlBuilder:212
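The copy chain listed above can be reproduced in miniature. This is only a sketch with made-up paths and a hypothetical class name `ChecksumCopies`; it does not reproduce the ControlBuilder code, it just shows that each step allocates its own array:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: mirrors the StringBuilder -> String -> byte[] chain.
final class ChecksumCopies {

    /** Sizes of the three in-memory representations of the same md5sums content. */
    static final class Sizes {
        final int builderCapacityChars; // backing array of the StringBuilder
        final int stringChars;          // copy made by toString()
        final int utf8Bytes;            // copy made by getBytes("UTF-8")

        Sizes(int builderCapacityChars, int stringChars, int utf8Bytes) {
            this.builderCapacityChars = builderCapacityChars;
            this.stringChars = stringChars;
            this.utf8Bytes = utf8Bytes;
        }
    }

    static Sizes measure(int lines) {
        StringBuilder checksums = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            // Made-up digest and path, for illustration only.
            checksums.append("5d41402abc4b2a76b9719d911017c592  usr/share/app/file")
                     .append(i)
                     .append('\n');
        }
        String content = checksums.toString();                    // second copy
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);  // third copy
        return new Sizes(checksums.capacity(), content.length(), bytes.length);
    }
}
```

The builder's capacity is usually larger than its length because the backing array grows in steps. For ASCII-only content the UTF-8 byte array holds one byte per char. On Java 9 and later, compact strings store Latin-1-only text at one byte per char, so the 2-byte figure applies to the older JVMs in use at the time of this thread.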

So a little over 220MB I guess. Background is here incidentally (we've got our fair share of heap issues in our Gradle plugin!):

https://github.com/nebula-plugins/gradle-ospackage-plugin/issues/142

DanielThomas avatar Dec 04 '15 01:12 DanielThomas

Thanks for digging into this. I guess the two copies are where the problem becomes excessive. I am wondering if we could dial back the crazy by getting rid of those copies. On the other hand, in-memory will always hurt scalability.
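One way the extra copies might be avoided while staying in memory is to encode each line to UTF-8 as it is produced, so that only the encoded bytes are retained. This is a sketch under that assumption, with a hypothetical class `EncodedChecksums`; it is not how jdeb is structured:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical holder: keeps only the UTF-8 bytes, never a full String or StringBuilder.
final class EncodedChecksums {

    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();

    /** Encodes "<md5>  <path>" lines straight into a byte buffer, one line at a time. */
    static EncodedChecksums of(Map<String, String> md5ByPath) throws IOException {
        EncodedChecksums result = new EncodedChecksums();
        // Closing the writer flushes it; closing a ByteArrayOutputStream has no effect.
        try (Writer writer = new OutputStreamWriter(result.bytes, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, String> entry : md5ByPath.entrySet()) {
                writer.write(entry.getValue());
                writer.write("  ");
                writer.write(entry.getKey());
                writer.write('\n');
            }
        }
        return result;
    }

    /** Number of encoded bytes, e.g. for an archive entry that needs its size up front. */
    int size() {
        return bytes.size();
    }

    /** Writes the buffered bytes to out without first copying them into a new array. */
    void writeTo(OutputStream out) throws IOException {
        bytes.writeTo(out);
    }
}
```

This still scales with the number of files, so it reduces the constant factor rather than removing the in-memory limit.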

I am not so eager to use temp files, but at first look the PR seems reasonable. I need to poke around a bit more, but I am inclined to accept it. Thanks for your work!

(I so need to get started on jdeb2)

tcurdt avatar Dec 04 '15 10:12 tcurdt