
Memory leak in Writer?

Open JohnEmhoff opened this issue 4 years ago • 8 comments

Hello! Thanks for pyorc; using it has been a pleasure so far, with the exception that we seem to be running into memory issues. I think Writer is leaking memory? Our workload is roughly:

  • Open ~100 writers to different files
  • Iterate over our input rows (in the millions) and send each row to exactly one writer
  • Close all writers
  • Repeat

Memory usage grows without bound between iterations. This, coupled with the fact that lowering the stripe size all the way down to 1M has no effect, makes me suspect a memory leak. Below is a script that reproduces it: around iteration 10 the process reaches 20G and is then killed by the OOM killer on my machine. Let me know if there's anything I can do to help track it down!

https://gist.github.com/JohnEmhoff/274f6e05cba3f17a16683eb394bfe6b5
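For readers who cannot open the gist, the workload described above can be sketched roughly as follows. This is not the reporter's script: `run_iteration`, `peak_rss_per_iteration`, the round-robin routing of rows, and the use of temporary files are all assumptions made for illustration. The writer is passed in as a factory so the harness does not depend on any particular library.

```python
import resource
import tempfile
from contextlib import ExitStack
from typing import Any, Callable, Iterable, List


def run_iteration(open_writer: Callable[[Any], Any],
                  rows: Iterable[Any],
                  n_writers: int = 100) -> None:
    """One pass of the workload: open writers, route rows, close writers."""
    with ExitStack() as stack:
        # Hypothetical stand-in for "different files": anonymous temp files.
        files = [stack.enter_context(tempfile.TemporaryFile())
                 for _ in range(n_writers)]
        writers = [open_writer(f) for f in files]
        # Each row goes to exactly one writer (round-robin is an assumption).
        for i, row in enumerate(rows):
            writers[i % n_writers].write(row)
        for w in writers:
            w.close()


def peak_rss_per_iteration(open_writer: Callable[[Any], Any],
                           make_rows: Callable[[], Iterable[Any]],
                           iterations: int,
                           n_writers: int = 100) -> List[int]:
    """Repeat the workload and record the process's peak RSS after each pass.

    ru_maxrss is a high-water mark (KiB on Linux, bytes on macOS), so the
    series never decreases; a series that keeps climbing instead of
    levelling off is what hints at a leak.
    """
    samples = []
    for _ in range(iterations):
        run_iteration(open_writer, make_rows(), n_writers)
        samples.append(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    return samples
```

With pyorc installed, one could pass something like `lambda f: pyorc.Writer(f, "struct<a:int,b:string>")` as `open_writer` and a generator of `(int, str)` tuples as `make_rows`; the schema here is made up and much narrower than the one in the gist.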

JohnEmhoff avatar Jan 26 '20 01:01 JohnEmhoff

I managed to trim down the script a good bit. It turns out writing data is unnecessary; the leak happens just from creating writers:

https://gist.github.com/JohnEmhoff/55f562c2de701dfb426643a3e7751ef8

JohnEmhoff avatar Jan 27 '20 02:01 JohnEmhoff

Thank you for reporting it.

I think I have narrowed the problem down to the point where Writer's constructor builds an orc::Type from the TypeDescription.

I'm still looking for the concrete source of the leak.

noirello avatar Jan 28 '20 18:01 noirello

Thanks for looking into it. I think you're right -- I noticed that when my spec in the script above is just a column or two, it leaks much, much more slowly.

JohnEmhoff avatar Jan 29 '20 13:01 JohnEmhoff

Even after bypassing the TypeDescription object, the script still failed to run the iterations to the end. It seems like the orc::Writer object is somehow mishandled. Valgrind was not very helpful (although using it was never my strong suit).

noirello avatar Feb 09 '20 18:02 noirello

I have the same problem. I tried to dig a bit, and it seems the source of the leak is the creation of multiple ColumnWriter objects (of any type: string, float, or int). The leak is proportional to the number of columns. Even more memory is leaked when ZLIB or ZSTD compression is enabled (compression is currently enabled by default).

clynamen avatar Sep 03 '20 09:09 clynamen

Also, I noticed the stripe size is not being honored. The stripe is not flushed to disk, and the memory is (probably) not freed either, but this part is handled by the C++ library, which makes it harder to debug :(

carlosfvp avatar Jun 30 '21 04:06 carlosfvp

I found this recommendation: a method named writeIntermediateFooter will flush the content to the file and free some memory, but it only exists in the Java version of the OrcWriter 😥

https://www.mail-archive.com/[email protected]/msg00225.html

https://orc.apache.org/api/orc-core/org/apache/orc/impl/WriterImpl.html#writeIntermediateFooter--

carlosfvp avatar Jun 30 '21 21:06 carlosfvp

I found a similar problem: I cannot flush content to the file manually, and the batch_size parameter of Writer seems to have no effect. Any solutions?

pokerc avatar Jul 27 '21 12:07 pokerc