pyorc
Memory leak in Writer?
Hello! Thanks for pyorc; using it has been a pleasure so far, with the exception that we seem to be running into memory issues. I think Writer is leaking memory? Our workload is roughly:
- Open ~100 writers to different files
- Iterate over our input rows (in the millions) and send each row to exactly one writer
- Close all writers
- Repeat
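For readers who want to picture the workload above, here is a minimal sketch of one iteration. It is not the script from the gist: the schema, the in-memory `io.BytesIO` sinks standing in for real files, and the helper names `route_rows` and `run_iteration` are all assumptions made for illustration.

```python
import io


def route_rows(rows, writers, key):
    """Send each row to exactly one writer, chosen by key(row) modulo the writer count."""
    n = len(writers)
    for row in rows:
        writers[key(row) % n].write(row)


def run_iteration(rows, n_writers=100, schema="struct<id:int,name:string>"):
    """One iteration: open the writers, route every row, then close all writers.

    The schema and the BytesIO sinks are illustrative assumptions.
    """
    import pyorc  # imported lazily so route_rows can be used without pyorc installed

    writers = [pyorc.Writer(io.BytesIO(), schema) for _ in range(n_writers)]
    try:
        # Rows are tuples matching the struct schema; the first field picks the writer.
        route_rows(rows, writers, key=lambda row: row[0])
    finally:
        for writer in writers:
            writer.close()
```

Calling `run_iteration` in a loop over fresh batches of rows corresponds to the "Repeat" step.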
Memory usage grows without bound between iterations. This, coupled with the fact that lowering the stripe size all the way down to 1M has no effect, makes me suspect a memory leak. Below is a script that reproduces the problem -- around iteration 10 it reaches 20G and is then killed by the OOM killer on my machine. Let me know if there's anything I can do to help track it down!
https://gist.github.com/JohnEmhoff/274f6e05cba3f17a16683eb394bfe6b5
I managed to trim down the script a good bit -- it turns out writing data is unnecessary; the leak happens just from creating writers:
https://gist.github.com/JohnEmhoff/55f562c2de701dfb426643a3e7751ef8
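A rough way to watch for this kind of growth without the gist at hand is to record the process's peak resident set size after each iteration. The sketch below is not the linked script; the schema, the writer count, and the helper names `peak_rss`, `measure_growth`, and `create_and_close_writers` are assumptions. It relies on the standard `resource` module, which is only available on Unix.

```python
import io
import resource  # Unix only


def peak_rss():
    """Peak resident set size of this process (kilobytes on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def measure_growth(workload, iterations=10):
    """Run workload() repeatedly and record the peak RSS after each run."""
    samples = []
    for _ in range(iterations):
        workload()
        samples.append(peak_rss())
    return samples


def create_and_close_writers(n_writers=100, schema="struct<id:int,name:string>"):
    """Create and immediately close writers without writing any rows.

    The schema and writer count are illustrative assumptions.
    """
    import pyorc  # imported lazily so the measuring helpers work without pyorc

    for _ in range(n_writers):
        writer = pyorc.Writer(io.BytesIO(), schema)
        writer.close()
```

Printing the result of `measure_growth(create_and_close_writers)` gives one sample per iteration. Because the value is a peak, it never decreases; a leak would show up as samples that keep climbing instead of levelling off after the first few iterations.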
Thank you for reporting it.
I think I've pinpointed the problem to the point where Writer's constructor builds an orc::Type from the TypeDescription.
I'm still looking for the concrete source of the leak.
Thanks for looking into it. I think you're right -- I noticed that when my spec in the script above is just a column or two, it leaks much, much more slowly.
Even after bypassing the TypeDescription object, it still failed to run the iterations to the end. It seems like the orc::Writer object is somehow mishandled. Valgrind is not very helpful (although using it was never my strongest suit).
I have the same problem. I tried to dig a bit and it seems the source of the leak is the creation of multiple ColumnWriter objects (of any type: string, float, or int). The leak is proportional to the number of columns. Even more memory is leaked when ZLIB or ZSTD compression is enabled (compression is currently enabled by default).
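To check the two observations above (growth proportional to column count, and worse with compression), one could vary the schema width and the compression setting while only creating writers. This is a sketch under assumptions: the all-int schema, the writer count, and the helper names `make_schema` and `create_writers` are made up for illustration, and it does not measure memory by itself.

```python
import io


def make_schema(n_columns):
    """Build an ORC struct schema string with n_columns int columns."""
    fields = ",".join(f"c{i}:int" for i in range(n_columns))
    return f"struct<{fields}>"


def create_writers(n_columns, n_writers=100, compressed=True):
    """Create and close writers for a schema of the given width.

    Passing compressed=False turns compression off explicitly, so the
    compressed and uncompressed cases can be compared.
    """
    import pyorc  # imported lazily so make_schema works without pyorc

    kind = pyorc.CompressionKind.ZLIB if compressed else pyorc.CompressionKind.NONE
    schema = make_schema(n_columns)
    for _ in range(n_writers):
        writer = pyorc.Writer(io.BytesIO(), schema, compression=kind)
        writer.close()
```

Running `create_writers` in a loop for, say, 2 columns versus 50 columns while watching the process's memory in an external tool should show whether growth scales with column count on a given setup.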
Also, I noticed the stripe size is not being honored. The stripe is not flushed to disk, and the memory is (probably) not freed either, but this part is handled by the C++ library, which makes it harder to debug :(
I found this recommendation: a method named writeIntermediateFooter will flush the content to the file and free some memory, but it only exists in the Java version of the OrcWriter 😥
https://www.mail-archive.com/[email protected]/msg00225.html
https://orc.apache.org/api/orc-core/org/apache/orc/impl/WriterImpl.html#writeIntermediateFooter--
I found a similar problem: I cannot flush content to the file manually, and the batch_size parameter of Writer seems to have no effect. Any solutions?