kotlin-jupyter
Heap space not released until OOM?
Hello,
I'm experiencing increasing memory consumption in a notebook until I end up with a "java.lang.OutOfMemoryError: Java heap space", although I'm just re-executing the only cell in the notebook.
Here is my example using Kotlin Dataframe:
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.io.readCSV
val data = DataFrame.readCSV("mydata.csv")
The .csv file is about 260 MB, but I guess that shouldn't matter; it only influences how long it takes to run out of memory.
As I re-execute this cell again and again, I can see the memory consumption of the kernel's Java process increase until it hits the memory limit I configured, and I end up with the OutOfMemoryError. While the kernel keeps the content of data for subsequent cells, I would expect it to release any old "version" of that variable when I re-execute the cell. Hence, the memory consumption should not increase with every execution.
Seems like a bug to me. Or am I doing it wrong? Or just misunderstanding how it's supposed to work?
Hello,
Thank you for providing a clear description. It was easy to understand and reproduce.
In short, if you only re-execute a cell (without a kernel restart), the kernel doesn't release any old "version" of that variable, nor the memory allocated to the previous ones.
More specifically: we store instances for all executed snippets and may reference these values even if they are hidden by variables defined later. That's why we hold hard references to all objects that were defined as variables in all the snippets.
And as a result, repeatedly executing this cell will lead to an OutOfMemoryError over time.
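The retention behaviour described above can be sketched with a toy model. This is not the kernel's actual code: `ToyRepl`, `SnippetInstance`, and their members are hypothetical names, and the only thing carried over from the explanation is the assumption that every executed snippet instance stays strongly referenced.

```kotlin
// Hypothetical stand-in for one executed snippet and the value it defined.
class SnippetInstance(val cellId: Int, val data: ByteArray)

// Toy model (assumption): keeps a hard reference to every executed snippet,
// which is the behaviour described in the reply above.
class ToyRepl {
    private val executed = mutableListOf<SnippetInstance>()

    // "Executing" a cell creates a new instance; older ones are never dropped.
    fun execute(cellId: Int, payloadBytes: Int): SnippetInstance {
        val instance = SnippetInstance(cellId, ByteArray(payloadBytes))
        executed += instance
        return instance
    }

    fun retainedInstances(): Int = executed.size

    fun retainedBytes(): Long = executed.sumOf { it.data.size.toLong() }
}

fun main() {
    val repl = ToyRepl()
    // Re-run the same cell five times, as in the report.
    repeat(5) { repl.execute(cellId = 1, payloadBytes = 1_000_000) }
    println("instances=${repl.retainedInstances()}, bytes=${repl.retainedBytes()}")
}
```

In this model the retained size grows linearly with the number of re-executions, which matches the growth seen in the kernel's Java process until the heap limit is reached.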
We also already have a somewhat similar task (for improving the UX in such cases) for the Kotlin Notebook project in our YouTrack, and you can vote for it: https://youtrack.jetbrains.com/issue/KTNB-553/Show-notification-with-action-in-the-case-of-java.lang.OutOfMemoryError-Java-heap-space
Did I answer your question? Has it become clearer now?
Thank you for reproducing this and for the clarification; at least I know that I was not doing it wrong. Unfortunately, restarting the kernel is not a good option for me, because I'd lose the data loaded in earlier cells and would have to execute the entire notebook from the start.
And I still see no reason why the kernel should keep the data from previous executions, because it would not be accessible anymore once a variable is overridden. Are there plans to change this?
.... because it would not be accessible anymore once a variable is overridden.
It's not like that.
Even in this case you have access to the previous values of a variable.
Even in the same cell (after re-executing):
It means basically this:
... we store instances for all executed snippets and may reference these values even if they are hidden with the variables defined later.
And about:
Are there plans to change this?
We don't have such plans, at least for now, but in general this question is quite complicated. For example, imagine that you have code like this:
val x = 3 // first cell
---
fun f() = x * 3 // second cell
---
val x = "str" // third cell
---
f() // What should this cell produce now?
In the current architecture it's clear as long as we store all values. If we override values instead of hiding them, it's not so clear.
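One way to picture the "hiding" semantics is to model each cell as its own object. This is only a sketch of one reading of the reply above, under the assumption that `f` is bound to the `x` that was visible when the second cell was compiled; `Cell1`, `Cell2`, and `Cell3` are hypothetical names, not the kernel's real classes.

```kotlin
// First cell: defines x as an Int.
class Cell1 { val x = 3 }

// Second cell: f refers to the x from the first cell (assumed binding).
class Cell2(private val cell1: Cell1) {
    fun f() = cell1.x * 3
}

// Third cell: defines a new x that hides the first one for later cells.
class Cell3 { val x = "str" }

fun main() {
    val cell1 = Cell1()
    val cell2 = Cell2(cell1)
    val cell3 = Cell3()

    // Fourth cell: f still sees the Int from the first cell in this model.
    println(cell2.f())
    // Later cells that mention x would see the String from the third cell.
    println(cell3.x)
}
```

In this model `f()` stays well-typed only because the first `x` is still alive. If the third cell overwrote `x` instead of hiding it, `f` would be multiplying a String by 3, which is the ambiguity the example points at.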
But let me ask a question: why do you need to re-execute a cell with a dataset again and again? Could you please describe your use case here?
Ah, thanks. I didn't know that I can access these variables via this@....
I must admit that I don't know what f() would or should do in your last example. Maybe this should not even compile, to prevent me from writing code I don't understand. ;-) Well, seriously, I would have expected all symbols to exist once per kernel and the last cell to write it "wins". But I don't have much experience with Jupyter, so if that's the way it is supposed to work, I happily learn something new.
Re-executing cells usually happens while I write my notebook, i.e. I change the code and try it until it works. And when I run into the OOM-Error, I have to restart the kernel and execute the notebook from the beginning. Together with kotlin-dataframe's high memory consumption (https://github.com/Kotlin/dataframe/issues/141), this happens frequently.
....Well, seriously, I would have expected all symbols to exist once per kernel and the last cell to write it "wins"....
In general it sounds nice, but this question is open to debate.
...I happily learn something new.
And thank you for your questions! They give us a new point of view and are very useful for future improvements.
Re-executing cells usually happens while I write my notebook, i.e. I change the code and try it until it works....
This part is quite interesting. As a partial workaround, you can try reading the file in a separate cell and keeping the rest of the code in other cells.
val data = DataFrame.readCSV("mydata.csv")
This should help if you don't often change the original dataset (file). But it won't help if you need to update your original file.
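Why splitting the cells helps can be sketched with a toy model, assuming (as described earlier in the thread) that the kernel keeps whatever each executed cell produced. `ToyKernel`, `LoadedData`, and `execute` are hypothetical names used only for illustration.

```kotlin
// Hypothetical stand-in for a large loaded dataset.
class LoadedData(val bytes: ByteArray)

// Toy model (assumption): every executed cell's result stays referenced.
class ToyKernel {
    private val retained = mutableListOf<Any>()

    fun <T : Any> execute(cell: () -> T): T {
        val result = cell()
        retained += result
        return result
    }

    // Number of large datasets that are still reachable.
    fun retainedLoads(): Int = retained.count { it is LoadedData }
}

fun main() {
    val kernel = ToyKernel()

    // Load cell: executed once.
    val data = kernel.execute { LoadedData(ByteArray(1_000_000)) }

    // Analysis cell: edited and re-executed many times, reusing the loaded data.
    repeat(10) { kernel.execute { data.bytes.size / 2 } }

    println("large datasets retained: ${kernel.retainedLoads()}")
}
```

In this model only the small results of the analysis cell accumulate, while the large dataset is held once. Re-running the load cell itself would still add another copy each time.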
Another option: if you use IntelliJ IDEA (Ultimate Edition) with the Kotlin Notebook plugin, you can set the heap size manually in the settings, which should also help a bit (if you have enough memory).
Thanks for the tips. I already increased the memory limit when I first encountered the OOM-Error. And splitting the code into smaller cells would help, too. But all this only postpones the problem. With a growing notebook, I guess I would always hit the memory limit sooner or later. So some mechanism to release memory should be implemented. If access to values from former executions of cells is a valuable feature (actually, I don't feel the need for that), a way to throw away data explicitly would be helpful, so users can actively decide that they don't need that data anymore.
Thank you for your ideas! I agree that maybe we can somehow improve the UX/UI here.
I filed an issue in our tracker (Kotlin Notebook), and you can vote for it if you wish.
But let me ask a question: why do you need to re-execute a cell with a dataset again and again? Could you please describe your use case here?
Hi, I found this thread when I also ran into memory issues. Christopher covered most of the issue, but I just wanted to emphasize that this scenario (re-executing the same cell) is extremely common. What makes notebooks so great for development (data analytics, but also a lot of general software development) is the ability to break the code up into chunks and keep modifying one small chunk over and over until it produces the desired result. Or, in the case of data analysis, it is often to change a parameter and see how the results change. With something like Kotlin DataFrame (this is also true of Python's pandas), the capabilities are very powerful, but it can often be very tricky to get the syntax exactly right on the first or second try (group by, pivot, aggregate, and much more). Notebooks also allow you to branch off: say the first 5 cells work fine, but cells 6 and 7 take two different approaches to the next step; you may then re-run #5 many times to "reset" the variable to an earlier state (i.e. coming from cell #4).
Basically, the notebook is most useful in the earliest stage of development, specifically because of the ability to continually iterate on individual code blocks, as opposed to having to continually start a script over from scratch. Most code will eventually "graduate" from a notebook to a more formal script or program with a traditional top-to-bottom execution (or function call, etc.).
I just wanted to offer a little more detail in answering that specific question you had and explain this very common scenario, emphasizing that it's very important for any notebook kernel to anticipate. Thanks :)