
ML Notebook consumes all the available memory, forcing Windows to close processes

Open · andrasfuchs opened this issue 2 years ago · 9 comments

The Training and AutoML notebook can consume a large amount of memory, causing other processes to hang or crash.

Strangely enough, it usually works fine if you run the notebook only once. So to reproduce the problem, you should:

  1. Open Windows Task Manager and check your memory usage.
  2. Open the Training and AutoML notebook.
  3. Run its snippets one by one, but stop at "Use AutoML to simplify trainer selection and hyper-parameter optimization."
  4. Run the "Use AutoML to simplify trainer selection and hyper-parameter optimization" code.
  5. Sometimes it works fine, but last time at this point my system hung, terminated some VS processes, and closed my browser unexpectedly. Memory consumption dropped back to ~950 MB, and the notebook got into a seemingly endless loop of "Starting Kernel".
  6. When I tried to re-run the "Use AutoML to simplify trainer selection and hyper-parameter optimization" code snippet again, I got the following exception, repeating over and over:
error: The JSON-RPC connection with the remote party was lost before the request could complete. 
    at StreamJsonRpc.JsonRpc.<InvokeCoreAsync>d__154.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at StreamJsonRpc.JsonRpc.<InvokeCoreAsync>d__143`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.VisualStudio.Notebook.Utils.DetectKernelStatusService.<ExecuteTaskAsync>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.VisualStudio.Notebook.Utils.RepeatedTimeTaskService.<>c__DisplayClass7_0.<<ExecuteAsync>b__1>d.MoveNext()
  7. If you can run the notebook without issues, try re-running the "Use AutoML to simplify trainer selection and hyper-parameter optimization" code many times; the problem is inconsistent on my machine as well.

andrasfuchs avatar Jul 04 '22 12:07 andrasfuchs

I suspect it's because the trial is still running even after that AutoML cell has finished. Somehow AutoMLExperiment doesn't always succeed in cancelling the last running trial.
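
One workaround worth trying is to own the cancellation token and cancel it yourself once the cell finishes. The snippet below is only a sketch against the AutoML v2 API (it assumes AutoMLExperiment.RunAsync accepts a CancellationToken); the pipeline, dataset, and metric setup are placeholders rather than the notebook's actual configuration:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ML;
using Microsoft.ML.AutoML;

public static class ExperimentRunner
{
    // Run an AutoML experiment with an explicitly owned CancellationToken, so any
    // trial that keeps training after RunAsync returns can be cancelled by hand.
    public static async Task<TrialResult> RunWithExplicitCancellationAsync(
        MLContext mlContext, SweepablePipeline pipeline, IDataView train, IDataView validate)
    {
        var experiment = mlContext.Auto().CreateExperiment()
            .SetPipeline(pipeline)
            .SetDataset(train, validate)
            .SetRegressionMetric(RegressionMetric.RSquared) // example metric only
            .SetTrainingTimeInSeconds(60);

        using var cts = new CancellationTokenSource();
        var result = await experiment.RunAsync(cts.Token);

        // If a straggler trial is still running after the experiment reports
        // completion, cancelling the token here should stop it.
        cts.Cancel();
        return result;
    }
}
```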

LittleLittleCloud avatar Jul 06 '22 20:07 LittleLittleCloud

We probably also need to clean up some things in our NotebookMonitor -

https://github.com/dotnet/machinelearning/blob/main/src/Microsoft.ML.AutoML.Interactive/NotebookMonitor.cs

It could be holding references to a lot of things.

@andrasfuchs if you "restart kernel" does it free up the memory for you?

I'll dig more to see if I can find anything.
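
A throwaway cell like the one below, run right after the AutoML cell, could also help tell memory that is still rooted by live references apart from memory that simply has not been collected yet (just a sketch, not something the notebook ships today):

```csharp
using System;

// Force a full collection, then print how much memory the process still holds.
// If the numbers stay high, something (e.g. the monitor or trial results) is
// still holding references; if they drop, the memory was just uncollected.
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();

Console.WriteLine($"managed heap after GC: {GC.GetTotalMemory(forceFullCollection: true) / (1024 * 1024)} MB");
Console.WriteLine($"process working set:   {Environment.WorkingSet / (1024 * 1024)} MB");
```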

JakeRadMSFT avatar Jul 07 '22 17:07 JakeRadMSFT

@JakeRadMSFT How can I restart the kernel?

andrasfuchs avatar Jul 11 '22 07:07 andrasfuchs

@andrasfuchs if you’re using the latest notebook editor extension, there is a restart button in the notebook toolbar.


JakeRadMSFT avatar Jul 12 '22 17:07 JakeRadMSFT

I tried it again today, but after a "Run All", it got crazy again, eating up the memory and closing other running processes.


The critical part got terminated with an exception.


The memory was not freed up after the exception; I had to close the Visual Studio process manually, so I had no chance to test the kernel restart.

andrasfuchs avatar Jul 13 '22 13:07 andrasfuchs

@LittleLittleCloud thoughts here?

JakeRadMSFT avatar Jul 13 '22 18:07 JakeRadMSFT

I was thinking there are some places where we forget to clear trial results and release memory (like holding all models in memory), but I didn't see the memory go up while training. So now I suspect the excessive memory usage is caused by the LightGBM trainer, which may have bad memory-allocation behavior, especially when the search space gets big.

@andrasfuchs Can you try disabling the LightGBM trainer by setting useLgbm: false next to useSdca: false in the following code snippet, and then try the notebook again?
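
The suggested change would look roughly like this. It is only a sketch of the relevant call, not the notebook's exact cell; the regression task and the "Label" column name are placeholders:

```csharp
using Microsoft.ML;
using Microsoft.ML.AutoML;

var mlContext = new MLContext();

// Build the sweepable trainers with both SDCA and LightGBM disabled.
var trainers = mlContext.Auto().Regression(
    labelColumnName: "Label", // placeholder, use the notebook's label column
    useSdca: false,           // already disabled in the notebook
    useLgbm: false);          // newly disabled to rule LightGBM out as the memory hog
```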

LittleLittleCloud avatar Jul 13 '22 19:07 LittleLittleCloud

And @JakeRadMSFT, maybe it would be helpful to add a system monitor section alongside the trial monitor?
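
Even something as simple as a cell that polls process memory while the experiment runs might already be useful. A rough sketch (not an existing feature of Microsoft.ML.AutoML.Interactive):

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Print the process working set and managed heap size every few seconds until cancelled.
static async Task MonitorMemoryAsync(CancellationToken token)
{
    var process = Process.GetCurrentProcess();
    while (!token.IsCancellationRequested)
    {
        process.Refresh();
        Console.WriteLine(
            $"working set: {process.WorkingSet64 / (1024 * 1024)} MB, " +
            $"managed heap: {GC.GetTotalMemory(false) / (1024 * 1024)} MB");
        await Task.Delay(TimeSpan.FromSeconds(5));
    }
}

// Usage (hypothetical): start the monitor, run the experiment, then cancel it.
// using var cts = new CancellationTokenSource();
// var monitorTask = MonitorMemoryAsync(cts.Token);
// var result = await experiment.RunAsync();
// cts.Cancel();
```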

LittleLittleCloud avatar Jul 13 '22 19:07 LittleLittleCloud

I got gray rectangles instead of the results, but the memory problem seems to be better when I use useLgbm: false.

10+ GB of RAM usage is still a lot, I think...

...and this memory is not freed up after the notebook run completes.

andrasfuchs avatar Jul 17 '22 22:07 andrasfuchs