Node server crash

Open XiaoXiao-Woo opened this issue 1 year ago • 6 comments

[image] Besides, when I use NNI to connect to another machine (it can connect to itself with the "remote" platform), the same problem occurs: "FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory"

Environment:

  • NNI version: master
  • Training service (local|remote|pai|aml|etc): local, and reusemode=False
  • Client OS: windows 10.0.19042.1466
  • Server OS (for remote mode only): windows 10.0.19042.1466
  • Python version: 3.8
  • PyTorch/TensorFlow version:
  • Is conda/virtualenv/venv used?:
  • Is running in Docker?:

Configuration:

  • Experiment config (remember to remove secrets!):
  • Search space:

Log message:

  • nnimanager.log:
  • dispatcher.log:
  • nnictl stdout and stderr:

How to reproduce it?:

XiaoXiao-Woo avatar Jun 29 '23 07:06 XiaoXiao-Woo

I'm also getting this error consistently when the number of trials in an experiment gets to about 40,000. It has happened on more than 5 different experiments.

NNI version: master
Training service (local|remote|pai|aml|etc): local, and reusemode=False
Client OS: ubuntu 22.04.2 LTS
Server OS (for remote mode only): n/a
Python version: 3.7.4
PyTorch/TensorFlow version: n/a
Is conda/virtualenv/venv used?: conda
Is running in Docker?: no

studywolf avatar Aug 06 '23 23:08 studywolf

You can set export NODE_OPTIONS="--max_old_space_size=8192" for a quick fix.
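
If the experiment is started from a Python script rather than an interactive shell, the same setting can be passed through the child process environment. The following is only a sketch: `with_node_heap_limit` is a hypothetical helper, not part of NNI, and it assumes the NNI manager's Node.js process inherits `NODE_OPTIONS` from whatever launches it. It replaces any existing heap-size flag (in either the underscore or the hyphen spelling) and keeps other options untouched.

```python
import os


def with_node_heap_limit(env, megabytes):
    """Return a copy of `env` whose NODE_OPTIONS sets the V8 old-space limit.

    Hypothetical helper for illustration; it is not an NNI API.
    """
    flag = f"--max-old-space-size={megabytes}"
    existing = env.get("NODE_OPTIONS", "").split()
    # Drop any earlier heap-size flag; Node accepts both the underscore
    # and the hyphen spelling, so normalise before comparing.
    kept = [
        opt for opt in existing
        if not opt.replace("_", "-").startswith("--max-old-space-size")
    ]
    new_env = dict(env)  # leave the caller's mapping unmodified
    new_env["NODE_OPTIONS"] = " ".join(kept + [flag])
    return new_env


if __name__ == "__main__":
    env = with_node_heap_limit(os.environ, 8192)
    print(env["NODE_OPTIONS"])
    # The environment could then be handed to the launcher, for example:
    #   subprocess.run(["nnictl", "resume", "<experiment id>"], env=env)
    # where the experiment id is a placeholder.
```

Note that this only raises the ceiling; as the later replies show, it does not stop memory from growing with the number of trials.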

liuzhe-lz avatar Aug 08 '23 07:08 liuzhe-lz

ah, thanks for the suggestion! With that set, the run starts up when I resume, but then it immediately fails silently.

studywolf avatar Aug 11 '23 00:08 studywolf

any progress on this?

studywolf avatar Sep 06 '23 00:09 studywolf

checking in again

studywolf avatar Oct 06 '23 17:10 studywolf

You can set export NODE_OPTIONS="--max_old_space_size=8192" for a quick fix.

unfortunately, I'm still getting the heap out of memory error after 29k trials. This is in NNI 3.0.

studywolf avatar Oct 07 '23 17:10 studywolf