
Running Error


Hello,

I am getting an error while running paper_main_1D.py. I am using the Spyder IDE with Anaconda.

"OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\Users\gaura\Anaconda3\envs\torch\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies."

Gaurav11ME · Aug 19 '21 11:08

This looks related to this PyTorch issue: https://github.com/ultralytics/yolov3/issues/1643

A few thoughts come to mind:

  1. It could be a RAM issue: the script currently runs 23 parallel processes to train the runs defined in it. You could try setting DEVICES = ["cpu"]*4 to use fewer processes (and less RAM).
  2. It could be that the multiprocessing pool of workers defined in shared_modules/multiprocess.py is not "playing nicely" with Windows (all of my tests were carried out on Linux/macOS). It is worth testing without this class (i.e. training all of the runs in a simple for loop on the main thread, as sketched below) to see whether this is the problem.
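
For concreteness, here is a minimal sketch of what option 2 could look like. It is an assumption about the script's structure, not the repository's actual API: train_model and constants_list are hypothetical stand-ins for whatever paper_main_1D.py actually defines for its runs and training entry point.

```python
# Hypothetical sketch: train all runs sequentially on the main thread,
# bypassing the worker pool in shared_modules/multiprocess.py.

def train_model(constants):
    # Placeholder for the per-run training entry point.
    print(f"training with {constants}")

def main_sequential(constants_list):
    # Train every run one after another on the main thread, taking the
    # multiprocessing pool out of the equation entirely.
    for i, constants in enumerate(constants_list):
        print(f"run {i + 1} of {len(constants_list)}")
        train_model(constants)

if __name__ == "__main__":
    main_sequential([{"DEVICE": "cpu", "run": i} for i in range(23)])
```

If training completes this way but hangs with the pool, the problem is likely in how Windows spawns the worker processes.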

benmoseley · Aug 20 '21 15:08

Hello Ben,

I tried running it on Linux (Ubuntu) as well, on a machine with more RAM. That error is gone now, but the program does not exit; it remains stuck after the 23 runs. Below is a screenshot of the output. [screenshot: Error_Message]

Gaurav11ME · Aug 24 '21 14:08

That looks like normal behaviour. What should happen is that the script also outputs a logging file per process in the current directory, named screenlog.main.[process id].log; if you look at these files you will see the training statistics output by each process as training progresses. I usually use the tail -f Linux command to monitor these files during training. You can also use the top or htop Linux commands to check that your processes are indeed running, and, if you are running on the GPU, nvidia-smi or similar. The main program should stop once training is complete across all the processes. Note that each training run is placed in a queue, and the parallel processes concurrently work through it until the queue is empty.
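
For readers unfamiliar with this pattern, below is a generic, self-contained sketch of the queue-of-runs idea (standard-library multiprocessing only; it is illustrative and not the actual code in shared_modules/multiprocess.py):

```python
import multiprocessing as mp
from queue import Empty

def worker(run_queue):
    # Each worker repeatedly takes a run off the shared queue and
    # processes it, exiting once no more runs arrive.
    while True:
        try:
            run_id = run_queue.get(timeout=1)
        except Empty:
            break
        print(f"{mp.current_process().name} training run {run_id}")
        # ... the per-run training would happen here ...

if __name__ == "__main__":
    run_queue = mp.Queue()
    for run_id in range(23):  # e.g. the 23 runs mentioned above
        run_queue.put(run_id)
    workers = [mp.Process(target=worker, args=(run_queue,)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()  # the main program only exits after every run is processed
```

The key point is the final join() calls: the main thread blocks there until all workers have drained the queue, which is why the script only exits once training is complete across all processes.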

benmoseley · Aug 25 '21 10:08