FBPINNs
Running Error
Hello,
I am getting an error while running the file paper_main_1D.py. I am using Spyder IDE on Anaconda.
"OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\Users\gaura\Anaconda3\envs\torch\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies."
This looks related to this pytorch issue: https://github.com/ultralytics/yolov3/issues/1643
A few thoughts come to mind:
- It could be a RAM issue; currently the script runs 23 parallel processes to train the runs defined in the script. You could try using `DEVICES = ["cpu"]*4` to use fewer processes (and less RAM).
- It could be due to the multiprocessing pool of workers defined in `shared_modules/multiprocess.py` not "playing nicely" with Windows (all of my tests were carried out using Linux / MacOS). It is worth testing without this class (i.e. training all of the runs in a large for loop on the main thread, as in the sketch below) to see if this is the problem.
Hello Ben,
I tried running it on Linux (Ubuntu) as well, on a computer with more RAM. That error is now gone, but the program is not exiting; it remains stuck after the 23 runs. Below is a screenshot of the output.
That looks like normal behaviour. What should happen is that the script also outputs a logging file per process in the current directory, named `screenlog.main.[process id].log`; if you look at these files you will see the training statistics output by each process as training progresses. I usually use the `tailf` linux command to monitor these files during training. You can also use the `top` or `htop` linux commands to check your processes are indeed running, and, if you are running on the GPU, `nvidia-smi` or similar. The main program should stop once training is complete across all the processes. NB each training run is placed in a queue and the parallel processes concurrently process these until the queue is empty.
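For reference, a minimal sketch of that queue-and-workers pattern could look like the code below. This is an illustration of the idea, not the actual contents of `shared_modules/multiprocess.py`; the run names and the training call are placeholders.

```python
# Illustrative sketch of the queue/worker pattern described above
# (not the repo's actual multiprocess.py).
import multiprocessing as mp
from queue import Empty

def worker(q, device):
    # Each worker process pulls runs off the shared queue until it is empty.
    while True:
        try:
            run = q.get_nowait()
        except Empty:
            return
        print(f"[{device}] training {run}")  # placeholder for the real training call

if __name__ == "__main__":
    q = mp.Queue()
    for i in range(23):  # 23 training runs, matching the script
        q.put(f"run_{i}")
    devices = ["cpu"] * 4  # one worker per device entry, as suggested above
    procs = [mp.Process(target=worker, args=(q, d)) for d in devices]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # the main program exits once every worker has drained the queue
```

With this pattern the main process blocks on `join()` until all workers finish, which matches the behaviour described above: the script only exits once the queue of training runs is empty.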