
ImportError: DLL load failed: The paging file is too small for this operation to complete.

Open toiyeumayhoc opened this issue 6 years ago • 12 comments

After running the main_fine_tuning.py file, I got this traceback:

Epoch 0/99
LR is set to 0.001
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\dk12a7\Anaconda3\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\dk12a7\Anaconda3\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
  File "C:\Users\dk12a7\Anaconda3\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\dk12a7\Anaconda3\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "C:\Users\dk12a7\Anaconda3\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Users\dk12a7\Anaconda3\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Users\dk12a7\Anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\dk12a7\Desktop\code classification\Pytorch_fine_tuning_Tutorial\main_fine_tuning.py", line 4, in <module>
    import torch
  File "C:\Users\dk12a7\Anaconda3\lib\site-packages\torch\__init__.py", line 80, in <module>
    from torch._C import *
ImportError: DLL load failed: The paging file is too small for this operation to complete.

Traceback (most recent call last):
  File "main_fine_tuning.py", line 265, in <module>
    num_epochs=100)
  File "main_fine_tuning.py", line 162, in train_model
    for data in dset_loaders[phase]:
  File "C:\Users\dk12a7\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 501, in __iter__
    return _DataLoaderIter(self)
  File "C:\Users\dk12a7\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 289, in __init__
    w.start()
  File "C:\Users\dk12a7\Anaconda3\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users\dk12a7\Anaconda3\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\dk12a7\Anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\dk12a7\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\dk12a7\Anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
BrokenPipeError: [Errno 32] Broken pipe

I tried setting BATCH_SIZE = 1, but the problem still occurs. Do you have any solution for this one?

toiyeumayhoc avatar Dec 04 '18 13:12 toiyeumayhoc

I ran into the same problem, have you found a solution?

brianFruit avatar Jan 21 '19 23:01 brianFruit

@brianFruit Still stuck on this one.

toiyeumayhoc avatar Jan 28 '19 15:01 toiyeumayhoc

I've also encountered that problem, and it seems to be a multiprocessing problem. What worked for me was reducing the number of workers in the DataLoader (line 108 in your code). Your number is quite high: 25. Workers are subprocesses that load the data, so with 25 of them your CPU can rebel :) Try reducing it to 1, and if that works you can try to increase it. If I'm reasoning correctly, it shouldn't exceed the number of logical processors in your CPU (and if you are computing something else in parallel, like me right now with another DataLoader, you should decrease it even more).
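For example, a minimal sketch of capping the worker count (not the tutorial's exact code; the "data/train" path and transform are illustrative placeholders):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

if __name__ == "__main__":  # guard required on Windows when num_workers > 0
    # Placeholder dataset; substitute your own data and transforms.
    dataset = datasets.ImageFolder("data/train", transform=transforms.ToTensor())

    # Try num_workers=1 first (or 0 to load in the main process); if stable,
    # increase gradually, staying at or below os.cpu_count() logical CPUs.
    loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=1)

    for images, labels in loader:
        ...  # training step goes here
```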

Hope that helps future generations.

MarcinMisiurewicz avatar Jun 06 '19 08:06 MarcinMisiurewicz

Hi there, I hit the same problem with my setups (both on Windows). Originally I had an X99 board with an 8-core CPU, 64 GB of RAM, and 2x RTX 2080 Ti, and I was able to run up to 6 PyTorch RL algorithms with up to 10 multiprocessing workers each (60 workers in total running in parallel; obviously they were taking turns). If I pushed past those numbers, I would get the errors described above. Now I have changed my setup to a 3970X with 32 cores, 64 GB of RAM, and the same 2x GPUs. I can barely run 3 of the same algos with up to 8 workers each; any load beyond that generates the same error, even though RAM usage never exceeds 40-50% while they run. Any pointer in the right direction will be highly appreciated. Thanks!

Javierete avatar Jan 27 '21 07:01 Javierete

I think I managed to solve it (so far). Steps were:

1. Windows + Pause key
2. Advanced system settings
3. Advanced tab
4. Performance - Settings button
5. Advanced tab - Change button
6. Uncheck the "Automatically... BLA BLA" checkbox
7. Select the "System managed size" option
8. OK, OK, OK... Restart PC. BOOM

Not sure if it's the best way to solve the problem, but it has worked so far (fingers crossed).
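A quick way to see whether page-file headroom is actually the bottleneck, before or after changing the setting, is to watch memory and swap usage. A small sketch, assuming the third-party psutil package is installed (pip install psutil):

```python
# Sketch: print RAM and page-file (swap) headroom with psutil.
import psutil

vm = psutil.virtual_memory()
sm = psutil.swap_memory()
print(f"RAM:       {vm.used / 2**30:.1f} / {vm.total / 2**30:.1f} GiB used")
print(f"Page file: {sm.used / 2**30:.1f} / {sm.total / 2**30:.1f} GiB used")
```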

Javierete avatar Jan 27 '21 08:01 Javierete

@Javierete This solution is working for me - thanks! I noticed the error returns for me when free space dips below 7-8 GB for the application I'm running.

wood73 avatar Feb 19 '21 20:02 wood73

Hi Woodrow73, if it's of any value, I ended up setting the values manually, to a ridiculous 360 GB minimum and 512 GB maximum. I also added an extra SSD and allocated all of it to virtual memory. This solved the problem, and now I can run up to 128 processes using PyTorch and CUDA. I also found that every launch of Python and PyTorch commits a ridiculous amount of memory to RAM, which then drifts into virtual memory when not actively used. Anyway, just sharing my learnings.
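For a rough sense of why the page file needs to be that large (the per-process figure below is an assumption for illustration, not a measurement from this thread), each CUDA-enabled PyTorch process can commit several GiB at import time:

```python
# Back-of-the-envelope arithmetic; 2.5 GiB/process is an assumed figure.
processes = 128
commit_per_process_gib = 2.5
print(f"~{processes * commit_per_process_gib:.0f} GiB of commit capacity needed")
# ~320 GiB, the same ballpark as the 360-512 GB page file described above.
```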

Javierete avatar Feb 22 '21 01:02 Javierete

I ran this on my PC and encountered the issue with something that seems like it should be minimal in terms of memory usage:

```python
import tensorflow as tf
print(tf.__version__)
```

I just closed several applications and the problem went away, so it truly seems to be a resource issue.

rlhull6 avatar Feb 27 '21 00:02 rlhull6

TF.txt

Can someone please assist me with this error? I am kind of new to this, so please help me out. I have attached the complete error message.

Chetanvikram46 avatar Jun 01 '21 15:06 Chetanvikram46

I have managed to mitigate (although not completely solve) this issue. I posted a more detailed explanation at the StackOverflow link, but basically try this:

1. Download: https://gist.github.com/cobryan05/7d1fe28dd370e110a372c4d268dcb2e5
2. Install the dependency: python -m pip install pefile
3. Run (for the OP's paths) (NOTE: THIS WILL MODIFY YOUR DLLS, although it will back them up): python fixNvPe.py --input C:\Users\dk12a7\Anaconda3\lib\site-packages\torch\lib\*.dll

cobryan05 avatar Oct 10 '21 18:10 cobryan05

6. Uncheck the "Automatically... BLA BLA" checkbox

Hello, thanks for the solution, but it doesn't seem to work now. I have an HP Pavilion 15-EC2150AX laptop and the settings specified don't appear on my side. Any sort of help will be highly appreciated.

Thanks

crazypythonista avatar Mar 06 '22 12:03 crazypythonista

Hello, thanks for the solution, but it doesn't seem to work now. I have an HP Pavilion 15-EC2150AX laptop and the settings specified don't appear on my side. Any sort of help will be highly appreciated.

The setting name is "Automatically Manage Paging File Size For All Drives" and is at the top of the "Virtual Memory" page after clicking the 'change' button.

However, instead of making this change, you should first try my fix in the comment immediately before yours, and only apply page-file-size fixes if they are still necessary.

For a description of what my fix does, see here: https://stackoverflow.com/a/69489193/213316
For a comparison of my fix against other fixes, see here: https://github.com/ultralytics/yolov3/issues/1643#issuecomment-985652432

cobryan05 avatar Mar 07 '22 15:03 cobryan05