dreamer-pytorch

Multi gpu

Open • AliengirlLiv opened this issue 4 years ago • 2 comments

This runs on multiple GPUs. That said, there are some sketchy things:

  • I just set num_cpus to the number of CPUs on my desktop, but I don't know what the best value is.
  • I don't know whether any of the other arguments to make_affinity are important.
  • Still need to compare training times to see whether multi-GPU training is actually faster.
  • Still need to do a run to confirm that accuracy isn't impacted.
  • Not sure whether calling model.module is the recommended way to access the model during multi-GPU training.
  • When using multiple GPUs, the RSSMState we pass into a function in agent.py arrives as a plain tuple. No clue why. Re-wrapping the tuple in RSSMState seems to fix it, but I'm not sure why that's necessary.
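
For the last two points, here is a minimal, torch-free sketch of the two workarounds. It rests on some assumptions: `RSSMState` below is a stdlib `namedtuple` stand-in (the field names are a guess, and the project's own type may differ), and `unwrap_model`, `as_rssm_state`, and `FakeParallelWrapper` are hypothetical names used only for illustration. PyTorch's `DataParallel` and `DistributedDataParallel` both expose the wrapped network as `.module`, which is what the first helper relies on. One plausible explanation for the tuple issue, not confirmed in this thread, is that the wrappers' input scatter step rebuilds tuple arguments as plain tuples and drops the named type.

```python
from collections import namedtuple

# Stand-in for the project's RSSMState; the field names are an assumption.
RSSMState = namedtuple("RSSMState", ["mean", "std", "stoch", "deter"])


class FakeParallelWrapper:
    """Hypothetical stand-in for a parallel wrapper that stores the net as `.module`."""

    def __init__(self, module):
        self.module = module


def unwrap_model(model):
    """Return `model.module` if the attribute exists, otherwise `model` itself.

    Duck-typed on purpose: a network that defines its own unrelated `module`
    attribute would be unwrapped by mistake, so an isinstance check against
    the real wrapper classes is the stricter alternative.
    """
    return getattr(model, "module", model)


def as_rssm_state(state):
    """Re-wrap a plain tuple as RSSMState; pass an existing RSSMState through."""
    if isinstance(state, RSSMState):
        return state
    return RSSMState(*state)


net = object()
wrapped = FakeParallelWrapper(net)
same_net = unwrap_model(wrapped) is net and unwrap_model(net) is net

plain = (0.0, 1.0, 0.5, 0.25)  # what arrives after the named type is lost
state = as_rssm_state(plain)
```

Calling `as_rssm_state` at the top of the affected function keeps the single-GPU path unchanged, since an existing `RSSMState` is returned as-is.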

Side note: on my computer at least, the model doesn't use even one GPU fully, so increasing the batch size may be an easier way to get a speed boost.

AliengirlLiv, Apr 28 '20 20:04

Codecov Report

Merging #64 into master will increase coverage by 0.10%. The diff coverage is 80.00%.

@@            Coverage Diff             @@
##           master      #64      +/-   ##
==========================================
+ Coverage   68.73%   68.84%   +0.10%     
==========================================
  Files          24       24              
  Lines        1126     1133       +7     
==========================================
+ Hits          774      780       +6     
- Misses        352      353       +1     

Flag          Coverage Δ
#unittests    68.84% <80.00%> (+0.10%) ↑

Impacted Files                   Coverage Δ
dreamer/algos/dreamer_algo.py    86.31% <75.00%> (-0.18%) ↓
dreamer/models/agent.py          91.46% <100.00%> (+0.21%) ↑

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 5644b6e...e057d4b.

codecov[bot], Apr 28 '20 20:04

make_affinity errors on Windows, because rlpyt's get_n_socket shells out to `cat /proc/cpuinfo`, which only exists on Linux:

'cat' is not recognized as an internal or external command,
operable program or batch file.
Traceback (most recent call last):
  File "C:/Users/Julius/Documents/GitHub/dreamer-pytorch/main.py", line 106, in <module>
    gpu_per_run=args.num_gpus,  # How many GPUs to parallelize one run across.
  File "C:\Users\Julius\Anaconda3\envs\rlpyt\lib\site-packages\rlpyt\utils\launching\affinity.py", line 162, in make_affinity
    return affinity_from_code(encode_affinity(run_slot=run_slot, **kwargs))
  File "C:\Users\Julius\Anaconda3\envs\rlpyt\lib\site-packages\rlpyt\utils\launching\affinity.py", line 111, in encode_affinity
    n_socket = get_n_socket()
  File "C:\Users\Julius\Anaconda3\envs\rlpyt\lib\site-packages\rlpyt\utils\launching\affinity.py", line 171, in get_n_socket
    shell=True)))
  File "C:\Users\Julius\Anaconda3\envs\rlpyt\lib\subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "C:\Users\Julius\Anaconda3\envs\rlpyt\lib\subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'cat /proc/cpuinfo | grep "physical id" | sort -u | wc -l' returned non-zero exit status 255.
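
A portable way to get the same number, sketched in pure Python under a few assumptions: `parse_socket_count` and `count_cpu_sockets` are hypothetical helpers, not part of rlpyt, and falling back to a single socket when `/proc/cpuinfo` is missing is a guess that suits a typical desktop but not necessarily a multi-socket server. The parsing mirrors the shell pipeline in the traceback by counting distinct "physical id" values.

```python
def parse_socket_count(cpuinfo_text):
    """Count distinct 'physical id' values in /proc/cpuinfo-style text."""
    physical_ids = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "physical id":
            physical_ids.add(value.strip())
    return len(physical_ids)


def count_cpu_sockets(cpuinfo_path="/proc/cpuinfo", default=1):
    """Return the socket count, or `default` when it cannot be determined.

    Unlike the shell pipeline, this does not raise when the file is missing
    (e.g. on Windows), and it returns `default` rather than 0 when the file
    has no 'physical id' lines.
    """
    try:
        with open(cpuinfo_path) as f:
            count = parse_socket_count(f.read())
    except OSError:
        return default
    return count if count > 0 else default


# Two logical CPUs on socket 0 and one on socket 1.
sample = (
    "processor\t: 0\nphysical id\t: 0\n"
    "processor\t: 1\nphysical id\t: 0\n"
    "processor\t: 2\nphysical id\t: 1\n"
)
sample_sockets = parse_socket_count(sample)
missing_file_sockets = count_cpu_sockets("/nonexistent/cpuinfo", default=1)
```

If the installed rlpyt version accepts an `n_socket` argument to `make_affinity` (an assumption to check against its source), passing a value like this explicitly may avoid the shell call altogether.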

juliusfrost, Apr 29 '20 04:04