dreamer-pytorch
Multi GPU
This runs on multiple GPUs. That said, there are some sketchy things:
- I just chose `num_cpus` equal to the number of CPUs on my desktop, but I don't know what the best number is.
- I don't know whether any of the other arguments to `make_affinity` are important.
- Training times still need to be compared to check whether this actually trains faster.
- A full run is needed to confirm accuracy isn't impacted.
- I'm not sure whether calling `model.module` is the recommended way to access the model during multi-GPU training (sketched below).
- When using multiple GPUs, the `RSSMState` we pass into a function in `agent.py` somehow arrives as a plain tuple. No clue why. Re-wrapping the tuple in an `RSSMState` seems to fix it, but I'm not sure why that's necessary.

Side note: on my computer at least, the model doesn't even use one GPU fully, so increasing the batch size may be an easier way to get a speed boost.
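For reference, here is a minimal sketch of the two multi-GPU points above: reaching the wrapped model through `.module`, and re-wrapping a state that the scatter step has turned into a plain tuple. It assumes the model is wrapped with `torch.nn.DataParallel`; `TinyModel` and the dummy `RSSMState` below are stand-ins, not the repo's real classes.

```python
from collections import namedtuple

import torch
import torch.nn as nn

# Stand-in classes for illustration only; the real RSSMState and model live
# in dreamer/models/.
RSSMState = namedtuple("RSSMState", ["mean", "std", "stoch", "deter"])

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 8)

    def forward(self, x):
        return self.fc(x)

model = nn.DataParallel(TinyModel())   # replicates across all visible GPUs
out = model(torch.randn(4, 8))         # forward calls go through the wrapper

# `.module` reaches the underlying model, e.g. for checkpointing or for
# calling methods that DataParallel does not forward:
state_dict = model.module.state_dict()

# DataParallel's scatter can hand a namedtuple state back as a plain tuple,
# so re-wrap it before relying on attribute access:
prev_state = tuple(torch.zeros(4, 8) for _ in range(4))  # arrives as a tuple
if not isinstance(prev_state, RSSMState):
    prev_state = RSSMState(*prev_state)
assert prev_state.stoch.shape == (4, 8)
```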
Codecov Report

Merging #64 into master will increase coverage by 0.10%. The diff coverage is 80.00%.

```diff
@@            Coverage Diff             @@
##           master      #64      +/-   ##
==========================================
+ Coverage   68.73%   68.84%   +0.10%
==========================================
  Files          24       24
  Lines        1126     1133       +7
==========================================
+ Hits          774      780       +6
- Misses        352      353       +1
```

| Flag | Coverage Δ | |
|---|---|---|
| #unittests | 68.84% <80.00%> (+0.10%) | :arrow_up: |

| Impacted Files | Coverage Δ | |
|---|---|---|
| dreamer/algos/dreamer_algo.py | 86.31% <75.00%> (-0.18%) | :arrow_down: |
| dreamer/models/agent.py | 91.46% <100.00%> (+0.21%) | :arrow_up: |

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update 5644b6e...e057d4b.
`make_affinity` errors on Windows:

```
'cat' is not recognized as an internal or external command,
operable program or batch file.
Traceback (most recent call last):
  File "C:/Users/Julius/Documents/GitHub/dreamer-pytorch/main.py", line 106, in <module>
    gpu_per_run=args.num_gpus,  # How many GPUs to parallelize one run across.
  File "C:\Users\Julius\Anaconda3\envs\rlpyt\lib\site-packages\rlpyt\utils\launching\affinity.py", line 162, in make_affinity
    return affinity_from_code(encode_affinity(run_slot=run_slot, **kwargs))
  File "C:\Users\Julius\Anaconda3\envs\rlpyt\lib\site-packages\rlpyt\utils\launching\affinity.py", line 111, in encode_affinity
    n_socket = get_n_socket()
  File "C:\Users\Julius\Anaconda3\envs\rlpyt\lib\site-packages\rlpyt\utils\launching\affinity.py", line 171, in get_n_socket
    shell=True)))
  File "C:\Users\Julius\Anaconda3\envs\rlpyt\lib\subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "C:\Users\Julius\Anaconda3\envs\rlpyt\lib\subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'cat /proc/cpuinfo | grep "physical id" | sort -u | wc -l' returned non-zero exit status 255.
```
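The failure comes from rlpyt's `get_n_socket()`, which shells out to `cat /proc/cpuinfo` and therefore only works on Linux. A possible (untested, hypothetical) workaround on a single-socket Windows machine is to patch that function before calling `make_affinity`; the keyword arguments below are only illustrative and should mirror whatever `main.py` already passes:

```python
# Hypothetical Windows workaround: rlpyt's get_n_socket() runs
# `cat /proc/cpuinfo ...`, which fails outside Linux, so pretend there is a
# single CPU socket before make_affinity() is called.
import rlpyt.utils.launching.affinity as aff

aff.get_n_socket = lambda: 1  # skip the /proc/cpuinfo probe (assumes 1 socket)

affinity = aff.make_affinity(
    run_slot=0,
    n_cpu_core=8,     # physical cores to use; machine-specific
    n_gpu=2,          # GPUs visible to the run
    gpu_per_run=2,    # how many GPUs to parallelize one run across
)
```

Since the traceback shows `make_affinity` forwarding `**kwargs` to `encode_affinity`, passing `n_socket=1` directly may also sidestep the probe, but that depends on the rlpyt version.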