MemoryError when training DistributedRL after about an hour
Problem description
When I train DistributedRL (https://github.com/Microsoft/AutonomousDrivingCookbook/blob/master/DistributedRL/LaunchLocalTrainingJob.ipynb), it works at first, but after about one hour I get the error below. (PS: I had already changed "threshold=np.nan" to "threshold=sys.maxsize" on line 609 of "distributed_agent.py" to get "train.bat" to run at all the first time. I don't know if that matters.)
My English is not very good; I hope I have expressed the issue clearly.
Problem details
Start time: 2019-04-15 07:23:33.036246, end time: 2019-04-15 07:23:45.755073
Percent random actions: 0.10204081632653061
Num total actions: 98
Generating 98 minibatches...
Sampling Experiences.
Publishing AirSim Epoch.
Publishing epoch data and getting latest model from parameter server...
Traceback (most recent call last):
File "distributed_agent.py", line 643, in
Experiment/Environment details
- Tutorial used: DistributedRL
- Environment used: neighborhood
- Versions of artifacts used (if applicable): TensorFlow 1.13.1; Python 3.6.2; Keras 2.1.2; NumPy 1.16.2
- Hard disk state: C: 9.48 GB available; E: 25.4 GB available (DistributedRL's workspace)
- Hardware: GPU GTX 960M (4 GB); RAM 8 GB; CPU i5-6300HQ
What was your solution? I ran into the same issue on the newest version. Instead of changing "threshold=np.nan" to "threshold=sys.maxsize", I changed it to "threshold=np.inf", which lets the script run without errors.
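For context, a minimal sketch of why the original line fails and why either replacement works. This assumes NumPy >= 1.16, which added validation of the threshold argument (gh-12351); it is not the cookbook's own code, just an illustration of the print-options behavior:

```python
import sys
import numpy as np

# NumPy >= 1.16 rejects threshold=np.nan outright, which is why the
# original line 609 of distributed_agent.py fails on newer installs:
try:
    np.set_printoptions(threshold=np.nan)
except ValueError:
    # "threshold must be non-NAN, try sys.maxsize for untruncated representation"
    pass

# Either replacement disables array summarization: np.inf reads cleanly,
# and sys.maxsize is what the NumPy error message itself suggests.
np.set_printoptions(threshold=np.inf)  # or: threshold=sys.maxsize

# With an infinite threshold, even large arrays print in full (no "..."):
print('...' in np.array2string(np.arange(2000)))  # False
```

Note this only restores the old "always print the full array" behavior; it is unrelated to the MemoryError itself.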
Thank you!