MemoryError when training DistributedRL after about an hour
Problem description
When I train DistributedRL (https://github.com/Microsoft/AutonomousDrivingCookbook/blob/master/DistributedRL/LaunchLocalTrainingJob.ipynb), it works at first, but after about one hour I get the error below. (PS: I had already changed "threshold=np.nan" to "threshold=sys.maxsize" on line 609 of "distributed_agent.py" to get "train.bat" to run at all the first time. I don't know if that matters.)
My English is not very good; I hope I have expressed the issue clearly.
Problem details
Start time: 2019-04-15 07:23:33.036246, end time: 2019-04-15 07:23:45.755073
Percent random actions: 0.10204081632653061
Num total actions: 98
Generating 98 minibatches...
Sampling Experiences.
Publishing AirSim Epoch.
Publishing epoch data and getting latest model from parameter server...
Traceback (most recent call last):
File "distributed_agent.py", line 643, in
Experiment/Environment details
- Tutorial used: DistributedRL
- Environment used: neighborhood
- Versions of artifacts used (if applicable): TensorFlow 1.13.1; Python 3.6.2; Keras 2.1.2; NumPy 1.16.2
- Hard disk state: C: 9.48 GB available; E: 25.4 GB available (DistributedRL's workspace)
- Hardware: GPU GTX 960M (4 GB); RAM 8 GB; CPU i5-6300HQ
What was your solution? I ran into the same issue on the newest version. Instead of changing "threshold=np.nan" to "threshold=sys.maxsize", I changed it to "threshold=np.inf", which lets the script run without errors.
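For context, a minimal sketch of why the original line fails and why either replacement works. This assumes NumPy >= 1.16, which added validation of the threshold argument (gh-12351); it is not the cookbook's own code, just an illustration of the print-options behavior:

```python
import sys
import numpy as np

# NumPy >= 1.16 rejects threshold=np.nan outright, which is why the
# original line 609 of distributed_agent.py fails on newer installs:
try:
    np.set_printoptions(threshold=np.nan)
except ValueError:
    # "threshold must be non-NAN, try sys.maxsize for untruncated representation"
    pass

# Either replacement disables array summarization: np.inf reads cleanly,
# and sys.maxsize is what the NumPy error message itself suggests.
np.set_printoptions(threshold=np.inf)  # or: threshold=sys.maxsize

# With an infinite threshold, even large arrays print in full (no "..."):
print('...' in np.array2string(np.arange(2000)))  # False
```

Note this only restores the old "always print the full array" behavior; it is unrelated to the MemoryError itself.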
Thank you!