amazon-dsstne

Allocate failed out of memory when predicting user's click data

Open • qubingxin opened this issue 8 years ago • 6 comments

We are training a recommender system on data consisting of user IDs and the articles those users clicked.

  1. The config we used is ./samples/movielens/config.json.
  2. First, we trained and tested the model successfully on one day's data; the resulting model is 4.1 GB.
  3. We then tried training on three days of data. Training succeeded, and the model is actually smaller than before (3.5 GB), but prediction fails with: GpuBuffer::Allocate failed out of memory, predict: ../engine/GpuTypes.h:463: void GpuBuffer<T>::Allocate() [with T = float]: Assertion '0' failed. How should we troubleshoot and fix this? (A quick device-memory check we can run before predicting is sketched below.)
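
For reference, a standalone way to see how much device memory is actually free right before running predict is a cudaMemGetInfo check like the sketch below. This is illustrative only and not part of DSSTNE; build it with nvcc and run it just before the predict step.

```cpp
// Standalone sketch (not part of DSSTNE): print free vs. total device memory
// so it can be compared against the buffer GpuBuffer<T>::Allocate is requesting.
// Build with: nvcc -o gpumem gpumem.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t status = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (status != cudaSuccess) {
        std::fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(status));
        return 1;
    }
    std::printf("GPU memory: %.2f GiB free of %.2f GiB total\n",
                freeBytes / (1024.0 * 1024.0 * 1024.0),
                totalBytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```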

qubingxin avatar Jun 14 '17 01:06 qubingxin

What GPU are you using?

scottlegrand avatar Jun 14 '17 02:06 scottlegrand

Tesla M40 24GB

qubingxin avatar Jun 14 '17 03:06 qubingxin

Cut your batch size in half and see if that fixes this. I am somewhat shooting in the dark here.
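
The rough reasoning, as a sketch rather than DSSTNE internals: buffers sized per batch scale as batch_size × layer_width × sizeof(float), so halving the batch roughly halves those allocations, while fixed-size allocations such as the weights do not shrink at all. The layer width below is a made-up placeholder, not your actual network.

```cpp
// Back-of-the-envelope sketch (assumptions, not DSSTNE internals): how a
// batch-sized float buffer scales with batch size for a hypothetical layer width.
#include <cstdio>

int main() {
    const double layerWidth = 200000.0;  // hypothetical number of output units (e.g. item count)
    const unsigned batches[] = {256, 128, 1};
    for (unsigned batch : batches) {
        double gib = batch * layerWidth * sizeof(float) / (1024.0 * 1024.0 * 1024.0);
        std::printf("batch %3u -> ~%.3f GiB per batch-sized float buffer\n", batch, gib);
    }
    return 0;
}
```

If the error persists even at very small batch sizes, the allocation that fails is likely one of the fixed-size buffers rather than a per-batch one.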

scottlegrand avatar Jun 14 '17 04:06 scottlegrand

We cut the batch size from 256 down to 1, and the same error still occurred.

qubingxin avatar Jun 14 '17 05:06 qubingxin

Weird. Could you rebuild after uncommenting //#define MEMTRACKING in GpuTypes.h and send the output? The other option is to run across multiple GPUs with MPI, if you have them.
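
To be concrete, the change is just activating the memory-tracking define in GpuTypes.h (the exact location may differ between versions) and rebuilding before rerunning predict:

```cpp
// GpuTypes.h -- enable DSSTNE's built-in memory tracking, then rebuild.
// Before:
//     //#define MEMTRACKING
// After:
#define MEMTRACKING
```

For the multi-GPU route, DSSTNE's binaries can be launched under mpirun to spread the model across several GPUs; see the repo's multi-GPU documentation for the exact invocation.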

scottlegrand avatar Aug 23 '17 16:08 scottlegrand

Ping?

slegrandA9 avatar Sep 21 '17 01:09 slegrandA9