UCPhrase-exp
Getting "Killed" error when run on kp20k Dataset
When I run python exp.py --gpu 0 --dir_data ../data/kp20k, I get a "Killed" error. While debugging, I found that it happens at return torch.tensor(padded_tensor, dtype=torch.float32) (line 66 of src/model_att/model.py). Is there a way to fix this?
Hi, this looks like a resource issue, assuming the job was not manually terminated by a server admin. Could you please check the memory usage and the swap space? If it was due to a lack of resources, you can try reducing the batch size at this line. Please give it a try and let us know if the issue persists, and we will see whether we can optimize the resource usage (sorry, the experimental code is a bit messy right now). Thanks!
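For reference, a minimal way to check available RAM and swap from Python with psutil looks roughly like this (a generic snippet, not part of the UCPhrase code):

```python
import psutil

def report_memory():
    # Compare available RAM and swap against the size of the array being built.
    gib = 1024 ** 3
    vm = psutil.virtual_memory()
    sw = psutil.swap_memory()
    print(f"RAM : {vm.available / gib:.1f} GiB available of {vm.total / gib:.1f} GiB")
    print(f"Swap: {sw.free / gib:.1f} GiB free of {sw.total / gib:.1f} GiB")

report_memory()
```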
Hi! I did try reducing the batch size (from 2048 down to 4), but I got the same error. On kp20k, the numpy array has dimensions (6529257, 36, 10, 10), for a total of 23505325200 elements, and the process gets killed when converting it to a torch tensor. Let me know if there is an optimization that would make it run on my system. Thanks :)
Thank you for the feedback! Sorry, it is not the batch size that matters here: at this step we load all training instances into memory, so the data may exceed the memory limit. To verify this, you can check the memory usage, or try adding instances = instances[:100] to see if the error goes away. For now, it would help if you could test it on a machine with more memory.
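For a rough sense of scale (my own arithmetic, not from the thread): an array with 23,505,325,200 float32 elements already takes about 88 GiB, and torch.tensor() copies its input, so the peak during that conversion is roughly twice that:

```python
# Back-of-envelope estimate for the kp20k attention-map array (6529257, 36, 10, 10).
n_elements = 6529257 * 36 * 10 * 10          # 23,505,325,200 elements
array_gib = n_elements * 4 / 2**30           # float32 = 4 bytes per element
print(f"numpy array alone: {array_gib:.1f} GiB")       # ~87.6 GiB
# torch.tensor(ndarray) copies the data, so array + tensor coexist briefly.
print(f"peak during copy : {2 * array_gib:.1f} GiB")   # ~175 GiB
```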
Hi! I also ran into this problem when running on the kpTimes dataset. I used psutil to monitor memory, and I'm really confused why 76 GB of free memory is not enough. I changed the pad_attention_maps function in src/model_att/model.py to print the memory usage at each step.
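A sketch of that kind of instrumentation (the log_mem helper, the shapes, and the padding logic are assumptions for illustration, not the commenter's actual code):

```python
import numpy as np
import psutil
import torch

def log_mem(tag):
    # Print currently available system RAM so the cost of each step is visible.
    avail = psutil.virtual_memory().available / 1024 ** 3
    print(f"[{tag}] available RAM: {avail:.1f} GiB")

def pad_attention_maps(attention_maps, max_len):
    # attention_maps: list of (num_heads, seq_len, seq_len) float arrays with
    # varying seq_len; pad each map with zeros to a common (max_len, max_len).
    log_mem("before allocation")
    padded_tensor = np.zeros(
        (len(attention_maps), attention_maps[0].shape[0], max_len, max_len),
        dtype=np.float32,
    )
    log_mem("after allocating padded_tensor")
    for i, att in enumerate(attention_maps):
        seq_len = att.shape[-1]
        padded_tensor[i, :, :seq_len, :seq_len] = att
    log_mem("after filling padded_tensor")
    # The reported crash happens here: torch.tensor() copies the whole array.
    final = torch.tensor(padded_tensor, dtype=torch.float32)
    log_mem("after converting to torch tensor")
    return final
```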
The output showed that the assignment of padded_tensor alone cost about 40 GB, which is also a little confusing to me, and the line final = torch.tensor(padded_tensor, dtype=torch.float32) then ran out of memory and caused the process to be killed.
I think this is mainly because we load the entire training dataset into memory for the experiments. It wasn't a bottleneck on our server, but it can be an issue on machines without sufficient free memory. We will try to find time to fix it.
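One possible stopgap, not from the repo but based on where the crash is reported: since padded_tensor is already a float32 numpy array, torch.from_numpy can wrap the existing buffer instead of copying it, which roughly halves the peak memory of that line (it does not remove the need to hold the full array in RAM):

```python
import numpy as np
import torch

# Small stand-in for the real padded array, which is float32 and tens of GiB.
padded_tensor = np.zeros((1000, 36, 10, 10), dtype=np.float32)

# torch.from_numpy() shares the numpy buffer rather than copying it, so the
# peak usage of this step is roughly half of
# torch.tensor(padded_tensor, dtype=torch.float32).
final = torch.from_numpy(padded_tensor)
assert final.dtype == torch.float32
```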