char-rnn
Training w/ GPU, sampling without GPU
Is the restriction that a model trained with a GPU can only be sampled with a GPU in the char-rnn code or in the torch code? It would be handy to be able to train a model on a machine with a fast GPU and then use the model on another machine.
You can call :float() on your GPU model and it changes into a CPU model.
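A minimal sketch of that :float() conversion, assuming a char-rnn checkpoint saved by train.lua on a GPU machine (the file paths below are just placeholders):

require 'torch'
require 'nn'
require 'cutorch'   -- still needed to deserialize the saved CudaTensors
require 'cunn'

local checkpoint = torch.load('cv/some_gpu_checkpoint.t7')   -- placeholder path
for _, module in pairs(checkpoint.protos) do
    module:float()   -- cast each module (rnn, criterion, ...) down to CPU floats
end
torch.save('cv/some_cpu_checkpoint.t7', checkpoint)   -- now loadable without a GPU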
Yes, I had this problem too: the vars are saved as CUDA tensors, so you need to use cutorch and cunn to load them... It should save in float and then allow loading nets for use on CPU or GPU; in the GPU case, you can then call :cuda().
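On the sampling side, the flow being described might look roughly like this (hedged sketch; opt.model and opt.gpuid follow sample.lua's existing options):

local checkpoint = torch.load(opt.model)   -- a CPU (float) checkpoint
local rnn = checkpoint.protos.rnn
if opt.gpuid >= 0 then
    require 'cutorch'
    require 'cunn'
    rnn:cuda()        -- ship the CPU model to the GPU only if the user asked for it
end
rnn:evaluate()        -- disable dropout for sampling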
I was just thinking about this as well. @soumith is the preferred solution to always save CPU models and explicitly convert to GPU in the sampling script if the user wants? This seems like the right way to go, as culurciello mentions.
Yes that seems the right way to go.
@soumith I'm not fully comfortable with some of these APIs and best practices. I'm planning to iterate over all entries in protos, convert them with :float(), save to file, and then iterate again and convert back to :cuda(). There shouldn't be issues with this idea, I believe? It seems a little wasteful since I'm shipping the model entirely GPU->CPU and then back CPU->GPU. Perhaps it's possible, and for some reason better, to create a clone somehow, and directly on CPU? #overlycarefulanduncertain
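A rough sketch of that round trip, assuming protos is the usual table of modules (rnn, criterion) and checkpoint is the table being serialized; as the discussion below points out, these casts can untie shared parameters:

for _, module in pairs(protos) do module:float() end   -- GPU -> CPU before saving
torch.save(savefile, checkpoint)
for _, module in pairs(protos) do module:cuda() end    -- back to GPU to continue training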
@karpathy we usually do things like that: write a CPU-only model and re-ship to GPU when needed. It is nice to have it that way because we might be using embedded systems or a small micro and not have CUDA GPUs. Also, for example, if you want to sample 1-2 sentences, it takes less time to process on the CPU than to init and process on the GPU. BTW, thanks a lot for the great package, I have been studying it in great detail. It is nice to have you work with Torch7 and contribute so much!
@karpathy The only thing that I would be super careful about, especially with recurrent nets, is the weight-sharing.
Whenever you typecast it, the weight-sharing will be untied, and you might have to re-share the recurrent connections properly.
I am tracking this issue here: https://github.com/torch/nn/issues/187 Hopefully I'll get time to fix it soon, if no one does before me.
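A toy illustration of that gotcha with two plain nn.Linear layers (not the char-rnn clones themselves), just to make the "re-sharing" step concrete:

require 'nn'
local a = nn.Linear(4, 4)
local b = nn.Linear(4, 4)
b:share(a, 'weight', 'bias', 'gradWeight', 'gradBias')   -- b now views a's parameters
a:float(); b:float()                                      -- the typecast breaks that view
b:share(a, 'weight', 'bias', 'gradWeight', 'gradBias')   -- re-tie on the new tensor type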
@soumith ahhh! Glad I asked, that's precisely the kind of gotcha I was afraid of. I'll keep this in mind.
I would think it would be safer to clone the net to a float net, then save it (edit: compared to converting in-place, then converting back again). I don't think the net takes up much space anyway, right? Just a bunch of weights?
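As a sketch, that clone-first variant might look like this (protos and savefile as in train.lua); note the clone is made on the GPU before the cast, which is the extra-memory concern raised later in the thread:

local cpu_protos = {}
for k, v in pairs(protos) do
    cpu_protos[k] = v:clone():float()   -- the original GPU modules stay untouched
end
torch.save(savefile, {protos = cpu_protos})   -- other checkpoint fields omitted here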
Oh, I see, the point is: if we convert from cuda/cl to float before saving it, the weights will be untied? But if we save directly from cuda/cl, then the weight tying will be preserved correctly?
Yes, I think I was going to do this but then decided it would be tricky due to parameter tying issues. The problem is that when you cast the model with :float(), it destroys the parameter tying. So if you go :float(), save, and then go back with, e.g., :cuda(), you're in for unpleasant surprises. I believe @soumith was going to look into this eventually, preserving the parameter sharing on casts.
Any progress here? Is losing the parameter tying only an issue if you want to continue training from a checkpoint?
i.e. I have trained models on a GPU, and I'm happy to keep the 'canonical GPU' version around for further refinement, if there's a way to do a one-way transformation of a particular checkpoint so that it can be sampled on the CPU.
@ryaneleary: there's no reason you couldn't first implement this as a fork, to enable experimentation and see how well it works. The lines in question are 306-315. I guess you could try something like:
local checkpoint = {}
checkpoint.protos = {}
for k, v in pairs(protos) do        -- protos is a table of modules, so convert each one
    checkpoint.protos[k] = v:float()
end
checkpoint.opt = opt                -- plain Lua table of options, no conversion needed
checkpoint.train_losses = train_losses
checkpoint.val_loss = val_loss      -- this might just be a scalar float anyway
checkpoint.val_losses = val_losses
checkpoint.i = i
checkpoint.epoch = epoch
checkpoint.vocab = loader.vocab_mapping   -- might be cpu-side already
torch.save(savefile, checkpoint)          -- no change required
I created a quick script to convert char-rnn GPU models to CPU models as a temporary solution to this issue. In the long run we'll want to always save a CPU model and ship it to the GPU in the sampling script, if desired by the user. I'll have to make sure this is done in a way that doesn't break parameter sharing during training time.
commit is here: https://github.com/karpathy/char-rnn/commit/86a8eddbb8822bdcf4e42689dfab907c3bd59929
Also added a mention of it to the docs.
@soumith Hey Soumith, RE: this issue with char-rnn, I think there is support now in Torch that doesn't destroy parameter sharing when a model is shipped between CPU and GPU. Though I'm reluctant to make use of it because it requires a very fresh Torch. Another solution to this issue would be if there were a way to do protos.rnn:clone():float() in a single call right before saving a model checkpoint, so that the rnn isn't intermediately fully cloned on the GPU (which could lead to running out of GPU memory). Is there any way to do this clone() op without using additional GPU memory?
One way to make sure everything is okay is to add simple assertions for checking weight sharing. That way, if someone is on an older Torch, they can see the assertion fail and upgrade.
At the moment there's no way to clone+float without cloning on the GPU (and using extra memory).
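One way such a sharing assertion might look (sketch only; assert_shared is a hypothetical helper, and torch.pointer is used to compare the underlying storages):

local function assert_shared(m1, m2)
    assert(torch.pointer(m1.weight:storage()) == torch.pointer(m2.weight:storage()),
           'parameter sharing was lost during the typecast; please update torch/nn')
end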
Question: can the weights be obtained by doing net:getParameters():float()? Per my understanding, getParameters will create a single Storage containing all the weights, and then :float() will simply ship those to main memory, without creating any additional copies on the GPU at that time?
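For reference, a sketch of what that would look like (assuming protos.rnn as in char-rnn; note that getParameters() is normally meant to be called only once per network):

local params, grad_params = protos.rnn:getParameters()   -- one flat CudaTensor of all weights
local cpu_params = params:float()                        -- new FloatTensor copy in host memory
torch.save('params_only.t7', cpu_params)                 -- weights only, no module structure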