
Training w/ GPU, sampling without GPU

Open wiseman opened this issue 9 years ago • 17 comments

Is the restriction that a model trained with a GPU can only be sampled with a GPU part of the char-rnn code, or of the torch code? It would be handy to be able to train a model on a machine with a fast GPU and then use the model on another machine.

wiseman avatar Jun 13 '15 20:06 wiseman

You can call :float() on your gpu-model and it changes into a cpu model.

soumith avatar Jun 13 '15 20:06 soumith

Yes, I had this problem too: the variables are saved as CUDA tensors, so you need cutorch and cunn just to load them. It should save in float and then allow loading the nets for use on either CPU or GPU; in the GPU case you can then call :cuda().
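
For anyone hitting this right now, the workaround looks roughly like the following (a sketch, not an official fix; the checkpoint path and output filename are made up):

    -- load a GPU checkpoint and re-save it as a CPU one (sketch)
    require 'nn'
    require 'cutorch'   -- still needed once here, because the saved tensors are CUDA tensors
    require 'cunn'
    local checkpoint = torch.load('cv/some_checkpoint.t7')        -- hypothetical path
    for name, proto in pairs(checkpoint.protos) do
        checkpoint.protos[name] = proto:float()                   -- ship each module to the CPU
    end
    torch.save('cv/some_checkpoint_cpu.t7', checkpoint)           -- loadable without cutorch/cunn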

culurciello avatar Jun 13 '15 21:06 culurciello

I was just thinking about this as well. @soumith, is the preferred solution to always save CPU models and explicitly convert to GPU in the sampling script if the user wants? This seems like the right way to go, as culurciello mentions.

karpathy avatar Jun 13 '15 21:06 karpathy

Yes, that seems like the right way to go.

soumith avatar Jun 13 '15 22:06 soumith

@soumith I'm not fully comfortable with some of these APIs and best practices yet. I'm planning to iterate over all entries in protos, convert them with :float(), save to file, and then iterate again and convert back with :cuda(). There shouldn't be any issues with this approach, I believe? It does seem a little wasteful, since I'm shipping the model entirely GPU->CPU and then back CPU->GPU. Perhaps it's possible, and for some reason better, to somehow create the clone directly on the CPU? #overlycarefulanduncertain
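
Concretely, something like this (a sketch using the variable names from train.lua; the weight-sharing caveat raised just below applies to the two casts):

    -- cast the prototypes to CPU, save, then cast back so training can continue
    for k, proto in pairs(protos) do protos[k] = proto:float() end   -- GPU -> CPU
    local checkpoint = {}
    checkpoint.protos = protos
    checkpoint.opt = opt
    checkpoint.vocab = loader.vocab_mapping
    torch.save(savefile, checkpoint)
    for k, proto in pairs(protos) do protos[k] = proto:cuda() end    -- CPU -> GPU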

karpathy avatar Jun 14 '15 17:06 karpathy

@karpathy we usually do things like that: save a CPU-only model and re-ship it to the GPU when needed. It is nice to have it that way because we might be running on embedded systems or a small micro without CUDA GPUs. Also, for example, if you want to sample just 1-2 sentences, it takes less time to process on the CPU than to initialize and process on the GPU. BTW, thanks a lot for the great package; I have been studying it in great detail. It is nice to have you working with Torch7 and contributing so much!

culurciello avatar Jun 14 '15 18:06 culurciello

@karpathy The only thing that I would be super careful about, especially with recurrent nets, is the weight-sharing.

Whenever you typecast it, the weight-sharing will be untied, and you might have to re-share the recurrent connections properly.

I am tracking this issue here: https://github.com/torch/nn/issues/187 Hopefully I'll get time to fix it soon, if no one does before me.
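
To make the gotcha concrete, here is a tiny standalone illustration (plain nn modules rather than char-rnn's actual clones, assuming a CUDA-capable install):

    require 'nn'
    require 'cunn'
    local a = nn.Linear(4, 4)
    local b = nn.Linear(4, 4):share(a, 'weight', 'bias', 'gradWeight', 'gradBias')
    a.weight:fill(1)
    print(b.weight[1][1])   -- 1: b views a's storage, sharing works
    a:cuda(); b:cuda()      -- typecast each module
    a.weight:fill(2)
    print(b.weight[1][1])   -- still 1: the cast allocated fresh storages and untied the sharing
    b:share(a, 'weight', 'bias', 'gradWeight', 'gradBias')   -- re-share after the cast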

soumith avatar Jun 14 '15 20:06 soumith

@soumith ahhh! Glad I asked, that's precisely the kind of gotcha I was afraid of. I'll keep this in mind.

karpathy avatar Jun 14 '15 21:06 karpathy

I would think it would be safer to clone the net to a float net and then save that (edit: rather than converting in-place and then converting back again). I don't think the net takes up much space anyway, right? Just a bunch of weights?

hughperkins avatar Jul 04 '15 06:07 hughperkins

Oh, I see, the point is: if we convert from cuda/cl to float before saving, the weights will be untied? But if we save directly from cuda/cl, then the weight tying will be preserved correctly?

hughperkins avatar Jul 04 '15 06:07 hughperkins

Yes, I think I was going to do this but then decided it would be tricky due to parameter-tying issues. The problem is that when you cast the model with :float(), it destroys the parameter tying. So if you go :float(), save, and then go back with, e.g., :cuda(), you're in for unpleasant surprises. I believe @soumith was going to look into preserving the parameter sharing on casts eventually.

karpathy avatar Jul 04 '15 12:07 karpathy

Any progress here? Is losing the parameter tying only an issue if you want to continue training from a checkpoint?

i.e. I have models trained on a GPU, and I'm happy to keep the 'canonical' GPU version around for further refinement, as long as there's a way to do a one-way transformation of a particular checkpoint so that it can be sampled on the CPU.

ryanleary avatar Jul 09 '15 21:07 ryanleary

@ryanleary: there's no reason you couldn't implement this in a fork first, to enable experimentation and see how well it works. The lines in question are 306-315. I guess you could try something like:

        local checkpoint = {}
        checkpoint.protos = {}
        -- protos is a table of modules (rnn, criterion, ...), so convert each one;
        -- note :float() converts in place, so cast back with :cuda() after saving
        -- if you want to keep training on the GPU (see the sharing caveat above)
        for k, proto in pairs(protos) do
          checkpoint.protos[k] = proto:float()
        end
        checkpoint.opt = opt                      -- plain Lua table of options, no cast needed
        checkpoint.train_losses = train_losses    -- plain Lua table of numbers
        checkpoint.val_loss = val_loss            -- just a scalar
        checkpoint.val_losses = val_losses        -- plain Lua table of numbers
        checkpoint.i = i
        checkpoint.epoch = epoch
        checkpoint.vocab = loader.vocab_mapping   -- already cpu-side (a plain Lua table)
        torch.save(savefile, checkpoint)          -- no change required

hughperkins avatar Jul 09 '15 22:07 hughperkins

I created a quick script to convert char-rnn GPU models to CPU models as a temporary solution to this issue. In the long run we'll want to always save a CPU model and ship it to the GPU in the sampling script, if desired by the user. I'll have to make sure this is done in a way that doesn't break parameter sharing during training time.

commit is here: https://github.com/karpathy/char-rnn/commit/86a8eddbb8822bdcf4e42689dfab907c3bd59929

Also added a mention of this to the docs.

karpathy avatar Aug 05 '15 22:08 karpathy

@soumith Hey Soumith, RE: this issue with char-rnn, I think there is now support in Torch that doesn't destroy parameter sharing when a model is shipped between CPU and GPU. Though I'm reluctant to make use of it because it requires a very fresh Torch. Another solution would be a way to do protos.rnn:clone():float() in a single call right before saving a model checkpoint, so that the rnn isn't first fully cloned on the GPU (which could lead to running out of GPU memory). Is there any way to do this clone-and-copy op without using additional GPU memory?
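
In other words, the two options as I see them (sketches, names as in train.lua):

    -- option A: clone on the GPU, then cast the clone; costs a full extra copy in GPU memory
    local cpu_rnn = protos.rnn:clone():float()
    -- option B: cast in place, save, cast back; no extra GPU memory, but relies on the newer
    -- sharing-preserving casts mentioned above (or on manually re-sharing afterwards)
    protos.rnn:float()
    -- ... save the checkpoint here ...
    protos.rnn:cuda()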

karpathy avatar Sep 20 '15 21:09 karpathy

One way to make sure everything is okay is to add simple assertions checking the weight sharing. That way, if someone is on an older torch, they will see the assertion fail and know to upgrade.

At the moment there's no way to clone+float without cloning on the GPU (and using extra memory).
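
Such an assertion could be as simple as comparing storage pointers right after the cast (a sketch; a and b stand for any two modules that are supposed to share weights):

    -- hypothetical helper: fail loudly if two modules no longer share their weight storage
    local function assert_shared(a, b)
      assert(torch.pointer(a.weight:storage()) == torch.pointer(b.weight:storage()),
             'weight sharing was lost -- re-share the clones or upgrade torch/nn')
    end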

soumith avatar Sep 21 '15 02:09 soumith

Question: can the weights be obtained by doing net:getParameters():float()? Per my understanding, getParameters will create a single Storage containing all the weights, and then :float() will simply ship those to main memory, without creating any additional copies on the GPU at that time?
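
i.e. something along these lines (a sketch of the question, with a made-up filename; note it would save only the raw weights, not the network structure):

    local params = net:getParameters()    -- flattens all weights into one 1-D CudaTensor
                                          -- (the GPU memory cost of this step is the open question)
    local cpu_params = params:float()     -- a single device -> host copy into a FloatTensor
    torch.save('weights_only.t7', cpu_params)   -- the CPU side would have to rebuild the net
                                                -- and copy these weights back in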

hughperkins avatar Jan 06 '16 12:01 hughperkins