
Cudnn 3D Convolution Component

Open galv opened this issue 9 years ago • 123 comments

See #769 for some discussion points.

I realize now there are some commits at the beginning which are not relevant to CUDNN. Sorry, but I'd rather not look at removing those tonight; Tom's code mixes bash script and C++ changes.

galv avatar May 09 '16 06:05 galv

@jtrmal @freewym @tomkocse Could one of you review this PR?

vijayaditya avatar May 09 '16 06:05 vijayaditya

@freewym Please let me know once your review is done.

vijayaditya avatar May 13 '16 02:05 vijayaditya

@freewym Could you please let us know if you are still reviewing this?

vijayaditya avatar May 17 '16 07:05 vijayaditya

Sorry I might not have time until the end of this month. It would be great if someone else could do it during this time, or I can do it after that.

Yiming Wang Department of Computer Science The Johns Hopkins University 3400 N. Charles St. Baltimore, MD 21218

freewym avatar May 17 '16 07:05 freewym

@galv I got a segmentation fault when running nnet-component-test on CLSP:

LOG (UnitTestNnetComponent():nnet-component-test.cc:464) CuDNN3DConvolutionComponent, input-dim=9680, output-dim=3750, learning-rate=0.003, input-x-dim=16, input-y-dim=11, input-z-dim=11, filt-x-dim=2, filt-y-dim=5, filt-z-dim=10, filt-x-stride=1, filt-y-stride=2, filt-z-stride=2, x-zero-pad=0, y-zero-pad=0, z-zero-pad=0, x-upscale=1, y-upscale=1, z-upscale=1, input-vectorization=1, input-num-filters=5, num-filters=10, filter-params-rms=0.1298, bias-params-{mean,stddev}=-0.05663,0.5506

Have you encountered this, and can you tell me how to resolve it?

tomkocse avatar May 18 '16 02:05 tomkocse

It looks like you're running nnet-component-test. Ah I believe it is segfaulting because it is trying to run in CPU mode, not GPU mode. I hope I mentioned this earlier in the PR. Try commenting out the part of that executable that selects to run in CPU mode (or make it always run in GPU mode) and see what happens.

Otherwise, run cuda-memcheck ./nnet3-component-test and valgrind ./nnet3-component-test and tell me the results.

galv avatar May 18 '16 02:05 galv

By the way, I tested this on some GeForce GPU not on the CLSP cluster; but I doubt that is the cause of the issue.

galv avatar May 18 '16 02:05 galv

nnet-component-test passes when I skip the CPU mode.

I am now changing the user interface (steps/nnet3/components.py) to use your cuDNN convolution component, to see if I can reproduce my previous result at a faster speed.

tomkocse avatar May 18 '16 03:05 tomkocse

Yes. But to be sure, here is how I define "x-stride": the number of elements moved along the x dimension after each convolution is performed, in preparation for the next convolution. Also note that I allow many options to be omitted. Hopefully that makes things easier for you.
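To make the stride definition above concrete, here is a minimal Python sketch (not code from this PR) that lists where the filter is placed along one axis. The function names are hypothetical, and the placement rule assumes the usual "valid" convolution with symmetric zero padding; the component's actual behaviour should be checked against the source.

```python
def window_starts(input_dim, filt_dim, stride, zero_pad=0):
    """Start offsets (in the zero-padded axis) of every filter placement."""
    padded = input_dim + 2 * zero_pad
    if filt_dim > padded:
        return []  # the filter does not fit anywhere
    # The filter moves `stride` elements after each convolution.
    return list(range(0, padded - filt_dim + 1, stride))


def output_dim(input_dim, filt_dim, stride, zero_pad=0):
    """Number of filter placements, i.e. the output size along this axis."""
    return len(window_starts(input_dim, filt_dim, stride, zero_pad))


# An axis of 11 elements, a filter of 5 elements, a stride of 2.
print(window_starts(11, 5, 2))
print(output_dim(11, 5, 2))
```

With a stride of 1 the filter visits every offset; a larger stride skips offsets and shrinks the output along that axis.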

Also, are you sure that inference (decoding) will be okay? I think I recall being able to do inference with a GPU, but I've never done it myself because I have not needed to. I realize that changing the interface from the original ConvolutionComponent may have been a bad idea. Sorry.

galv avatar May 18 '16 06:05 galv

We can decode with GPU though we usually decode with CPU. We can fix the CPU issue after we verify the result.

tomkocse avatar May 18 '16 09:05 tomkocse

I got an error while initializing the acoustic model:

nnet3-am-train-transitions - 'ark:gunzip -c exp/tri4_ali_nodup_sp/ali.*.gz|' exp/nnet3/tdnn_3dcnn_sp/0.mdl
nnet3-am-init exp/tri4_ali_nodup_sp/final.mdl exp/nnet3/tdnn_3dcnn_sp/0.raw -
LOG (nnet3-am-init:InputDim():nnet-cudnn-simple-component.cc:311) input_num_filters_ 1 input_x_dim_ 5 input_y_dim_ 40 input_z_dim_ 1
ERROR (nnet3-am-init:GetOutputDims():nnet-cudnn-simple-component.cc:297) cudnnStatus_t 3 : "CUDNN_STATUS_BAD_PARAM" returned from 'cudnnGetConvolutionNdForwardOutputDim(conv_desc_, in_desc, filter_desc_, kConvolutionDimension_ + 2, output_dims )'

Have you encountered this kind of error?

tomkocse avatar May 18 '16 09:05 tomkocse

@galv, I solved the "CUDNN_STATUS_BAD_PARAM" error returned from 'cudnnGetConvolutionNdForwardOutputDim' by changing the 3D filter descriptor into a 5D one. It now returns the correct filter dimensions.
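A small Python sketch of why the descriptors need five entries rather than three: the error message above passes kConvolutionDimension_ + 2 as the number of dimensions, so a 3D convolution needs two non-spatial entries on top of the three spatial ones. The function names and the axis ordering inside each list are assumptions for illustration only; they are not taken from the PR.

```python
CONV_DIMS = 3  # spatial dimensions of a 3D convolution (x, y, z)


def filter_desc_dims(num_filters, input_num_filters, filt_z, filt_y, filt_x):
    """Filter dims: two filter-count entries plus three spatial entries."""
    dims = [num_filters, input_num_filters, filt_z, filt_y, filt_x]
    assert len(dims) == CONV_DIMS + 2
    return dims


def input_desc_dims(batch_size, input_num_filters, z_dim, y_dim, x_dim):
    """Input dims: batch and channel entries plus three spatial entries."""
    dims = [batch_size, input_num_filters, z_dim, y_dim, x_dim]
    assert len(dims) == CONV_DIMS + 2
    return dims


# Input sizes from the log above; batch size and filter sizes are made up.
print(input_desc_dims(1, 1, 1, 40, 5))
print(filter_desc_dims(32, 1, 1, 4, 4))
```

Passing only the three spatial sizes leaves the filter and input descriptors inconsistent with the requested number of dimensions, which is one plausible way to end up with a bad-parameter status.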

tomkocse avatar May 18 '16 16:05 tomkocse

Hi, I am testing the component with a small config and I am getting an error during training: KALDI_ASSERT: at nnet3-train:HouseBackward:qr.cc:124, failed: KALDI_ISFINITE(sigma) && "Tridiagonalizing matrix that is too large or has NaNs."

I initialize the component with the following line of config: component name=L0_conv type=CuDNN3DConvolutionComponent input-x-dim=5 input-y-dim=40 input-z-dim=1 filt-x-dim=4 filt-y-dim=4 filt-z-dim=1 filt-x-stride=1 filt-y-stride=1 filt-z-stride=1 input-num-filters=1 num-filters=32

Does anyone have an idea how to fix this?

tomkocse avatar May 19 '16 02:05 tomkocse

That can happen due to parameter divergence (the warning is from the natural gradient code). But there should be warnings or info messages printed prior to that, about limiting the parameter change. See if reducing the max-change or the learning rate helps.

Dan
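As a rough illustration of what limiting the parameter change means, the Python sketch below rescales an update whose 2-norm exceeds a threshold. It is a generic norm-clipping example with hypothetical names, not Kaldi's max-change implementation.

```python
import math


def limit_change(delta, max_change):
    """Scale `delta` down so that its 2-norm is at most `max_change`.

    Returns the (possibly scaled) update and the scale that was applied.
    """
    norm = math.sqrt(sum(d * d for d in delta))
    if norm == 0.0 or norm <= max_change:
        return list(delta), 1.0
    scale = max_change / norm
    return [d * scale for d in delta], scale


def sgd_delta(grad, learning_rate):
    """Plain SGD step: the proposed change is -learning_rate * gradient."""
    return [-learning_rate * g for g in grad]


# A large gradient produces a large step; clipping bounds its size.
step = sgd_delta([300.0, 400.0], learning_rate=0.01)
limited, scale = limit_change(step, max_change=1.0)
print(limited, scale)
```

Lowering either the learning rate or the threshold shrinks the applied step, which is why both are worth trying when parameters diverge.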

danpovey avatar May 19 '16 02:05 danpovey

FYI, the NVidia guys mentioned that they have released v5 of the toolkit today.

danpovey avatar May 19 '16 18:05 danpovey

They mentioned something about a many-times speedup for LSTMs-- have a look to see what they have regarding that, maybe it's new. Obviously for nnet3 it's unlikely it will be a natural slot-in.

danpovey avatar May 19 '16 18:05 danpovey

Busy finishing up an exam, but I will mention that CUDNN v5 requires CUDA 7.5. I think CLSP has 7.0

Daniel Galvez

galv avatar May 19 '16 18:05 galv

@galv Do you have time to add a function to the unit test that checks whether the cuDNNConvolution component and the original Convolution component give the same update to the filter parameters, given the same learning rate?
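The shape of such a check can be sketched in plain Python with two stand-in implementations of a 1D filter gradient. Everything here is hypothetical and greatly simplified; a real test would feed the same input and output derivative to both Kaldi components and compare their updated filter parameters.

```python
import random


def filter_grad_direct(x, dout, filt_dim, stride):
    """dw[k] = sum_i dout[i] * x[i * stride + k], with explicit loops."""
    grad = [0.0] * filt_dim
    for i, g in enumerate(dout):
        for k in range(filt_dim):
            grad[k] += g * x[i * stride + k]
    return grad


def filter_grad_patches(x, dout, filt_dim, stride):
    """Same gradient, computed by first extracting one input patch per output."""
    patches = [x[i * stride:i * stride + filt_dim] for i in range(len(dout))]
    return [sum(p[k] * g for p, g in zip(patches, dout))
            for k in range(filt_dim)]


def sgd_update(params, grad, learning_rate):
    return [p - learning_rate * g for p, g in zip(params, grad)]


def updates_match(a, b, tol=1e-6):
    """Element-wise comparison with a relative tolerance."""
    return len(a) == len(b) and all(
        abs(u - v) <= tol * max(1.0, abs(u), abs(v)) for u, v in zip(a, b))


rng = random.Random(0)
filt_dim, stride = 5, 2
x = [rng.uniform(-1, 1) for _ in range(11)]
num_out = (len(x) - filt_dim) // stride + 1
dout = [rng.uniform(-1, 1) for _ in range(num_out)]
w = [rng.uniform(-1, 1) for _ in range(filt_dim)]

w_a = sgd_update(w, filter_grad_direct(x, dout, filt_dim, stride), 0.003)
w_b = sgd_update(w, filter_grad_patches(x, dout, filt_dim, stride), 0.003)
print(updates_match(w_a, w_b))
```

Starting both implementations from identical parameters and identical data isolates the update rule itself, so any mismatch points at the backward pass rather than at initialization.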

tomkocse avatar May 20 '16 03:05 tomkocse

@tomkocse Hi Tom, I'm guessing that you're asking this because you are unable to get similar training convergence between the CUDNN3DConvolution component and the existing Convolution component.

By the way, you proposed changing the serialization interface; for changes like these you are free to push commits to this branch yourself. Wait... I realize you may not have write access to my personal kaldi repository. Never mind.

Anyway, I have a backlog here; I will be finished with exams on May 23rd. I am sorry for my delay. I will see about making the test you propose and reading over other comments, including @freewym's this weekend. I may not be done until after May 23rd though...

galv avatar May 20 '16 06:05 galv

@galv Yes it seems the training convergence is different between the two components, so it will be good to have a test checking it.

Feel free to resume the PR after May 23; I will try to figure out other potential problems in the meantime.

tomkocse avatar May 20 '16 07:05 tomkocse

Alright, finally back on this. Sorry for delay. I'll start by considering the rest of @freewym's comments.

galv avatar May 26 '16 20:05 galv

@tomkocse Can you try the most recent code? I just pushed a series of commits. Again, let me know if there are problems.

I have not added a test that ConvolutionComponent and CuDNN3DConvolutionComponent are the same in training, but I did fix the problems you and Yiming mentioned. I know that you fixed the problem yourself, but there were a few places where I mistook the filter dimension to be 3 instead of 5.

I also changed the interface to use "step" instead of "stride" in the config line interface. See a086686.

Can you let me know which branch you have been using in your repo?

galv avatar May 27 '16 03:05 galv

@galv I am using the same branch as this PR.

tomkocse avatar May 27 '16 09:05 tomkocse

@tomkocse I mean, in your personal repository (this page), what branch are you using? I am curious what changes you made to make the 3d -> 5d fix. It is possible you did not fix everything correctly, since the code related to the fix is split across three sections of code. I think this PR's branch may fix the problems you experienced with training convergence.

galv avatar May 27 '16 16:05 galv

@galv I created a new local repository by cloning your branch. What I did was: git clone https://github.com/galv/kaldi.git --branch cudnn-2

tomkocse avatar May 28 '16 01:05 tomkocse

Okay. It would be helpful to me if you pushed your local work to your personal remote branch occasionally. I would like to see your python script for creating 3D components, in case I want to experiment myself. I hope that is not a big hassle for you.

Daniel Galvez

galv avatar May 28 '16 03:05 galv

OK, that will be fine.

tomkocse avatar May 28 '16 05:05 tomkocse

@galv Please see the changes I made to your code at https://github.com/galv/kaldi/pull/1

You can go through the whole process of creating and training a network with the cudnnConvolution component by running egs/swbd/s5c/local/nnet3/run_tdnn_3dcnn.sh

tomkocse avatar May 31 '16 06:05 tomkocse

@galv Could you please tell me the default input/output vectorization order (e.g. yzx, xyz) of the cudnnConvolution component? I am asking because I have to set the correct input parameters for the component.

Usually, after the splicing layer, the tensor format is yzx, where y is the feature dimension (40), z is the delta-delta dimension (3 if used, otherwise 1) and x is the splicing dimension (5 in our case). With the original component, I can set input-x-dim=5 input-y-dim=40 input-z-dim=1 input-vectorization-order=yzx to match this. I will have to rearrange x, y and z if cuDNN makes a different assumption about the input vectorization order. I also need to know its output vectorization order so that I can set the parameters of the next layer.

BTW, do the cuDNN routines provide an option to change the input/output vectorization order?
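To make the question concrete, the Python sketch below computes where a tensor element lands in the flattened vector for a given vectorization order. It assumes that the order name lists the axes from fastest-varying to slowest-varying (so yzx means y changes fastest and x slowest, which matches spliced frames of 40 features); that reading is an assumption and should be checked against the component source. The function name is hypothetical.

```python
def flat_index(coords, dims, order):
    """Position of element `coords` in a vector flattened in `order`.

    coords and dims are dicts keyed by 'x', 'y', 'z'; `order` names the
    axes from fastest-varying to slowest-varying (assumed convention).
    """
    index, step = 0, 1
    for axis in order:
        index += coords[axis] * step
        step *= dims[axis]
    return index


# 5 spliced frames (x), 40 features (y), 3 delta streams (z).
dims = {"x": 5, "y": 40, "z": 3}
element = {"x": 0, "y": 1, "z": 0}
print(flat_index(element, dims, "yzx"))
print(flat_index(element, dims, "zyx"))
```

When input-z-dim is 1, the two orders give the same index for every element, so the difference only matters once a third dimension is actually in use.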

tomkocse avatar Jun 01 '16 04:06 tomkocse

> Could you please tell me what is the default input/output vectorization order (e.g. yzx, xyz) of the cudnnConvolution component?

The default vectorization order for both input and output is zyx for CUDNN. Right now my code only supports this vectorization. You can see the line: KALDI_ASSERT(input_vectorization_ == kZyx && "Only zyx vectorization supported right now."); But this is solvable... see below:

> BTW, do the cuDNN routines provide an option to change the input/output vectorization order?

I think so. cudnnTransformTensor provides exactly the functionality we need to turn yzx vectorization into zyx vectorization.
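What such a transform has to accomplish can be sketched on the CPU in plain Python: copy every element from its position under one vectorization order to its position under the other. This is only an illustration of the index permutation, with hypothetical function names; it is not a cuDNN call, and it assumes the order name lists axes from fastest-varying to slowest-varying.

```python
from itertools import product


def flat_index(coords, dims, order):
    """Flattened position; `order` lists axes fastest-varying first (assumed)."""
    index, step = 0, 1
    for axis in order:
        index += coords[axis] * step
        step *= dims[axis]
    return index


def reorder(vec, dims, src_order, dst_order):
    """Return a copy of `vec` rearranged from src_order to dst_order."""
    assert len(vec) == dims["x"] * dims["y"] * dims["z"]
    out = [None] * len(vec)
    for x, y, z in product(range(dims["x"]), range(dims["y"]), range(dims["z"])):
        coords = {"x": x, "y": y, "z": z}
        out[flat_index(coords, dims, dst_order)] = vec[flat_index(coords, dims, src_order)]
    return out


dims = {"x": 2, "y": 2, "z": 2}
yzx_vector = list(range(8))
zyx_vector = reorder(yzx_vector, dims, "yzx", "zyx")
print(zyx_vector)
```

Applying the function again with the two orders swapped restores the original vector, which is a cheap sanity check for any layout conversion placed in front of the component.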

Anyway, vectorization is definitely a problem that the nnet3 component should handle, not the python script.

Also, you can find more details in sections 2.3 and 4.45 of the manual. If something I said sounds strange or the manual is confusing, let me know and we can talk about it.

galv avatar Jun 01 '16 05:06 galv