FasterTransformer
Drop in performance with TP>1, large batch sizes, and long generations
We noticed a small drop in accuracy (<1%) for the same model when we use tensor sharding. For example, a 355-million-parameter model with tp=1 gets an accuracy of 47.30 on FasterTransformer (the expected performance for my model), while the same model with tp=4 gets an accuracy of 46.90. Has this been noticed before?
I am using the PyTorch docker image nvcr.io/nvidia/pytorch:20.12-py3, which contains PyTorch 1.8.0 and Python 3.8.
Can you try different runtime settings, like top-k/top-p/random seed? My guess is that tensor parallelism changes the GEMM shapes, so the results are slightly different and give a slightly different accuracy. In general they should be on the same level; sometimes one is better, sometimes the other.
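As a minimal illustration of that point (not FasterTransformer code), splitting a dot product into tp shards and summing the partial results in a different order can round differently in fp16, which is enough to nudge a downstream metric by a few tenths of a point:

```python
import numpy as np

np.random.seed(0)
hidden, tp = 4096, 4
x = np.random.randn(hidden).astype(np.float16)  # one activation row
w = np.random.randn(hidden).astype(np.float16)  # one weight column

# tp=1: one reduction over the full hidden dimension (fp32 accumulate, fp16 result)
full = np.float16(np.dot(x.astype(np.float32), w.astype(np.float32)))

# tp=4: each rank reduces its contiguous shard, then the partial sums are combined
# (mimicking the all-reduce), so the rounding happens in a different order.
shard = hidden // tp
parts = [np.float16(np.dot(x[i * shard:(i + 1) * shard].astype(np.float32),
                           w[i * shard:(i + 1) * shard].astype(np.float32)))
         for i in range(tp)]
sharded = np.float16(sum(parts))

print(float(full), float(sharded), float(full) - float(sharded))
```

The per-GEMM difference is tiny, but it compounds over many layers and tokens, so small metric shifts in either direction are expected.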
We also analysed batching and find that for batch sizes >8 the accuracy is somewhat stochastic between runs, whereas for batch sizes <=8 the results are consistent. This is with tp=1:
- batch 1: 0.4740
- batch 8: 0.4740
- batch 16: 0.4604
- batch 16 (again): 0.3530
- batch 16 (again): 0.4670
The variance is too high. Has this been observed before? We want to understand whether it is something on our end. Thank you.
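For reference, this is roughly how we measure the spread. It is our own harness, and `evaluate_fn` here is just a placeholder for our HellaSwag evaluation call, not an existing FasterTransformer API:

```python
import statistics

def check_run_to_run_variance(evaluate_fn, batch_size, n_runs=5, seed=42):
    """Run the same evaluation several times with a fixed random seed and report the spread."""
    scores = [evaluate_fn(batch_size=batch_size, random_seed=seed) for _ in range(n_runs)]
    print(f"batch={batch_size}: mean={statistics.mean(scores):.4f}, "
          f"stdev={statistics.pstdev(scores):.4f}, runs={scores}")

# e.g. check_run_to_run_variance(evaluate_fn, batch_size=16)
```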
Can you try the latest main branch? If you still encounter the same problem, please provide steps to reproduce the issue, thanks.
We built on top of the dev_5.0 branch; are there any changes in main compared to dev_5.0? The important changes we made are:
- We also get the logits from the context decoder (gpt_context_decoder), pass them through a softmax layer, and compute the log-likelihoods of the initial prompt (we wrote a custom kernel for the log of softmax).
- We added a similar log-likelihood step for the gpt_decoder_ too.
The accuracy is computed from these log-likelihoods, and the task we use is HellaSwag. Is there a better way for us to share the code (without making a PR)?
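For context, the scoring itself is straightforward. Below is a simplified sketch of what our evaluation does (our own code, with `prompt_logprobs` standing in for the per-token log-probabilities produced by the modified context decoder):

```python
def score_hellaswag_example(prompt_logprobs, ending_lens, label):
    """
    prompt_logprobs: one sequence of per-token log-probabilities (1-D tensor or list)
                     per candidate ending, covering context + ending.
    ending_lens:     number of tokens in each candidate ending.
    label:           index of the correct ending.
    Returns 1 if the highest-scoring ending is the correct one, else 0.
    """
    scores = []
    for logprobs, n in zip(prompt_logprobs, ending_lens):
        # Score each candidate by the summed log-probability of its ending tokens.
        scores.append(float(sum(logprobs[-n:])))
    return int(max(range(len(scores)), key=scores.__getitem__) == label)
```

The reported accuracy is just the mean of this score over the dataset.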
We may have fixed some bugs affecting stability. The latest branch also provides similar features in the GPT model, which may be helpful. I am not sure what you mean by "again", but if you run a program like gpt_example many times with the same inputs, it should generate the same results. If you still encounter instability, we can try to reproduce the problem and solve it first.
Thanks for the quick reply! I will try merging with main and test it!
> We may have fixed some bugs affecting stability. The latest branch also provides similar features in the GPT model, which may be helpful. I am not sure what you mean by "again", but if you run a program like gpt_example many times with the same inputs, it should generate the same results. If you still encounter instability, we can try to reproduce the problem and solve it first.
Step 2 refers to getting the log-probs of the generated tokens too.
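Concretely, it is a gather over the step-wise logits. Here is a small sketch of the kind of helper we use (our own code, assuming the decoder's logits for the generated positions are available as a tensor):

```python
import torch
import torch.nn.functional as F

def generation_logprobs(step_logits, generated_ids):
    """
    step_logits:   [batch, output_len, vocab] logits produced during generation.
    generated_ids: [batch, output_len] token ids that were actually sampled.
    Returns a [batch, output_len] tensor with the log-probability of each sampled token.
    """
    log_probs = F.log_softmax(step_logits.float(), dim=-1)
    return log_probs.gather(-1, generated_ids.long().unsqueeze(-1)).squeeze(-1)
```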
I am using the PyTorch docker image nvcr.io/nvidia/pytorch:20.12-py3, which contains PyTorch 1.8.0 and Python 3.8. Should this be changed to run on main?
We only verified on the docker image mentioned in the documentation.
We also observe that the sampling/generation quality degrades after 1024 tokens: the model starts outputting gibberish sentences. When we compare the generations against a default TensorFlow 2.8 run we do not see this issue; those samples are not gibberish. Is there some instability in top-p sampling after 1024 tokens? This is on the dev_5.0 branch.
We just run python ../examples/pytorch/gpt/multi_gpu_gpt_example.py --fp16 --ckpt_path /355m-export --output_len=1500 --max_seq_len=2048
> We may have fixed some bugs affecting stability. The latest branch also provides similar features in the GPT model, which may be helpful. I am not sure what you mean by "again", but if you run a program like gpt_example many times with the same inputs, it should generate the same results. If you still encounter instability, we can try to reproduce the problem and solve it first.
Can you please summarize the main bugs that were fixed for stability?
> We also observe that the sampling/generation quality degrades after 1024 tokens: the model starts outputting gibberish sentences. When we compare the generations against a default TensorFlow 2.8 run we do not see this issue; those samples are not gibberish. Is there some instability in top-p sampling after 1024 tokens? This is on the dev_5.0 branch.
Are these kinds of instabilities fixed in main, e.g. long generations (>1024 tokens) with top-p?
We are still porting the changes we made on the dev_5.0 branch to main. Here is one way you can reproduce the generation quality issue mentioned above. When we run python ../examples/pytorch/gpt/multi_gpu_gpt_example.py --fp16 --ckpt_path /355m-export --top_p=1.0 --top_k=0 --temperature=1.0 --output_len=1500 --max_seq_len=2048
we see that the model's generations are very poor; this problem does not occur when we run on vanilla TensorFlow.
> Can you try the latest main branch? If you still encounter the same problem, please provide steps to reproduce the issue, thanks.
I am getting this error on the new main when running the PyTorch multi-GPU example, using the docker image nvcr.io/nvidia/pytorch:21.11-py3. The only thing I added was pip install tensorflow.
[cbafb74f202e:24025:0:24025] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x331)
==== backtrace (tid: 24025) ====
0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7f5eeb2a3824]
1 /opt/hpcx/ucx/lib/libucs.so.0(+0x2b9ff) [0x7f5eeb2a39ff]
2 /opt/hpcx/ucx/lib/libucs.so.0(+0x2bd34) [0x7f5eeb2a3d34]
3 /workspace/FasterTransformer/build/lib/libth_parallel_gpt.so(_ZNK3c1010TensorImpl4dataEv+0xd) [0x7f5e0ef5948d]
4 /workspace/FasterTransformer/build/lib/libth_parallel_gpt.so(_ZN9torch_ext5FTGptIfEC1EmmmmmiiiiiSt6vectorIN2at6TensorESaIS4_EES6_S6_+0x66a) [0x7f5e0ef6e5da]
5 /workspace/FasterTransformer/build/lib/libth_parallel_gpt.so(_ZN9torch_ext13ParallelGptOpC1EllllllllllSt6vectorIN2at6TensorESaIS3_EES5_S5_+0x417) [0x7f5e0ef57637]
6 /workspace/FasterTransformer/build/lib/libth_parallel_gpt.so(_ZN5torch6detail32call_torchbind_method_from_stackIZNS_6class_IN9torch_ext13ParallelGptOpEE3defIJllllllllllSt6vectorIN2at6TensorESaIS9_EESB_SB_EEERS5_NS0_5typesIvJDpT_EEENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt16initializer_listINS_3argEEEUlN3c1014tagged_capsuleIS4_EEllllllllllSB_SB_SB_E_Lb0EJLm0ELm1ELm2ELm3ELm4ELm5ELm6ELm7ELm8ELm9ELm10ELm11ELm12ELm13EEEENSQ_4guts23infer_function_traits_t11return_typeERT_RS7_INSQ_6IValueESaISZ_EESt16integer_sequenceImJXspT1_EEE+0x3de) [0x7f5e0ef7423e]
7 /workspace/FasterTransformer/build/lib/libth_parallel_gpt.so(_ZNSt17_Function_handlerIFvRSt6vectorIN3c106IValueESaIS2_EEEZN5torch6class_IN9torch_ext13ParallelGptOpEE12defineMethodIZNSB_3defIJllllllllllS0_IN2at6TensorESaISF_EESH_SH_EEERSB_NS7_6detail5typesIvJDpT_EEENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt16initializer_listINS7_3argEEEUlNS1_14tagged_capsuleISA_EEllllllllllSH_SH_SH_E_EEPNS7_3jit8FunctionEST_T_ST_SW_EUlS5_E_E9_M_invokeERKSt9_Any_dataS5_+0x26) [0x7f5e0ef74806]
8 /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0xf0474e) [0x7f5f6eacc74e]
9 /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0xf00517) [0x7f5f6eac8517]
10 /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0xf028f3) [0x7f5f6eaca8f3]
11 /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0xa1a295) [0x7f5f6e5e2295]
12 python(PyCFunction_Call+0x54) [0x55ac8f38bf44]
13 python(_PyObject_MakeTpCall+0x31e) [0x55ac8f39b30e]
14 python(+0x1b0f6e) [0x55ac8f410f6e]
15 python(PyObject_Call+0x5e) [0x55ac8f38516e]
16 python(+0x17f42d) [0x55ac8f3df42d]
17 python(_PyObject_MakeTpCall+0x31e) [0x55ac8f39b30e]
18 python(_PyEval_EvalFrameDefault+0x53cf) [0x55ac8f4316ff]
19 python(_PyEval_EvalCodeWithName+0x2c3) [0x55ac8f40edb3]
20 python(+0x1b08b7) [0x55ac8f4108b7]
21 python(_PyEval_EvalFrameDefault+0x4e03) [0x55ac8f431133]
22 python(_PyEval_EvalCodeWithName+0x2c3) [0x55ac8f40edb3]
23 python(+0x1b08b7) [0x55ac8f4108b7]
24 python(_PyEval_EvalFrameDefault+0x4e03) [0x55ac8f431133]
25 python(_PyEval_EvalCodeWithName+0x2c3) [0x55ac8f40edb3]
26 python(_PyFunction_Vectorcall+0x378) [0x55ac8f410198]
27 python(+0x1b1bbf) [0x55ac8f411bbf]
28 python(_PyObject_MakeTpCall+0x2eb) [0x55ac8f39b2db]
29 python(_PyEval_EvalFrameDefault+0x56cb) [0x55ac8f4319fb]
30 python(_PyEval_EvalCodeWithName+0x2c3) [0x55ac8f40edb3]
31 python(_PyFunction_Vectorcall+0x378) [0x55ac8f410198]
32 python(_PyEval_EvalFrameDefault+0x947) [0x55ac8f42cc77]
33 python(_PyEval_EvalCodeWithName+0x2c3) [0x55ac8f40edb3]
34 python(PyEval_EvalCodeEx+0x39) [0x55ac8f40fe19]
35 python(PyEval_EvalCode+0x1b) [0x55ac8f4b224b]
> Can you try the latest main branch? If you still encounter the same problem, please provide steps to reproduce the issue, thanks.
I get this warning while running cmake on main; could it cause any issue with the make step or with PyTorch?
CMake Warning at /opt/conda/lib/python3.8/site-packages/torch/share/cmake/Caffe2/public/utils.cmake:365 (message):
In the future we will require one to explicitly pass TORCH_CUDA_ARCH_LIST
to cmake instead of implicitly setting it as an env variable. This will
become a FATAL_ERROR in future version of pytorch.
Call Stack (most recent call first):
/opt/conda/lib/python3.8/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:444 (torch_cuda_get_nvcc_gencode_flag)
/opt/conda/lib/python3.8/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
/opt/conda/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:203 (find_package)
-- Added CUDA NVCC flags for: -gencode;arch=compute_80,code=sm_80
CMake Warning at /opt/conda/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
/opt/conda/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
CMakeLists.txt:203 (find_package)
For pip install tensorflow, you should use the TF docker image directly.
The warning in the PyTorch docker image is fine.
Closing this issue because it is inactive. Feel free to re-open it if you still have any problems.