FasterTransformer
Drop in performance with TP>1, large batch sizes, and long generations
We noticed a small drop in accuracy (<1%) for the same model when we use tensor sharding. For example, a 355-million-parameter model with tp=1 gets an accuracy of 47.30 on FasterTransformer (the expected performance for my model), while the same model with tp=4 gets an accuracy of 46.90. Has this been noticed before?
I am using the PyTorch docker image nvcr.io/nvidia/pytorch:20.12-py3, which contains PyTorch 1.8.0 and Python 3.8.
Can you try different runtime settings, like top-k/top-p/random seed? My guess is that tensor parallelism changes the GEMM shapes, so the results are slightly different and give a slightly different accuracy. In general they should be on the same level; sometimes one is better, sometimes the other.
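As a minimal illustration of that point (not FasterTransformer code), splitting a dot product into tp shards and summing the partial results in a different order can round differently in fp16, which is enough to nudge a downstream metric by a few tenths of a point:

```python
import numpy as np

np.random.seed(0)
hidden, tp = 4096, 4
x = np.random.randn(hidden).astype(np.float16)  # one activation row
w = np.random.randn(hidden).astype(np.float16)  # one weight column

# tp=1: one reduction over the full hidden dimension (fp32 accumulate, fp16 result)
full = np.float16(np.dot(x.astype(np.float32), w.astype(np.float32)))

# tp=4: each rank reduces its contiguous shard, then the partial sums are combined
# (mimicking the all-reduce), so the rounding happens in a different order.
shard = hidden // tp
parts = [np.float16(np.dot(x[i * shard:(i + 1) * shard].astype(np.float32),
                           w[i * shard:(i + 1) * shard].astype(np.float32)))
         for i in range(tp)]
sharded = np.float16(sum(parts))

print(float(full), float(sharded), float(full) - float(sharded))
```

The per-GEMM difference is tiny, but it compounds over many layers and tokens, so small metric shifts in either direction are expected.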
We also analysed batching and find that for batch sizes >8 the accuracy is somewhat stochastic between runs, whereas for batch sizes <=8 the results are consistent. This is with tp=1:
- batch 1: 0.4740
- batch 8: 0.4740
- batch 16: 0.4604
- batch 16 (again): 0.3530
- batch 16 (again): 0.4670
The variance is too high. Has this been observed before? We want to understand whether it is something on our end. Thank you.
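For reference, this is roughly how we measure the spread. It is our own harness, and `evaluate_fn` here is just a placeholder for our HellaSwag evaluation call, not an existing FasterTransformer API:

```python
import statistics

def check_run_to_run_variance(evaluate_fn, batch_size, n_runs=5, seed=42):
    """Run the same evaluation several times with a fixed random seed and report the spread."""
    scores = [evaluate_fn(batch_size=batch_size, random_seed=seed) for _ in range(n_runs)]
    print(f"batch={batch_size}: mean={statistics.mean(scores):.4f}, "
          f"stdev={statistics.pstdev(scores):.4f}, runs={scores}")

# e.g. check_run_to_run_variance(evaluate_fn, batch_size=16)
```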
Can you try the latest main branch? If you still encounter the same problem, please provide steps to reproduce the issue, thanks.
We built on top of the dev_5.0 branch; are there any changes in main compared to dev_5.0? The important changes we made are:
- We also get the logits from the context decoder (gpt_context_decoder), pass them through a softmax layer, and compute the log-likelihoods of the initial prompt (we wrote a custom kernel for the log of softmax).
- We added a similar log-likelihood step for the gpt_decoder_ too.
The accuracy is computed from these log-likelihoods, and the task we use is HellaSwag. Is there a better way for us to share the code (without making a PR)?
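For context, the scoring itself is straightforward. Below is a simplified sketch of what our evaluation does (our own code, with `prompt_logprobs` standing in for the per-token log-probabilities produced by the modified context decoder):

```python
def score_hellaswag_example(prompt_logprobs, ending_lens, label):
    """
    prompt_logprobs: one sequence of per-token log-probabilities (1-D tensor or list)
                     per candidate ending, covering context + ending.
    ending_lens:     number of tokens in each candidate ending.
    label:           index of the correct ending.
    Returns 1 if the highest-scoring ending is the correct one, else 0.
    """
    scores = []
    for logprobs, n in zip(prompt_logprobs, ending_lens):
        # Score each candidate by the summed log-probability of its ending tokens.
        scores.append(float(sum(logprobs[-n:])))
    return int(max(range(len(scores)), key=scores.__getitem__) == label)
```

The reported accuracy is just the mean of this score over the dataset.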
We may have fixed some bugs affecting stability. The latest branch also provides similar features in the GPT model, which may be helpful. I am not sure what you mean by "again", but if you run a program like gpt_example many times with the same inputs, it should generate the same results. If you still encounter instability, we can try to reproduce the problem and solve it first.
Thanks for the quick reply! I will try merging with main and test it!
> We may have fixed some bugs affecting stability. The latest branch also provides similar features in the GPT model, which may be helpful. I am not sure what you mean by "again", but if you run a program like gpt_example many times with the same inputs, it should generate the same results. If you still encounter instability, we can try to reproduce the problem and solve it first.
Step 2 refers to getting the log-probs of the generated tokens too.
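Concretely, it is a gather over the step-wise logits. Here is a small sketch of the kind of helper we use (our own code, assuming the decoder's logits for the generated positions are available as a tensor):

```python
import torch
import torch.nn.functional as F

def generation_logprobs(step_logits, generated_ids):
    """
    step_logits:   [batch, output_len, vocab] logits produced during generation.
    generated_ids: [batch, output_len] token ids that were actually sampled.
    Returns a [batch, output_len] tensor with the log-probability of each sampled token.
    """
    log_probs = F.log_softmax(step_logits.float(), dim=-1)
    return log_probs.gather(-1, generated_ids.long().unsqueeze(-1)).squeeze(-1)
```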
I am using the PyTorch docker image nvcr.io/nvidia/pytorch:20.12-py3, which contains PyTorch 1.8.0 and Python 3.8. Should this be changed to run on main?
We only verified on the docker image mentioned in the documentation.
We also observe that the sampling/generation quality degrades after 1024 tokens: the model starts outputting gibberish sentences. When we compare the generations against a default TensorFlow 2.8 run we do not see this issue; those samples are not gibberish. Is there some instability in top-p sampling after 1024 tokens? This is on the dev_5.0 branch.
We just run python ../examples/pytorch/gpt/multi_gpu_gpt_example.py --fp16 --ckpt_path /355m-export --output_len=1500 --max_seq_len=2048
> We may have fixed some bugs affecting stability. The latest branch also provides similar features in the GPT model, which may be helpful. I am not sure what you mean by "again", but if you run a program like gpt_example many times with the same inputs, it should generate the same results. If you still encounter instability, we can try to reproduce the problem and solve it first.
Can you please summarize the main bugs that were fixed for stability?
> We also observe that the sampling/generation quality degrades after 1024 tokens: the model starts outputting gibberish sentences. When we compare the generations against a default TensorFlow 2.8 run we do not see this issue; those samples are not gibberish. Is there some instability in top-p sampling after 1024 tokens? This is on the dev_5.0 branch.
Are these kinds of instabilities fixed in main, e.g. long generations (>1024 tokens) with top-p?
We are still porting the changes we made on the dev_5.0 branch to main. Here is one way you can reproduce the generation quality issue mentioned above. When we run python ../examples/pytorch/gpt/multi_gpu_gpt_example.py --fp16 --ckpt_path /355m-export --top_p=1.0 --top_k=0 --temperature=1.0 --output_len=1500 --max_seq_len=2048
we see that the model's generations are very poor; this problem does not occur when we run on vanilla TensorFlow.
> Can you try the latest main branch? If you still encounter the same problem, please provide steps to reproduce the issue, thanks.
I am getting this error on the new main when running the PyTorch multi-GPU example, using the docker image nvcr.io/nvidia/pytorch:21.11-py3. The only thing I added was pip install tensorflow.
[cbafb74f202e:24025:0:24025] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x331)
==== backtrace (tid: 24025) ====
0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7f5eeb2a3824]
1 /opt/hpcx/ucx/lib/libucs.so.0(+0x2b9ff) [0x7f5eeb2a39ff]
2 /opt/hpcx/ucx/lib/libucs.so.0(+0x2bd34) [0x7f5eeb2a3d34]
3 /workspace/FasterTransformer/build/lib/libth_parallel_gpt.so(_ZNK3c1010TensorImpl4dataEv+0xd) [0x7f5e0ef5948d]
4 /workspace/FasterTransformer/build/lib/libth_parallel_gpt.so(_ZN9torch_ext5FTGptIfEC1EmmmmmiiiiiSt6vectorIN2at6TensorESaIS4_EES6_S6_+0x66a) [0x7f5e0ef6e5da]
5 /workspace/FasterTransformer/build/lib/libth_parallel_gpt.so(_ZN9torch_ext13ParallelGptOpC1EllllllllllSt6vectorIN2at6TensorESaIS3_EES5_S5_+0x417) [0x7f5e0ef57637]
6 /workspace/FasterTransformer/build/lib/libth_parallel_gpt.so(_ZN5torch6detail32call_torchbind_method_from_stackIZNS_6class_IN9torch_ext13ParallelGptOpEE3defIJllllllllllSt6vectorIN2at6TensorESaIS9_EESB_SB_EEERS5_NS0_5typesIvJDpT_EEENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt16initializer_listINS_3argEEEUlN3c1014tagged_capsuleIS4_EEllllllllllSB_SB_SB_E_Lb0EJLm0ELm1ELm2ELm3ELm4ELm5ELm6ELm7ELm8ELm9ELm10ELm11ELm12ELm13EEEENSQ_4guts23infer_function_traits_t11return_typeERT_RS7_INSQ_6IValueESaISZ_EESt16integer_sequenceImJXspT1_EEE+0x3de) [0x7f5e0ef7423e]
7 /workspace/FasterTransformer/build/lib/libth_parallel_gpt.so(_ZNSt17_Function_handlerIFvRSt6vectorIN3c106IValueESaIS2_EEEZN5torch6class_IN9torch_ext13ParallelGptOpEE12defineMethodIZNSB_3defIJllllllllllS0_IN2at6TensorESaISF_EESH_SH_EEERSB_NS7_6detail5typesIvJDpT_EEENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt16initializer_listINS7_3argEEEUlNS1_14tagged_capsuleISA_EEllllllllllSH_SH_SH_E_EEPNS7_3jit8FunctionEST_T_ST_SW_EUlS5_E_E9_M_invokeERKSt9_Any_dataS5_+0x26) [0x7f5e0ef74806]
8 /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0xf0474e) [0x7f5f6eacc74e]
9 /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0xf00517) [0x7f5f6eac8517]
10 /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0xf028f3) [0x7f5f6eaca8f3]
11 /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0xa1a295) [0x7f5f6e5e2295]
12 python(PyCFunction_Call+0x54) [0x55ac8f38bf44]
13 python(_PyObject_MakeTpCall+0x31e) [0x55ac8f39b30e]
14 python(+0x1b0f6e) [0x55ac8f410f6e]
15 python(PyObject_Call+0x5e) [0x55ac8f38516e]
16 python(+0x17f42d) [0x55ac8f3df42d]
17 python(_PyObject_MakeTpCall+0x31e) [0x55ac8f39b30e]
18 python(_PyEval_EvalFrameDefault+0x53cf) [0x55ac8f4316ff]
19 python(_PyEval_EvalCodeWithName+0x2c3) [0x55ac8f40edb3]
20 python(+0x1b08b7) [0x55ac8f4108b7]
21 python(_PyEval_EvalFrameDefault+0x4e03) [0x55ac8f431133]
22 python(_PyEval_EvalCodeWithName+0x2c3) [0x55ac8f40edb3]
23 python(+0x1b08b7) [0x55ac8f4108b7]
24 python(_PyEval_EvalFrameDefault+0x4e03) [0x55ac8f431133]
25 python(_PyEval_EvalCodeWithName+0x2c3) [0x55ac8f40edb3]
26 python(_PyFunction_Vectorcall+0x378) [0x55ac8f410198]
27 python(+0x1b1bbf) [0x55ac8f411bbf]
28 python(_PyObject_MakeTpCall+0x2eb) [0x55ac8f39b2db]
29 python(_PyEval_EvalFrameDefault+0x56cb) [0x55ac8f4319fb]
30 python(_PyEval_EvalCodeWithName+0x2c3) [0x55ac8f40edb3]
31 python(_PyFunction_Vectorcall+0x378) [0x55ac8f410198]
32 python(_PyEval_EvalFrameDefault+0x947) [0x55ac8f42cc77]
33 python(_PyEval_EvalCodeWithName+0x2c3) [0x55ac8f40edb3]
34 python(PyEval_EvalCodeEx+0x39) [0x55ac8f40fe19]
35 python(PyEval_EvalCode+0x1b) [0x55ac8f4b224b]
> Can you try the latest main branch? If you still encounter the same problem, please provide steps to reproduce the issue, thanks.
I get this warning while running cmake on main; could it cause any issue with the make step or with PyTorch?
CMake Warning at /opt/conda/lib/python3.8/site-packages/torch/share/cmake/Caffe2/public/utils.cmake:365 (message):
In the future we will require one to explicitly pass TORCH_CUDA_ARCH_LIST
to cmake instead of implicitly setting it as an env variable. This will
become a FATAL_ERROR in future version of pytorch.
Call Stack (most recent call first):
/opt/conda/lib/python3.8/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:444 (torch_cuda_get_nvcc_gencode_flag)
/opt/conda/lib/python3.8/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
/opt/conda/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:203 (find_package)
-- Added CUDA NVCC flags for: -gencode;arch=compute_80,code=sm_80
CMake Warning at /opt/conda/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
/opt/conda/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
CMakeLists.txt:203 (find_package)
For pip install tensorflow, you should use the TF docker image directly.
The warning in the PyTorch docker image is fine.
Closing this issue because it is inactive. Feel free to re-open it if you still have any problems.