TM

Results: 3 issues by TM

Dear developers, I am trying to reproduce the [BERT benchmarking result](https://github.com/Tencent/TurboTransformers/blob/master/docs/bert.md) on my machine. ![image](https://user-images.githubusercontent.com/4970790/87907044-682e6e80-ca96-11ea-9826-3f2ec46318c2.png) I just ran `bash run_gpu_benchmark.sh`, but the QPS is much lower than the reported value...

bug

Hi all, I have been using DALI with PyTorch recently and am impressed by its excellent performance. Currently, all of my training data has to be placed on local SSD storage...

enhancement
external contribution welcome

The dynamic shared memory of the [GEMV](https://github.com/mit-han-lab/llm-awq/blob/main/awq/kernels/csrc/quantization_new/gemv/gemv_cuda.cu#L103) kernel is not allocated when the kernel is launched, which causes an illegal memory access error. This pull request fixes the issue by specifying the shared memory size...
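
For readers unfamiliar with this class of bug, here is a minimal sketch of how dynamic shared memory is requested at launch time in CUDA. It is not the AWQ GEMV kernel or the actual patch: `block_sum`, `smem_bytes`, and the sizes used are hypothetical names and values chosen for illustration. The point it shows is that an `extern __shared__` array has zero bytes unless the third launch parameter supplies a size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (NOT the AWQ GEMV kernel): each block sums its slice
// of `in` using a dynamically sized shared-memory buffer.
__global__ void block_sum(const float* in, float* out, int n) {
    // The size of this array comes from the third launch parameter.
    extern __shared__ float buf[];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;
    buf[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory; assumes blockDim.x is a power of two.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            buf[tid] += buf[tid + stride];
        }
        __syncthreads();
    }
    if (tid == 0) {
        out[blockIdx.x] = buf[0];
    }
}

int main() {
    const int n = 1024;
    const int threads = 256;  // power of two, as the reduction assumes
    const int blocks = (n + threads - 1) / threads;

    float* in = nullptr;
    float* out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) {
        in[i] = 1.0f;
    }

    // One float of shared memory per thread. Launching with only
    // <<<blocks, threads>>> would request zero bytes of dynamic shared
    // memory, so every access to `buf` would be out of bounds.
    size_t smem_bytes = threads * sizeof(float);
    block_sum<<<blocks, threads, smem_bytes>>>(in, out, n);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) {
        total += out[b];
    }
    printf("total = %.1f\n", total);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Checking the result of `cudaDeviceSynchronize()` after the launch is what surfaces an illegal memory access as an error code; without that check, the failure tends to show up later at an unrelated call.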