dongluw
https://github.com/NVIDIA/FasterTransformer/blob/c6e8f60ec40da218804a60e6aa986903e7fa8594/src/fastertransformer/models/multi_gpu_gpt/ParallelGptWeight.cc#L259 here `max_seq_len_ * vocab_size_` elements are allocated and copied for `weights_ptr[0]` (the position embeddings), but when loading the weights only `max_seq_len_ * hidden_units_` is used: https://github.com/NVIDIA/FasterTransformer/blob/c6e8f60ec40da218804a60e6aa986903e7fa8594/src/fastertransformer/models/multi_gpu_gpt/ParallelGptWeight.cc#L299 If that is the case, we might allocate more memory than necessary, since `vocab_size_` is usually much larger than `hidden_units_`.
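A rough back-of-the-envelope calculation of what the mismatch would cost, assuming hypothetical GPT-style sizes (the real values depend on the model config) and fp16 weights:

```python
# Hypothetical GPT-style configuration; actual values depend on the model.
max_seq_len  = 2048
vocab_size   = 50_257
hidden_units = 4096   # hidden_units_ = head_num * size_per_head
bytes_per_el = 2      # fp16

allocated = max_seq_len * vocab_size   * bytes_per_el  # what the L259 allocation reserves
loaded    = max_seq_len * hidden_units * bytes_per_el  # what the L299 load actually fills

print(f"allocated: {allocated / 2**20:.1f} MiB")            # ~196.3 MiB
print(f"loaded:    {loaded / 2**20:.1f} MiB")               # 16.0 MiB
print(f"unused:    {(allocated - loaded) / 2**20:.1f} MiB")  # ~180.3 MiB never written
```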
Before the attention operation, q, k, and v are packed into one big tensor `qkv`; I would like to do some in-place operations on q and k only. Currently what I...
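To make the intent concrete, here is a minimal PyTorch sketch of the idea; the packed layout `[..., 3 * hidden]` is an assumption, and the real `qkv` may be a raw device buffer rather than a framework tensor. Slicing returns views, so in-place ops on the q and k slices write straight into the shared `qkv` storage while v stays untouched:

```python
import torch

batch, seq, hidden = 2, 16, 64                 # hypothetical sizes
qkv = torch.randn(batch, seq, 3 * hidden)      # packed [q | k | v] along the last dim

# Slices are views: they share storage with qkv, no copy is made.
q = qkv[..., 0 * hidden : 1 * hidden]
k = qkv[..., 1 * hidden : 2 * hidden]
q_before = q.clone()
v_before = qkv[..., 2 * hidden :].clone()

# In-place ops on q and k only.
q.mul_(0.5)
k.add_(1.0)

# The writes landed inside qkv itself, and v is unchanged.
assert torch.allclose(qkv[..., :hidden], q_before * 0.5)
assert torch.equal(qkv[..., 2 * hidden :], v_before)
```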
There are two definitions of `gen_random_start_ids` in tools/utils/utils.py: https://github.com/triton-inference-server/tensorrtllm_backend/blob/ae52bce3ed8ecea468a16483e0dacd3d156ae4fe/tools/utils/utils.py#L238-L248 and https://github.com/triton-inference-server/tensorrtllm_backend/blob/ae52bce3ed8ecea468a16483e0dacd3d156ae4fe/tools/utils/utils.py#L270-L280
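For what it's worth, a repeated `def` at module level simply rebinds the name, so only the second `gen_random_start_ids` is ever called and the first is dead code; a tiny illustration of the shadowing:

```python
def gen_random_start_ids():      # first definition
    return "first"

def gen_random_start_ids():      # second definition rebinds the name
    return "second"

print(gen_random_start_ids())    # prints "second" -- the first definition is unreachable
```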