Hongyi Jin

Results 3 issues of Hongyi Jin

Vicuna v0's vocab_size is 32001, but v1's vocab_size is 32000, so we need to update the manual schedule.

This PR enables weight compression on the GPU. Previously, weight compression was run on the CPU because the uncompressed weight is too large to fit in GPU memory, and running on CPU...

1. Add a dlight rule, LowBatchGEMV, to schedule low-batch GEMM just like GEMV.
2. Fix some issues when lowering low-batch GEMM.
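
To illustrate why a low-batch GEMM can be treated like a GEMV, here is a minimal pure-Python sketch. It is not the dlight rule itself and does not use TVM; the helper names (`gemv`, `low_batch_gemm`, `reference_gemm`) are hypothetical and exist only for this illustration. It assumes the weight is stored as an N x K matrix and that the batch contains only a few input rows, so the product reduces to one GEMV per row.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def gemv(w: Matrix, x: Vector) -> Vector:
    """Matrix-vector product: w is N x K, x has length K, result has length N."""
    return [sum(w_ik * x_k for w_ik, x_k in zip(row, x)) for row in w]


def low_batch_gemm(xs: Matrix, w: Matrix) -> Matrix:
    """Compute xs @ w^T for a small batch by running one GEMV per input row.

    xs is B x K with B small; w is N x K; the result is B x N.
    """
    return [gemv(w, x) for x in xs]


def reference_gemm(xs: Matrix, w: Matrix) -> Matrix:
    """Plain triple-loop GEMM (xs @ w^T), used only to check the sketch above."""
    batch, n, k = len(xs), len(w), len(w[0])
    out = [[0.0] * n for _ in range(batch)]
    for b in range(batch):
        for i in range(n):
            for j in range(k):
                out[b][i] += xs[b][j] * w[i][j]
    return out


# Illustrative values only: a batch of 2 rows, weight of shape 3 x 2.
w_example: Matrix = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
xs_example: Matrix = [[1.0, 0.0], [0.0, 1.0]]
result = low_batch_gemm(xs_example, w_example)
```

Both functions produce the same B x N result; the point of the sketch is only that, when B is small, the work is the same as a handful of GEMVs over one shared weight matrix.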