mzchtx

Results 4 issues of mzchtx

Inspired by the MixMatch paper, mixup can also be applied to the supervised data. This improves performance, even beyond the native UDA. ![image](https://user-images.githubusercontent.com/37564754/63161916-db905a80-c053-11e9-80e3-81e2497a6afb.png)
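A minimal sketch of mixup on a labeled batch, written in PyTorch for illustration. The function name `mixup_batch`, the default `alpha=0.75`, and the assumption that `x` is a float feature tensor (for example images or embeddings) with one-hot labels are all hypothetical and not taken from the issue. Taking `max(lam, 1 - lam)` follows the MixMatch variant of mixup, which keeps each mixed example closer to its original.

```python
import torch


def mixup_batch(x, y_onehot, alpha=0.75):
    """Mix every example with a randomly chosen partner from the same batch.

    x:        float tensor of shape (batch, ...), e.g. images or embeddings
    y_onehot: float tensor of shape (batch, num_classes)
    """
    # Draw one mixing coefficient for the whole batch from Beta(alpha, alpha).
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    # MixMatch-style: keep the mixed example closer to the original one.
    lam = max(lam, 1.0 - lam)

    # Pair each example with another example from the batch.
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mixed, y_mixed, lam


# Toy labeled batch: 4 examples, 3 features, 2 classes.
x = torch.randn(4, 3)
y = torch.nn.functional.one_hot(torch.tensor([0, 1, 1, 0]), num_classes=2).float()
x_mixed, y_mixed, lam = mixup_batch(x, y)
```

The mixed labels are soft targets, so the supervised loss has to accept probability vectors rather than class indices.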

We found that CV-CUDA's resize computes the coordinate mapping differently from OpenCV (as shown in the pseudo-code in the figure below): - [OpenCV uses...

question
? - needs triage

I think we can slice `k`, `v` and `mask` before calling `F.scaled_dot_product_attention()` to reduce the computation. Otherwise attention is computed over the full `max_seq_len` even when `input_pos` is relatively small...
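A rough sketch of the slicing idea, assuming a key/value cache of shape `(batch, n_head, max_seq_len, head_dim)` and a boolean causal mask of shape `(1, 1, max_seq_len, max_seq_len)`. The shapes, the helper name `attend_sliced`, and the toy sizes are assumptions for illustration, not the repository's code. Positions beyond the last filled index are masked out anyway, so dropping them should not change the result.

```python
import torch
import torch.nn.functional as F


def attend_sliced(q, k_cache, v_cache, mask, input_pos):
    """Attend only over cache positions [0, last_pos] instead of max_seq_len."""
    end = int(input_pos[-1]) + 1
    return F.scaled_dot_product_attention(
        q,
        k_cache[:, :, :end],
        v_cache[:, :, :end],
        attn_mask=mask[..., :end],
    )


torch.manual_seed(0)
B, n_head, max_seq_len, head_dim = 1, 2, 16, 8

# Hypothetical cache in which only the first 5 positions have been written.
filled = 5
k_cache = torch.zeros(B, n_head, max_seq_len, head_dim)
v_cache = torch.zeros(B, n_head, max_seq_len, head_dim)
k_cache[:, :, :filled] = torch.randn(B, n_head, filled, head_dim)
v_cache[:, :, :filled] = torch.randn(B, n_head, filled, head_dim)

# Boolean causal mask; True means the position takes part in attention.
mask_cache = torch.tril(torch.ones(max_seq_len, max_seq_len, dtype=torch.bool))[None, None]

# Decode step for the token at position 4.
input_pos = torch.tensor([filled - 1])
q = torch.randn(B, n_head, 1, head_dim)
mask = mask_cache.index_select(2, input_pos)  # (1, 1, 1, max_seq_len)

full = F.scaled_dot_product_attention(q, k_cache, v_cache, attn_mask=mask)
sliced = attend_sliced(q, k_cache, v_cache, mask, input_pos)
```

Here `full` and `sliced` should agree up to floating-point tolerance, while the sliced call only touches 5 of the 16 cache positions.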

I think we can use `index_copy_`, the in-place version of `index_copy`, to avoid the extra cache creation and copying https://github.com/Lightning-AI/lit-llama/blob/main/lit_llama/model.py#L217-L218 ![image](https://github.com/Lightning-AI/lit-llama/assets/37564754/34001d79-ccea-4bc2-89ea-d947c10a0ebe)
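To make the difference concrete, here is a small comparison of `Tensor.index_copy` and `Tensor.index_copy_` on a toy cache. The cache shape and variable names are assumptions for illustration and are not copied from the linked lines.

```python
import torch

torch.manual_seed(0)
B, n_head, max_seq_len, head_dim = 1, 2, 16, 8

cache = torch.zeros(B, n_head, max_seq_len, head_dim)
new_k = torch.randn(B, n_head, 1, head_dim)  # key for one new token
input_pos = torch.tensor([3])                # position to write into

# Out-of-place: allocates a new tensor and copies the whole cache into it.
updated = cache.index_copy(2, input_pos, new_k)
allocates_new = updated.data_ptr() != cache.data_ptr()

# In-place: writes the new slice directly into the existing cache storage.
ptr_before = cache.data_ptr()
cache.index_copy_(2, input_pos, new_k)
same_storage = cache.data_ptr() == ptr_before
```

Both calls produce the same values, but only the in-place one reuses the cache memory. Because it mutates the tensor, any other reference to the cache sees the update as well, which is usually what is wanted during inference.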