mzchtx issues

Results: 4 issues of mzchtx

Using mixup in supervised data

Inspired by the MixMatch paper, mixup can also be applied to the supervised data. This improves performance, giving results even better than the native UDA. ![image](https://user-images.githubusercontent.com/37564754/63161916-db905a80-c053-11e9-80e3-81e2497a6afb.png)
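As a rough illustration of the idea, here is a minimal sketch of mixup on a pair of labelled examples, using only the Python standard library. The function name `mixup`, the list-based inputs, and the default `alpha=0.75` are assumptions made for this example and are not taken from the UDA code base. The `max(lam, 1 - lam)` step follows the MixMatch variant of mixup, which keeps the mixed example closer to the first input.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.75, rng=random):
    """Mix two labelled examples (hypothetical helper, not part of UDA).

    x1, x2: feature vectors as lists of floats.
    y1, y2: one-hot (or soft) label vectors as lists of floats.
    """
    # Sample the mixing coefficient from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)
    # MixMatch variant: keep the result closer to the first example.
    lam = max(lam, 1.0 - lam)
    # Convex combination of inputs and of labels with the same coefficient.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


random.seed(0)
x_mix, y_mix, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
```

The mixed label stays a valid distribution because both inputs and labels are combined with the same coefficient. In a real training loop the same operation would be applied per batch on tensors rather than on Python lists.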

[QUESTION] Difference between CV-CUDA resize and OpenCV resize

We found that CV-CUDA's resize calculates the coordinate mapping differently from OpenCV's (as shown in the pseudo-code in the figure below): - [OpenCV uses...

question

? - needs triage

Slice before F.scaled_dot_product_attention() to improve the performance

I think we can slice k, v, and the mask before calling `F.scaled_dot_product_attention()` to reduce the computation; otherwise the cost is the same as for max_seq_len even when input_pos is relatively small...
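The following is a minimal sketch of that idea, not the lit-llama implementation. It assumes a key/value cache of shape `(batch, heads, max_seq_len, head_dim)`, a boolean causal mask where `True` means "may attend", and single-token decoding; all tensor names and sizes are hypothetical. It checks that attention over the sliced cache matches attention over the full cache.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, max_seq_len, D = 1, 2, 16, 8  # hypothetical sizes

# Pre-allocated cache; random values stand in for real keys and values.
k_cache = torch.randn(B, H, max_seq_len, D)
v_cache = torch.randn(B, H, max_seq_len, D)

# Boolean causal mask: True marks positions a query may attend to.
causal = torch.tril(torch.ones(max_seq_len, max_seq_len, dtype=torch.bool))

input_pos = torch.tensor([3])  # decoding the token at position 3
q = torch.randn(B, H, 1, D)

# Baseline: attend over the full cache; masked positions get zero weight.
mask_full = causal[input_pos]  # shape (1, max_seq_len)
out_full = F.scaled_dot_product_attention(
    q, k_cache, v_cache, attn_mask=mask_full
)

# Sliced: keep only the positions up to and including input_pos.
end = int(input_pos[-1]) + 1
out_sliced = F.scaled_dot_product_attention(
    q,
    k_cache[:, :, :end],
    v_cache[:, :, :end],
    attn_mask=mask_full[:, :end],
)

same = torch.allclose(out_full, out_sliced, atol=1e-5)
```

Because positions beyond `input_pos` are masked out anyway, dropping them does not change the result, while the attention scores are computed over `end` keys instead of `max_seq_len`. One trade-off to keep in mind: the slice length changes at every decoding step, which may interact with compilation or fixed-shape backends.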

Use index_copy_ to reduce memory copies

I think we can use `index_copy_`, the in-place version of `index_copy`, to avoid the extra cache creation and copying: https://github.com/Lightning-AI/lit-llama/blob/main/lit_llama/model.py#L217-L218 ![image](https://github.com/Lightning-AI/lit-llama/assets/37564754/34001d79-ccea-4bc2-89ea-d947c10a0ebe)
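To make the difference concrete, here is a small sketch comparing the two calls on a toy cache. The tensor names and shapes are assumptions for illustration and do not reproduce the linked lit-llama lines.

```python
import torch

max_seq_len, D = 8, 4                    # hypothetical sizes
cache = torch.zeros(1, max_seq_len, D)   # (batch, seq, dim)
new_k = torch.ones(1, 1, D)              # key for the current token
input_pos = torch.tensor([2])            # position to write to

# Out-of-place: returns a newly allocated tensor; `cache` is left unchanged.
updated = cache.index_copy(1, input_pos, new_k)

# In-place: writes into the existing storage, so no second cache is created.
ptr_before = cache.data_ptr()
cache.index_copy_(1, input_pos, new_k)
same_storage = cache.data_ptr() == ptr_before
```

Both calls produce the same values; the in-place version simply reuses the memory already held by `cache`, which matters when the cache spans `max_seq_len` positions and is updated on every decoding step.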