wang jiahao
Hello, author. While reading your code I ran into something I find confusing. Across the Encoder and Decoder there are three attention layers in total. Assuming they all use your FullAttention module, why does the first attention layer in the Decoder set mix to True? As far as I can tell, this parameter is mainly used later, when the multiple heads are merged back into one: it swaps two of the tensor's dimensions before the merge. I don't understand why that is done here. After I changed it to False, the results got worse. Could you explain the reason for this?
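For context, here is a rough sketch of the merge step the question refers to: mix decides whether the attention output is transposed before the heads are flattened back into the model dimension. The shapes, the exact axes being swapped, and the variable names below are assumptions for illustration, not the repository's actual AttentionLayer code.

```python
import torch

# Illustrative only; shapes and the axes being swapped are assumptions.
# Suppose the attention output arrives as [batch, length, heads, d_head].
B, L, H, D = 2, 96, 8, 64
out = torch.randn(B, L, H, D)

mix = True
if mix:
    # Transpose before merging the heads, so the flatten below reads the
    # tensor in a different memory order than it would with mix=False.
    out = out.transpose(2, 1).contiguous()  # -> [batch, heads, length, d_head]

# Merge the heads back into a single dimension: [batch, length, heads * d_head]
out = out.reshape(B, L, -1)
print(out.shape)  # torch.Size([2, 96, 512])
```

With mix=False the flatten keeps each position's heads together; with mix=True the same flatten reads the data in a different order, which is the behavior the question is asking about.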
I have tried all of the CloudSuite benchmarks for arm64. There are a lot of problems: almost none of the benchmarks can be run with the official code. Web-Serving has a php-fpm problem...
### Name and Version

/build/bin/llama-quantize /mnt/data/model/Moonlight-16B-A3B-Instruct/Moonlight-16B-A3B-Instruct-BF16.gguf /mnt/data/model/Moonlight-16B-A3B-Instruct/Moonlight-16B-A3B-Instruct-Q4_K_M.gguf Q4_K_M

### Operating systems

Linux

### GGML backends

CUDA

### Hardware

RTX 4090D

### Models

Moonlight-16B-A3B-Instruct

### Problem description & steps to reproduce

...