Dachuan Shi
1. `--w_sp_attn` and `--w_sp_mlp` control the loss of the learnable masks on the attention and FFN modules, respectively. Two principles were followed: (1) the two loss values are equal at the start of training (both values are also printed during training); (2) by the end of the search stage, they are on the same order of magnitude as the model's original loss. The exact values were not carefully tuned, so other settings are worth trying.
2. `epochs-search` depends on the task. The multimodal tasks originally use relatively few training epochs, so the search stage simply uses the same number of epochs as training. The unimodal tasks originally use many training epochs, so the search stage uses only about 1/5 of them. If resources allow, a larger `epochs-search` is generally better. `interval` is explained in the paper: it is the number of iterations between consecutive updates of the learnable mask parameters. A reasonable choice is roughly the number of iterations corresponding to 1% of the compression ratio. For example, to reach 50% compression within 1000 iterations, set it to around 1000/50 = 20.
3. UPop performs structured pruning, and this number makes each position of the learnable masks for attention and FFN correspond to the same number of actual parameters. `9234/769` is obtained as follows (a quick arithmetic check is sketched after this list):
   * Numerator (attn): [384 (parameters in one input row of qkv in attn) + 1 (the corresponding bias parameter)] $\times$ [1 (query) + 1 (key) + 1 (value)] $\times$ 6 (number of heads) + 384 (parameters in one output column of proj in attn) $\times$ 6 (number of heads) = (384+1) $\times$ (1+1+1) $\times$ 6 + 384 $\times$ 6 =...
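
As a quick sanity check of the arithmetic above, a minimal Python sketch (the derivation of the denominator 769 is truncated in the original answer and therefore omitted here):

```python
# Sanity check for the numerator of 9234/769 (attention side).
# 385 = 384 weights in one qkv input row + 1 bias; 3 = query/key/value; 6 = heads;
# 384 * 6 = parameters in one proj output column, per head.
attn_params_per_mask_position = (384 + 1) * (1 + 1 + 1) * 6 + 384 * 6
print(attn_params_per_mask_position)  # 9234

# Heuristic for `interval`: iterations corresponding to ~1% of the compression ratio.
total_iterations, target_compression_percent = 1000, 50  # example from the answer
print(total_iterations // target_compression_percent)    # 20
```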
Any Transformer-based model should work. Just pay attention to the shape of the initialized mask parameters: they only need to multiply correctly onto the corresponding values during the forward pass.
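
A minimal sketch of what this could look like, using a hypothetical `MaskedMLP` with one mask entry per hidden unit (names and shapes are illustrative, not the UPop implementation):

```python
import torch
import torch.nn as nn

class MaskedMLP(nn.Module):
    """Toy FFN block with a learnable mask multiplied onto the hidden activations."""
    def __init__(self, dim=384, hidden_dim=1536):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)
        # One mask entry per hidden unit; shape (1, 1, hidden_dim) broadcasts over
        # (batch, tokens, hidden_dim) activations in the forward pass.
        self.mask = nn.Parameter(torch.ones(1, 1, hidden_dim))

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)) * self.mask)

x = torch.randn(2, 16, 384)
print(MaskedMLP()(x).shape)  # torch.Size([2, 16, 384])
```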
Just don't clear the gradients manually. The mask parameters are not included in the original model's optimizer, e.g.: https://github.com/sdc17/UPop/blob/6aae798a9a576cf001ab1ca27b5afc15cbeeda46/compress_retrieval_clip.py#L278-L282 , so their gradients `.grad` accumulate automatically as the iterations proceed.
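
A minimal sketch of this behavior (hypothetical names; only `model`'s parameters are passed to the optimizer, so `optimizer.zero_grad()` never touches the mask's gradient):

```python
import torch

model = torch.nn.Linear(8, 8)
mask = torch.nn.Parameter(torch.ones(8))           # not registered with the optimizer
optimizer = torch.optim.AdamW(model.parameters())  # only the original model's parameters

for step in range(4):
    x = torch.randn(4, 8)
    loss = (model(x) * mask).pow(2).mean()
    optimizer.zero_grad()   # clears model grads only; mask.grad is untouched
    loss.backward()         # mask.grad keeps accumulating across iterations
    optimizer.step()
    print(step, mask.grad.abs().sum().item())  # accumulated gradient, never reset here
```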
BTW, llama2-13b-chat-hf and llama2-70b-chat-hf models ran into the same mismatch problem.
With the fvcore package. See here: https://github.com/sdc17/UPop/blob/6aae798a9a576cf001ab1ca27b5afc15cbeeda46/deit/main.py#L343-L350
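
For reference, a minimal usage sketch with fvcore's `FlopCountAnalysis` (placeholder model and input, not the exact code at the linked lines):

```python
import torch
from fvcore.nn import FlopCountAnalysis

model = torch.nn.Linear(224, 10)    # placeholder; use the (pruned) DeiT model here
dummy_input = torch.randn(1, 224)   # placeholder; use a (1, 3, 224, 224) image for DeiT
flops = FlopCountAnalysis(model, dummy_input)
print(flops.total())                # total FLOPs for one forward pass
```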
Hi, your understanding is correct! In fact, we mentioned this in footnote 3 on page 9.
Hi, I don't remember there being an issue when I calculated FLOPs for BLIP retrieval. If this is indeed an issue with negative samples, perhaps you can modify the input...
Hi, you can delete or comment out anything related to "petrel", such as: https://github.com/sdc17/UPop/blob/6aae798a9a576cf001ab1ca27b5afc15cbeeda46/clip/clip.py#L16 . This is an optional package used for loading data; the code will also work without it.
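
If you prefer not to delete it, a hedged alternative (assuming the petrel-related code is just an import of `petrel_client`, which may not match the exact line referenced above) is to make the import optional:

```python
# Make the optional petrel dependency a soft import instead of deleting it.
try:
    from petrel_client.client import Client  # only needed for petrel-based data loading
except ImportError:
    Client = None  # fall back to local file loading when petrel_client is unavailable
```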
Hi, Torch 1.x is recommended. https://github.com/sdc17/UPop/blob/6aae798a9a576cf001ab1ca27b5afc15cbeeda46/README.md?plain=1#L76
Hi, we used the [calflops](https://github.com/MrYxJ/calculate-flops.pytorch) package to calculate the FLOPs for LLaVA.
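
A minimal usage sketch with calflops' `calculate_flops` (placeholder model and input shape, not the exact LLaVA setup):

```python
import torch
from calflops import calculate_flops

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())  # placeholder model
flops, macs, params = calculate_flops(model=model, input_shape=(1, 1024))
print(flops, macs, params)  # human-readable FLOPs, MACs, and parameter count
```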